
Understanding the New Categorical

By Thijs Nieuwdorp on Thu, 29 Jan 2026

Polars is rapidly improving under the hood. Last year we built a new streaming engine, and we are working hard on a high-performance distributed engine for the Polars API. This put a strain on earlier design decisions for the Categorical data type.

Polars’ streaming engine processes data in parallel morsels (chunks). Every morsel carried its own mapping table, forcing constant syncing and re-encoding. The alternative, a global StringCache, created locks and pauses in the streaming pipeline. And in a distributed architecture this “global” per-process approach was not truly global at all, which is something we wanted to avoid.

Because of this, we rebuilt how the Categorical data type works (#23016). In this post we’ll explain why, and what we did to fix it.

The Initial Implementation

Categorical is a data type for columns that hold string values drawn from a limited set of possibilities. It is useful when the number of unique values is much smaller than the total number of entries in the column, because the column can then be stored more efficiently than plain Strings: each entry is stored as a smaller numerical type, and a mapping is kept from this underlying “physical type” (like an 8-bit unsigned integer) to the original String representation.

Let’s create a DataFrame containing a Series with the Categorical type. We’ll do this in Polars 1.31.0, which is the release before the reworked Categorical.

import polars as pl

bears = (
    pl.DataFrame({
        "bears": ["Polar", "Grizzly", "Sun", "Polar"]
    })
    .cast(
        pl.Categorical()
    )
)
print(
    bears.with_columns(
        pl.col("bears").to_physical().alias("physical")
    )
)

# Output
shape: (4, 2)
┌─────────┬──────────┐
│ bears   ┆ physical │
│ ---     ┆ ---      │
│ cat     ┆ u32      │
╞═════════╪══════════╡
│ Polar   ┆ 0        │
│ Grizzly ┆ 1        │
│ Sun     ┆ 2        │
│ Polar   ┆ 0        │
└─────────┴──────────┘

In this DataFrame the bears Series is of the data type Categorical. It has its own mapping between values of the physical type (uint32) and their String representation. For example, you can see that the Polar bear is encoded with a 0.

When you created another DataFrame with a Categorical column, a new mapping was created and stored locally with the new Series.

other_bears = (
    pl.DataFrame({
        "bears": ["Brown", "Spectacled", "Polar", "Sloth"]
    })
    .cast(
        pl.Categorical()
    )
)

print(
    other_bears.with_columns(
        pl.col("bears").to_physical().alias("physical")
    )
)

# Output
shape: (4, 2)
┌────────────┬──────────┐
│ bears      ┆ physical │
│ ---        ┆ ---      │
│ cat        ┆ u32      │
╞════════════╪══════════╡
│ Brown      ┆ 0        │
│ Spectacled ┆ 1        │
│ Polar      ┆ 2        │
│ Sloth      ┆ 3        │
└────────────┴──────────┘

Here we see that the Polar bear is represented with a physical value of 2.

When you tried to combine these two DataFrames, you ran into the following:

pl.concat([bears, other_bears]).with_columns(
    pl.col("bears").to_physical().alias("physical")
)
<sys>:0: CategoricalRemappingWarning: Local categoricals have different encodings, 
expensive re-encoding is done to perform this merge operation. 
Consider using a StringCache or an Enum type if the categories are known in advance

# Output
shape: (8, 2)
┌────────────┬──────────┐
│ bears      ┆ physical │
│ ---        ┆ ---      │
│ cat        ┆ u32      │
╞════════════╪══════════╡
│ Polar      ┆ 0        │
│ Grizzly    ┆ 1        │
│ Sun        ┆ 2        │
│ Polar      ┆ 0        │
│ Brown      ┆ 3        │
│ Spectacled ┆ 4        │
│ Polar      ┆ 0        │
│ Sloth      ┆ 5        │
└────────────┴──────────┘

In order to combine these Categoricals, their mappings needed to be synced to make sure every category was properly represented in the new merged mapping. Since this re-encoding is an expensive operation, Polars threw a warning, advising the use of a StringCache: a global mapping that all DataFrames can tap into.

Using the StringCache, physical values were synced across different DataFrames:

with pl.StringCache():
    bears = ...        # constructed as in the first example above
    other_bears = ...  # constructed as in the second example above
    print(
        bears.with_columns(
            pl.col("bears").to_physical().alias("physical")
        ),
        other_bears.with_columns(
            pl.col("bears").to_physical().alias("physical")
        )
    )

# Output
shape: (4, 2)
┌─────────┬──────────┐
│ bears   ┆ physical │
│ ---     ┆ ---      │
│ cat     ┆ u32      │
╞═════════╪══════════╡
│ Polar   ┆ 0        │
│ Grizzly ┆ 1        │
│ Sun     ┆ 2        │
│ Polar   ┆ 0        │
└─────────┴──────────┘
shape: (4, 2)
┌────────────┬──────────┐
│ bears      ┆ physical │
│ ---        ┆ ---      │
│ cat        ┆ u32      │
╞════════════╪══════════╡
│ Brown      ┆ 3        │
│ Spectacled ┆ 4        │
│ Polar      ┆ 0        │
│ Sloth      ┆ 5        │
└────────────┴──────────┘

As you can see, the Polar category is now in sync across both DataFrames, represented as 0 in both. The StringCache kept the values in sync, allowing different Series to be combined without re-encoding the Categorical. This solved the problem of keeping mappings in sync, although the StringCache still had some inefficiencies in the way it was built. These inefficiencies were exposed by the streaming engine.

The Incompatibilities

The streaming engine processes data in chunks, called morsels. It chops the input data into morsels that are processed individually in a pipeline. Among other things, this allows Polars to start processing data while it is still reading the rest from disk or the network.

With the old local Categorical implementation (where the mapping was kept per Series), every morsel would carry its own local mapping between values of the physical type and their String representation. Categorical encodings depend on the order of the input, and the ordering of values in these morsels was almost always different, so the local mappings were almost always out of sync. They had to be re-encoded constantly whenever morsels were recombined into intermediate or final results, which caused a heavy performance hit.

The alternative, a global StringCache, was also under pressure. Because of the parallel nature of Polars, many morsels were processed at the same time. Every time they were processed they would read or update the mapping in the global StringCache, which required grabbing locks on the object, causing pauses in the streaming pipeline. This effect was amplified when working with multiple Categorical columns under one global StringCache, because they all pointed to the same mapping. This made Categoricals very slow to work with. Besides that, the StringCache didn’t work with a distributed architecture.

The New Implementation

To make Categoricals compatible with the streaming engine, we introduced a new implementation based on these lessons learned.

When creating a Categorical column, you can now provide a pl.Categories() object to define the mapping between physical values and their String representation.

bear_categories = pl.Categories(name="bears", namespace="org.polars", physical=pl.UInt8)
bears = (
    pl.DataFrame({
        "bears": ["Polar", "Grizzly", "Sun", "Polar"]
    })
    .cast(pl.Categorical(bear_categories))
)

Here you can see that Categories takes the following arguments:

  • name: The name of this set of categories.
  • namespace: The namespace of the categories, in case you want multiple Categories with the same name but different categories.
  • physical: The physical type used to encode the categories. This can be pl.UInt32 (default), which can encode over 4.2 billion categories; pl.UInt16, which can encode 65,535; or pl.UInt8, which can encode 255. Choosing the right physical type impacts the memory usage and performance of your application.

They work like the global StringCache used to, but there are some key differences:

They can be named and given a namespace to allow multiple mappings to coexist. The context wrapper (with pl.StringCache():) is also no longer required.

Categories match when they have the same name, namespace, and physical backing type, even if they are created in separate calls to Categories. Their values are ordered lexicographically (alphabetically) based on the String representation. When no Categories object is provided, the Categorical will make use of a global Categories mapping. This global mapping shares its categories with other global Categories objects, much like how the StringCache used to work.
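
To illustrate the matching behavior, here is a small sketch using the pl.Categories API shown above. Two DataFrames created independently, but with identically defined Categories, can be combined without any re-encoding warning:

import polars as pl

# Two Categories objects defined in separate calls, but with the same
# name, namespace, and physical type: they refer to the same mapping.
cats_a = pl.Categories(name="bears", namespace="org.polars", physical=pl.UInt8)
cats_b = pl.Categories(name="bears", namespace="org.polars", physical=pl.UInt8)

bears = (
    pl.DataFrame({"bears": ["Polar", "Grizzly", "Sun", "Polar"]})
    .cast(pl.Categorical(cats_a))
)
other_bears = (
    pl.DataFrame({"bears": ["Brown", "Spectacled", "Polar", "Sloth"]})
    .cast(pl.Categorical(cats_b))
)

# Both columns share one mapping, so concatenating them does not trigger
# the CategoricalRemappingWarning or any expensive re-encoding.
print(pl.concat([bears, other_bears]))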

Under the hood this constructs a CategoricalMapping. This is a bidirectional mapping that ties each value of the physical type to its String representation. It uses a custom string interner that enables parallel updates to and reads from the mapping. This means multiple threads can add new strings to the mapping without having to wait for each other (as long as they are different strings), and if a string already exists in the mapping, the threads cause no contention at all. Neither of these things was possible with the old StringCache implementation.
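
Conceptually, you can think of a CategoricalMapping as the pair of lookups in the toy sketch below. This is only a hypothetical, single-threaded Python illustration; the actual implementation is a concurrent string interner in Rust that lets threads insert and read in parallel.

# Hypothetical, single-threaded illustration of the idea only; the real
# CategoricalMapping is a concurrent string interner written in Rust.
class ToyCategoricalMapping:
    def __init__(self):
        self.str_to_phys: dict[str, int] = {}  # String -> physical code
        self.phys_to_str: list[str] = []       # physical code -> String

    def intern(self, value: str) -> int:
        # Return the physical code for a string, adding it if it is new.
        code = self.str_to_phys.get(value)
        if code is None:
            code = len(self.phys_to_str)
            self.str_to_phys[value] = code
            self.phys_to_str.append(value)
        return code

    def lookup(self, code: int) -> str:
        # Return the String representation for a physical code.
        return self.phys_to_str[code]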

And to top it off, elements of a CategoricalMapping are automatically garbage collected. This prevents the mapping from growing indefinitely when no data references its entries anymore, which is what happened with the StringCache if you didn’t manually release it.

The redesign incorporates our lessons learned and now allows for performant streaming engine support.

Enum

The Enum type has always been strongly linked to the Categorical. In a Categorical the categories are dynamic: they are updated on the fly when values that are not yet encoded appear in the Series.

In an Enum, however, the categories are predefined and cannot be changed, which brings performance advantages. In the new implementation, Enums use a FrozenCategoricalMapping, which is defined once and immutable after that.
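
Because the mapping is frozen, a value that is not one of the predefined categories cannot be encoded: casting it raises an error instead of silently extending the mapping the way a Categorical would. A small sketch (the exact error message may vary between versions):

import polars as pl

levels = pl.Enum(["Critical", "Warning", "Info", "Debug"])

# "Trace" is not a predefined category, so the cast fails rather than
# adding a new category on the fly.
try:
    pl.Series(["Info", "Trace"]).cast(levels)
except pl.exceptions.InvalidOperationError as err:
    print(err)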

On top of that, Enums allow you to define the ordering of the categories. This makes sorting cheaper, because it is done on the physical representation instead of the String representation, which is more expensive to sort. You can see how it works in the example below with log levels:

log_levels = pl.Enum(["Critical", "Warning", "Info", "Debug"])
df = pl.DataFrame(
    {
        "log": ["Query took longer than usual", 
            "Finished downloading data", "Finished query", 
            "Result length: 50023", "Result had null values"],
        "level": ["Warning", "Info", "Info", "Debug", "Critical"]
    },
    schema={"log": pl.String, "level": log_levels}
)
print(df.sort("level"))

# Output
shape: (5, 2)
┌──────────────────────────────┬──────────┐
│ log                          ┆ level    │
│ ---                          ┆ ---      │
│ str                          ┆ enum     │
╞══════════════════════════════╪══════════╡
│ Result had null values       ┆ Critical │
│ Query took longer than usual ┆ Warning  │
│ Finished downloading data    ┆ Info     │
│ Finished query               ┆ Info     │
│ Result length: 50023         ┆ Debug    │
└──────────────────────────────┴──────────┘
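
Because the Enum carries this ordering, comparisons also operate on the physical representation. For example, a sketch continuing the DataFrame above (assuming a comparison against a string literal resolves it to its category, as it does in current Polars):

# Keep only log records at Warning severity or higher; "Critical" comes
# before "Warning" in the Enum's defined ordering.
print(df.filter(pl.col("level") <= "Warning"))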

Wrap-up

The new Categorical implementation addresses the core performance bottlenecks that made the old design incompatible with Polars’ streaming engine and distributed architectures. By introducing Categories objects with contention-free string interning and automatic garbage collection, we’ve eliminated the expensive re-encoding operations and lock contention that plagued the previous implementation. In distributed queries, re-encoding does remain necessary, but Categoricals can still be worth it because they can drastically reduce shuffle sizes; the recommendation remains to prefer Enum if you know the categories up front.

The redesign gives you more control over how Categoricals are managed in your pipelines. You can now define named categories with specific namespaces and physical types, making it easier to reason about memory usage and performance. And if you’re working with predefined, ordered categories, the Enum type provides an even more performant option, which we recommend.

Since the update, the StringCache has become a no-op, and because Polars uses a global Categories object by default, there are no behavioral changes. If you are currently using StringCache in your code, the migration path is simple: replace the context manager with explicit Categories objects where needed, as sketched below.
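
A rough sketch of that migration (the Categories name and namespace here are purely illustrative):

import polars as pl

# Old pattern: the StringCache context manager is now a no-op.
# with pl.StringCache():
#     df = df.with_columns(pl.col("labels").cast(pl.Categorical()))

# New pattern: define an explicit Categories object and reuse it wherever
# the same mapping should apply.
label_cats = pl.Categories(name="labels", namespace="example")
df = pl.DataFrame({"labels": ["a", "b", "a"]})
df = df.with_columns(pl.col("labels").cast(pl.Categorical(label_cats)))
print(df)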

These improvements mean you can now confidently use Categoricals and Enums in streaming operations without worrying about performance degradation, making them a practical choice for large-scale data processing pipelines.
