The last couple of weeks I have been working on one of the largest refactors of my life, if not the largest. We have completely rewritten the string/binary data structure we use in Polars. This is a refactor I had wanted to do for a long time, but couldn't, as several pieces were not in place yet.
Initially, Polars was a consumer of the Arrow2 crate (a native Rust implementation of the Arrow spec). This meant that we couldn't change the string type, as the goal of the Arrow2 project was to follow the complete Arrow spec. At the end of last year we forked parts of Arrow2 into the crate polars-arrow, a trimmed-down implementation of the Arrow spec that is tuned for Polars' needs.
Now that we have the code under our control, the refactor could finally happen. As luck would have it, the Arrow spec was also finally making progress on adding the long-anticipated German-style string types to the specification, which, spoiler alert, is the type we implemented. These changes are not user facing; the only thing required from a user perspective is upgrading to the latest Polars version.
If Arrow hadn't gone through with it, we would have. The goal of Polars is to have good interop with Arrow, but we might deviate from certain types if they better suit our needs.
1. Why the change?
Let's take a step back. We haven't touched on the reason for the change yet. The string type supported by the Apache Arrow spec was defined by three buffers. If we ignore the validity buffer (which indicates whether data is valid or not), Arrow's default string type represents data as follows. In the example below we hold a DataFrame with 3 rows of artist names:
```
shape: (3, 1)
┌────────────────────────┐
│ band                   │
│ ---                    │
│ str                    │
╞════════════════════════╡
│ ABBA                   │
│ The Velvet Underground │
│ Toots And The Maytals  │
└────────────────────────┘
```
All the string data would be stored contiguously in a single data buffer, and to keep track of the start and end of each string, an extra buffer was allocated containing the offsets (yes, signed 64-bit integers, because Arrow needs to interact with Java).
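To make the layout concrete, here is a minimal Rust sketch of this offset-based representation. The struct and its field names are illustrative only (they are not the actual Polars or arrow-rs types), but the mechanics are the same: one shared byte buffer plus an offsets buffer, and one indirection per access.

```rust
/// Minimal sketch of the classic offset-based Arrow string layout.
/// Illustrative only; not the actual Polars/arrow-rs definitions.
struct OffsetStringArray {
    /// `offsets[i]..offsets[i + 1]` is the byte range of string `i`.
    /// Starts at 0 and is monotonically non-decreasing (i64 per the spec).
    offsets: Vec<i64>,
    /// All string bytes, stored back to back.
    data: Vec<u8>,
}

impl OffsetStringArray {
    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        // One indirection: read the offsets, then slice the shared buffer.
        std::str::from_utf8(&self.data[start..end]).unwrap()
    }
}

fn main() {
    let bands = OffsetStringArray {
        offsets: vec![0, 4, 26, 47],
        data: b"ABBAThe Velvet UndergroundToots And The Maytals".to_vec(),
    };
    assert_eq!(bands.value(1), "The Velvet Underground");
}
```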
1.1 The good
The positive thing about encoding string data like this is that it is rather compact. All string allocations are amortized into a single buffer. (This immediately has a downside, as it is very hard to predict the size of your allocation up front.) Per string we only have a single 64-bit integer of overhead (ignoring the starting 0). Accessing a string requires an extra indirection, but sequential traversal is all local, so this is definitely cache friendly.
1.2 The bad
As mentioned above, pre-allocating the required size of the data buffer is hard. This leads to many reallocations and memcpys while building this type. However, this downside is relatively minor. The huge (in my opinion pathological) downside of this data type is that the `gather` and `filter` operations scale linearly in time complexity with the string sizes. So if you have $n$ rows and on average $k$ bytes in your strings, materializing the output of a `join` operation requires $O(n \cdot k)$ work. This is okay if $k$ is small (which it often is), but when large string (or binary) data is stored, this gets very expensive.
Because `gather` (taking rows by index) and `filter` (taking rows by boolean mask) are core operations used by many other operations (`group-by`, `join`, window functions, etc.), having large string data could really grind a query to a halt.
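To see where the $O(n \cdot k)$ comes from, here is a rough sketch of a `gather` kernel on the offset-based layout above (again simplified, not the actual Polars kernel): every selected string's bytes have to be copied into the output buffer.

```rust
/// Sketch of `gather` on the offset-based layout (simplified, not the actual
/// Polars kernel). Every selected row forces a copy of its bytes, so for n
/// indices and an average string length of k bytes the cost is O(n * k).
fn gather(array: &OffsetStringArray, indices: &[usize]) -> OffsetStringArray {
    let mut offsets = Vec::with_capacity(indices.len() + 1);
    offsets.push(0i64);
    let mut data = Vec::new();
    for &i in indices {
        let start = array.offsets[i] as usize;
        let end = array.offsets[i + 1] as usize;
        // The expensive part: copy all bytes of the selected string.
        data.extend_from_slice(&array.data[start..end]);
        offsets.push(data.len() as i64);
    }
    OffsetStringArray { offsets, data }
}
```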
2. Hyper/Umbra style string storage
The solution we (and the Arrow spec) adopted comes from the Hyper/Umbra database systems. Here each string is stored as 16 bytes in a column, called a "view". When the string has a length of <= 12 bytes, the string data is inlined and the view contains:
- 4 bytes for the length (`u32`)
- 4 bytes containing a prefix of the string data
- 8 bytes storing the remainder of the string (zero-padded)
When the string is larger than 12 bytes, the string data is stored in a secondary buffer and the view contains the following (a rough Rust sketch of both layouts follows the list):
- 4 bytes for the length (`u32`)
- 4 bytes containing a prefix of the string data
- 4 bytes for the buffer index
- 4 bytes for the string offset (in that buffer)
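The field names and the packing helper below are mine, not the exact definitions used by Polars or the Arrow spec; the point is that every view is exactly 16 bytes, whether the string is inlined or not.

```rust
/// Rough sketch of a 16-byte string view. Field names are illustrative,
/// not the exact Polars/Arrow definitions.
#[derive(Clone, Copy)]
#[repr(C)]
struct View {
    /// Length of the string in bytes.
    length: u32,
    /// First 4 bytes of the string data (zero-padded).
    prefix: [u8; 4],
    /// For strings of <= 12 bytes: the remaining 8 bytes, inlined.
    /// For longer strings: a buffer index (u32) and an offset (u32)
    /// into that buffer, packed into these 8 bytes.
    payload: [u8; 8],
}

/// Build an inlined view for a string of at most 12 bytes (hypothetical helper).
fn make_inline_view(s: &str) -> View {
    assert!(s.len() <= 12);
    let bytes = s.as_bytes();
    let split = bytes.len().min(4);
    let mut prefix = [0u8; 4];
    let mut payload = [0u8; 8];
    prefix[..split].copy_from_slice(&bytes[..split]);
    payload[..bytes.len() - split].copy_from_slice(&bytes[split..]);
    View { length: bytes.len() as u32, prefix, payload }
}

fn main() {
    assert_eq!(std::mem::size_of::<View>(), 16);
    let view = make_inline_view("ABBA");
    assert_eq!(&view.prefix, b"ABBA");
}
```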
2.1 The good
Storing string data like this has several benefits.
- For small strings, all string data can be inlined in the views, meaning we don’t incur an indirection on access.
- Because the views are fixed width and larger string buffers can be appended, we can also mutate existing columns.
- Large strings can be interned, which can save a lot of space and allows for dictionary-encoding large string data in the default string data type. This is only a partial encoding, as we always need to allocate the views.
- Operations that produce new values from existing columns, e.g. `filter`, `gather`, and all operations built on top of them, can now copy an element in constant time (sketched below). Only the views are gathered (and sometimes updated); the buffers can remain as-is.
- Many operations can be smart with this data structure and only work on (parts of) the views. We have already optimized many of our string kernels with this, and more will follow.
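Continuing the `View` sketch above (with an equally simplified, hypothetical column type), `gather` now only copies the fixed-size views; the byte buffers holding the long strings are shared rather than copied, so the cost is $O(n)$ regardless of string length.

```rust
use std::sync::Arc;

/// Simplified view-based string column: fixed-size views plus shared byte
/// buffers. Illustrative only; the real Polars array tracks more state.
struct ViewStringArray {
    views: Vec<View>,
    buffers: Arc<Vec<Vec<u8>>>,
}

/// `gather` now copies 16 bytes per selected row, independent of the string
/// lengths: O(n) instead of O(n * k). The buffers are shared, not rewritten.
fn gather(array: &ViewStringArray, indices: &[usize]) -> ViewStringArray {
    ViewStringArray {
        views: indices.iter().map(|&i| array.views[i]).collect(),
        buffers: Arc::clone(&array.buffers),
    }
}
```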
2.2 The bad
As always, there are trade-offs. Every data structure has some downsides; for this one, however, they are no longer pathological. The downsides are:
- Storing unique strings requires slightly more space. The default Arrow string type has 8 bytes of overhead per string element. This binary view encoding has 16 bytes of overhead per long string, or 4 bytes (the length field) + (12 − string length) bytes of padding for small strings. In my opinion, this is totally worth it considering we can elide an indirection and have fixed-width storage with random access.
- What I consider the biggest downside is that we have to do garbage collection. When we `gather`/`filter` from an array with allocated long strings, we might keep strings alive that are no longer used. This requires heuristics to determine when to run a garbage collection pass on the string column, and because they are heuristics, sometimes a pass will be unneeded (a sketch of one possible heuristic follows below).
3. Benchmarks
This was a huge effort, and we are very glad it wasn't in vain. Below we show the results of various `filter` operations with different selectivities (the percentage of values that evaluate to `true`). For each plot/benchmark we prepared a DataFrame with 16 columns (equal to the number of cores) of string data. The string data was sampled with the following patterns:
- small: sampled strings between 1 and 12 bytes.
- medium: sampled strings between 1 and 201 bytes, thereby mixing inlined and externally allocated strings.
- large: sampled strings of ~500 bytes, representing large JSON objects.
We use 16 columns because earlier Polars versions masked the pathological case by throwing more parallelism into the mix. By ensuring we use 16 columns, that extra parallelism will not help.
These benchmarks were run on an m6i.4xlarge AWS instance. The results are a huge improvement. Almost all cases outperform the old string type, and the pathological cases are completely resolved. As we go from $O(n \cdot k)$ to $O(n)$, the `filter` performance is completely independent of the string length $k$.
4. More to come
Now that Polars has its Arrow-related code in its own repository, we are able to tune our memory buffers more tightly for our use case. We expect more performance increases to come from this, and we are happy with the tight control we can now exercise over our full data stack. Rewriting the string data type was one such exercise. More to come!