Polars moves fast. We typically release once a week. This process works well for us: we get features and improvements out quickly and gather user feedback while we still have the topic in cache. The downside of moving so fast is that the typical overhead of a release is kept to a minimum, since we also want time to develop! By “overhead” I mean taking the time to reflect on and share all the improvements with our users by means of a blog. So this post is an attempt to make good on that neglect. The format of this and future blog posts is to take an interval of releases (in this case 0.19.0 to 0.20.2), pick some highlights, and explain a bit more about them.
Polars 0.19.0 was released on 30 August 2023. Since then, up until Polars 0.20.2, we have made over 1000 contributions to the project (1061 to be precise), ranging from minor documentation tweaks to significant performance improvements and big features we have been working on for weeks. In this blog post you can find a selection of these changes.
Note: Several benchmarks were executed to research the performance improvements. The benchmarks ran on an m6i.4xlarge (16 vCPU, 64 GB RAM) instance on AWS. The scripts are available as GitHub Gists; you can find the links at the end of this blog post so that you can run them yourself as well.
1. Native parquet reader for AWS, GCP & Azure
Polars has mostly focused on local data. Since 0.19.4 we have included an async runtime for reading from the cloud. Initially this was pretty naive, and we needed a few more iterations to make it performant. The challenge is finding the proper concurrent downloading strategy while not downloading data you don’t need. We use information from the optimizer (projection and predicate pushdown), together with the metadata of the parquet files, to determine which files/row groups actually need to be downloaded. In both the streaming and the default engine, we try to prefetch data whilst giving the highest download priority to the data we immediately need.
We benchmarked this against PyArrow’s parquet.read_pandas (which actually returns an Arrow table) as a baseline. We also benchmarked against other popular solutions and found Polars to be very competitive.
1.1 Dataset
The benchmark reads 24 parquet files of approximately 125 MB each, with on average 21 row groups per file.
1.2 Read all data
For this benchmark we read all the data:
Polars
import polars as pl

(
    pl.scan_parquet("s3://polars-inc-test/tpch/scale-10/lineitem/*.parquet", rechunk=False)
    .collect(streaming=streaming)  # streaming can be True/False depending on the benchmark run
)
PyArrow
import pyarrow.parquet as pq

pq.read_pandas("s3://polars-inc-test/tpch/scale-10/lineitem/")
1.3 Read a single column
Alternatively, we select a single column. This allows the solution to download only the column data it needs.
Polars
(
    pl.scan_parquet("s3://polars-inc-test/tpch/scale-10/lineitem/*.parquet", rechunk=False)
    .select("l_orderkey")
    .collect(streaming=streaming)
)
PyArrow
pq.read_pandas("s3://polars-inc-test/tpch/scale-10/lineitem/", columns=["l_orderkey"])
Note that we set rechunk=False because PyArrow doesn’t concatenate data by default, and we want both tools to do the same amount of work. Polars’ cloud parquet reading can also make use of predicates (filters) in the query plan and, based on the statistics in the parquet files (or hive partitions), choose to skip a row group or a whole directory.
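As a hypothetical illustration (the l_shipdate filter below is an assumption for demonstration, not part of the benchmark), a query like the following only downloads the row groups whose statistics indicate they may contain matching rows:
# The filter is pushed down into the scan, so row groups whose min/max
# statistics rule out matching rows are never downloaded.
lazy = (
    pl.scan_parquet("s3://polars-inc-test/tpch/scale-10/lineitem/*.parquet")
    .filter(pl.col("l_shipdate") > pl.date(1998, 9, 1))
    .select("l_orderkey", "l_shipdate")
)
print(lazy.collect())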
2. Enum data type (#11822)
Version 0.20.0 introduces the Enum data type, a new way of handling categorical data. It guarantees consistent encoding across columns or datasets, making operations like appending or merging more efficient without requiring re-encoding or additional cache lookups. This differs from the Categorical type, which can incur performance costs due to category inference. However, it is important to note that the Enum type only allows predefined categories, and any non-specified value will result in an error. Enum is a reliable option for datasets with known categorical values, enhancing performance in data manipulation. Read more about the Enum data type and its benefits in the User guide.
Example of using the new Enum type:
# Example 1
>>> enum_dtype = pl.Enum(["Polar", "Brown", "Grizzly"])
>>> pl.Series(["Polar", "Grizzly", "Teddy"], dtype=enum_dtype)
ComputeError: value 'Teddy' is not present in Enum: LargeUtf8Array[Polar, Brown, Grizzly]
# Example 2
>>> enum_series = pl.Series(["Polar", "Grizzly", "Brown", "Brown", "Polar"], dtype=enum_dtype)
>>> cat_series = pl.Series(["Teddy", "Brown", "Brown", "Polar", "Polar"])
>>> enum_series.append(cat_series)
SchemaError: cannot extend/append Categorical(Some(enum), Physical) with Utf8
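And a minimal sketch of the happy path, reusing the bear categories from the examples above: when every value is in the predefined set, casting and appending work without any re-encoding.
enum_dtype = pl.Enum(["Polar", "Brown", "Grizzly"])
a = pl.Series(["Polar", "Grizzly"], dtype=enum_dtype)
b = pl.Series(["Brown", "Polar"]).cast(enum_dtype)  # casting from strings also works
combined = a.append(b)  # same Enum dtype on both sides, so no re-encoding is needed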
3. Polars plugins
Expression plugins have made their entrance. These allow you to compile a UDF (user defined function) in Rust and register it as an expression in the main Python Polars binary, which then calls the function via FFI. This means you get Rust performance and all the parallelism of the Polars engine, because there is no GIL locking.
Another benefit is that you can utilize all the crates available in the Rust ecosystem to speed up your business logic.
See more in the user guide.
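To give a flavor of the Python side, here is a rough sketch of registering a compiled plugin function as an expression. The crate location and the pig_latinnify symbol are hypothetical, and the registration helper shown comes from more recent Polars releases, so the exact API may differ; the user guide has the authoritative walkthrough.
from pathlib import Path

import polars as pl
from polars.plugins import register_plugin_function  # available in newer Polars releases

PLUGIN_DIR = Path(__file__).parent  # directory containing the compiled Rust library


def pig_latinnify(expr: pl.Expr) -> pl.Expr:
    # Register the (hypothetical) `pig_latinnify` symbol exported by the
    # compiled crate; Polars calls it over FFI, without holding the GIL.
    return register_plugin_function(
        plugin_path=PLUGIN_DIR,
        function_name="pig_latinnify",
        args=[expr],
        is_elementwise=True,
    )
Once registered, such a function composes like any other expression, e.g. df.with_columns(pig_latinnify(pl.col("names"))).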
4. Algorithmic speed-ups
Polars continuously makes performance-related improvements. Here we highlight a few that yielded solid gains. The script to run the benchmark yourself can be found in this GitHub Gist.
4.1 Scalable rolling quantiles
Performant rolling quantile functions are hard. If you implement them naively, you get worst-case $O(n \cdot k)$ behavior, which is $O(n^2)$ if $k = n$. Here $n$ is the dataset size and $k$ is the window size.
After we realized our rolling_quantile functions had terrible scaling behavior, we went back to the drawing board. We implemented an algorithm that is a combination of median filtering and merge sort. The result is a time complexity of $O(n \cdot \log(k))$. Below we show the rolling median results of:
- Polars 0.20.2
- Polars 0.19.0
- pandas 2.1.4
The benchmark runs the rolling_median function and computes the rolling median over 100M rows.
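A rough sketch of such a benchmark (the window sizes below are illustrative; the exact setup is in the linked Gist):
import time

import numpy as np
import polars as pl

# Illustrative setup: 100M random values, rolling median for growing window sizes.
s = pl.Series("x", np.random.default_rng(0).random(100_000_000))

for window_size in (1_000, 10_000, 100_000):
    t0 = time.perf_counter()
    s.rolling_median(window_size=window_size)
    print(f"window_size={window_size}: {time.perf_counter() - t0:.2f}s")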
You can see that Polars 0.19.0
grows linearly with the window size and thus goes off the charts. No good.
4.1.1 Zooming in
If we look at only pandas and Polars 0.20.2, we clearly see the benefits of our new algorithm. Polars scales much better and has better performance across the board. Note that this is a single-threaded benchmark; there may be further improvements once parallelism comes into play.
4.1.2 Scaling behavior
Increasing the window size really shows that we have $O(n \cdot \log(k))$ complexity.
This initial rewrite of our rolling quantile functions drew more attention to our rolling kernels. At the moment we are rewriting them all and expect to be able to run our rolling functions fast, streaming, and parallel (within the function).
4.2 Covariance and correlation
Covariance and correlation were still implemented naively in Polars. We have now given them proper attention and implemented SIMD-accelerated covariance and correlation kernels. The snippets below show how to compute these statistics in Polars.
4.2.1 Covariance in Polars
df.select(pl.cov("a", "b"))
4.2.2 Correlation in Polars
df.select(pl.corr("a", "b"))
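For a self-contained example (the random data below is purely illustrative):
import numpy as np
import polars as pl

rng = np.random.default_rng(0)
df = pl.DataFrame({"a": rng.random(10_000_000), "b": rng.random(10_000_000)})

# Both statistics are computed in a single select.
stats = df.select(
    pl.cov("a", "b").alias("cov_ab"),
    pl.corr("a", "b").alias("corr_ab"),
)
print(stats)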
As can be seen in the image above, Polars has improved its performance by roughly 3x since 0.19.0 and is now roughly 2x faster than NumPy and pandas for this operation.
5. Syntax improvements
Polars aims to create a consistent, expressive, and readable API. However, this can be at odds with writing quick interactive code - for example, when doing some data exploration in a notebook. We can introduce shorter, alternative syntax, but we also want to avoid multiple ways to write the same thing.
With this delicate balance in mind, we have made a number of changes to facilitate fast prototyping.
5.1 Compact notation for creating column expressions (#10874)
The col
function is probably the most used expression function in Polars. A shorter syntax is now available, taking
advantage of Python attribute syntax. Note that the ‘regular’ notation is still considered the idiomatic way to write
Polars code.
# Regular notation
pl.col("foo").sum()
# Short notation
pl.col.foo.sum()
# Shorter
from polars import col
col.foo.sum()
# Shortest
from polars import col as c
c.foo.sum()
5.2 Compact notation for defining filter inputs (#11740)
It is now possible to provide multiple inputs to the filter
function, which will evaluate to an and-reduction of all
inputs. Keyword argument syntax is also supported for simple equality expressions.
# Regular notation
df.filter((pl.col("a") == 2) & (pl.col("b") == "y"))
# Short notation
df.filter(a=2, b="y")
# Also works for when-then-otherwise expressions
df.select(pl.when(a=2, b="y").then(10))
6. Support for Apache Iceberg (#10375)
By popular demand, the option to scan Apache Iceberg tables has been implemented with the help of our community. The implementation integrates with several catalogs and has field-id-based schema resolution. On top of that, you can also connect to your cloud environments (AWS, Azure or GCP).
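A minimal sketch of scanning an Iceberg table (the metadata path below is hypothetical; pl.scan_iceberg also accepts a pyiceberg Table and a storage_options dict for cloud credentials):
import polars as pl

# Hypothetical Iceberg metadata location.
lf = pl.scan_iceberg("s3://my-bucket/warehouse/lineitem/metadata/v2.metadata.json")
print(lf.select("l_orderkey").head().collect())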
Find more information to start using it today in the API reference documentation.
For more information about Apache Iceberg, you can visit https://iceberg.apache.org
7. Closing remark
On GitHub, the Polars team has created a public backlog that shows which issues are accepted and what the team is currently working on. You can find the board on our GitHub page.
Are you working with Polars in your project(s)? Let us know! We are collecting interesting case studies and success stories to better understand how Polars is used by the community. You can share these with us at info@polars.tech or connect with us on the Polars Discord.
7.1 Benchmark scripts
Below are the scripts that we used to produce the benchmark scores in this blog post. You can also find them as GitHub Gists: