Updated PDS-H benchmark results (May 2025)

Sun, 1 Jun 2025

It's been a while since we last published benchmark results here. A lot of hard work has gone into optimizing our query engine since then, and other DataFrame libraries such as pandas have also made great improvements. Time to see where we stand!

PDS-H benchmark

Polars Decision Support (PDS) benchmarks are derived from the TPC-H Benchmarks and as such any results obtained using PDS are not comparable to published TPC-H Benchmark results, as the results obtained from using PDS do not comply with the TPC-H Benchmarks.

The original TPC-H benchmark is a decision support benchmark. It contains a standardized way to generate data on various scales and a set of SQL queries to run.

By comparing the performance of various engines on each query, it’s easy to see the strong and weak points of each engine.

Adjusting for DataFrame front-ends

The original TPC-H benchmark is intended for SQL databases and doesn’t allow modification of the SQL queries. We are trying to benchmark both SQL front-ends and DataFrame front-ends, so the original rules have to be modified a little. We believe that the SQL queries should be translated semantically to the idiomatic query of the host tool. To do this, we adhere to the following rules:

  • It is not allowed to insert new operations, e.g. pruning a table before a join.
  • Every solution must provide one query per question independent of the data source.
  • The solution must call its own API.
  • It is allowed to declare the join type, as this fits the semantic reasoning of DataFrame APIs.
  • A solution must choose a single engine/mode for all the queries. It is allowed to propose different solutions from the same vendor, e.g. (spark-sql, pyspark, polars-sql, polars-default, polars-streaming). However, these solutions should run all the queries, showing their strengths and weaknesses. No cherry-picking.
  • Joins may not be reordered manually.
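
To make the translation rule concrete, here is a minimal sketch of how a simplified PDS-H-style SQL query could be expressed semantically in Polars' lazy API. The file paths and the query itself are illustrative only and are not taken from the benchmark suite:

  import polars as pl

  # Hypothetical table locations; the benchmark reads the generated PDS-H tables.
  customer = pl.scan_parquet("customer.parquet")
  orders = pl.scan_parquet("orders.parquet")

  # SQL (simplified):
  #   SELECT c_custkey, COUNT(*) AS order_count
  #   FROM customer JOIN orders ON c_custkey = o_custkey
  #   WHERE o_orderdate < DATE '1995-03-15'
  #   GROUP BY c_custkey
  #
  # The DataFrame version keeps the same semantic operations: no extra
  # pre-filtering before the join and no manual join reordering.
  query = (
      customer.join(orders, left_on="c_custkey", right_on="o_custkey", how="inner")
      .filter(pl.col("o_orderdate") < pl.date(1995, 3, 15))
      .group_by("c_custkey")
      .agg(pl.len().alias("order_count"))
  )

  result = query.collect()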

Setup

Our implementation of the benchmark is open source and available on GitHub.

Libraries

We benchmarked the latest versions of the following libraries:

  • Polars 1.30.0 (in-memory and streaming engines)
  • DuckDB 1.3.0
  • Dask 2025.5.1
  • PySpark 4.0.0
  • pandas 2.2.3

The following configuration is relevant for the various libraries:

  • For PySpark, driver memory and executor memory were both set to 100g.
  • For Dask, the threads scheduler was used.
  • For pandas/Dask/Modin, PyArrow data types were enabled.
  • For pandas, copy-on-write was enabled.
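
For reference, the sketch below shows roughly how these settings are applied in Python. It is a minimal illustration of the configuration listed above, not the exact benchmark code, and the data path is a placeholder:

  import pandas as pd
  import dask
  from pyspark.sql import SparkSession

  # pandas: copy-on-write and PyArrow-backed data types.
  pd.set_option("mode.copy_on_write", True)
  df = pd.read_parquet("lineitem.parquet", dtype_backend="pyarrow")  # placeholder path

  # Dask: use the threads scheduler.
  dask.config.set(scheduler="threads")

  # PySpark: driver and executor memory both set to 100g.
  spark = (
      SparkSession.builder.config("spark.driver.memory", "100g")
      .config("spark.executor.memory", "100g")
      .getOrCreate()
  )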

Note that we no longer include the following DataFrame libraries:

  • vaex, because it is no longer being maintained.
  • Modin, because we had a hard time getting better results than with pandas.

Note that not all queries are implemented for all libraries. This is an ongoing process and PRs are welcome.

Common hardware/software

All queries were run under the following conditions:

  • Machine: AWS c7a.24xlarge (96 vCPUs / 192 GB memory)
  • OS: Ubuntu 22.04 LTS x86-64

Scale factors

PDS-H allows for running queries at various data sizes. We did a run at scale factor 10 and scale factor 100, where one unit of scale factor is roughly equivalent to 1 GB of CSV data. Tools that performed poorly on SF-10 were excluded from the SF-100 run, as their performance issues would only worsen at higher scale factors.

Results

Below are bar charts showing execution times (in seconds) for each query, with shorter bars indicating better performance. For both SF-10 and SF-100, Polars and DuckDB prove to be in a league of their own, being an order of magnitude faster than Dask and PySpark. pandas is only run in the SF-10 benchmark, as its single-threaded execution and lack of a query optimizer lead to a two-orders-of-magnitude difference and OOM failures at higher scale factors.

SF-10

[Bar chart: per-query execution times at SF-10]

solution                    total_time (s)    factor
polars[streaming]-1.30.0              3.89       1.0
duckdb-1.3.0                          5.87       1.5
polars[in-memory]-1.30.0              9.68       2.5
dask-2025.5.1                        46.02      11.8
pyspark-4.0.0                       120.11      30.9
pandas-2.2.3                        365.71      94.0

SF-100

Once we scale to SF-100, we see that Polars' in-memory engine slows down a lot relative to the other solutions. This is mostly due to CPU cache misses: as the in-memory engine works on all data in memory at once, we become memory bound. Polars' streaming engine doesn't suffer from these effects, and it also has better core utilization. Polars' streaming engine and DuckDB again show similar results. Polars falls behind on query 21, which contains a range join that we haven't implemented in the streaming engine yet. For this query we fall back to the in-memory engine for the range join (until we have streaming support).

[Bar chart: per-query execution times at SF-100]

solution                    total_time (s)    factor
duckdb-1.3.0                         19.65       1.0
polars[streaming]-1.30.0             23.94       1.2
polars[in-memory]-1.30.0            152.27       7.7
pyspark-4.0.0                       312.43      15.9
dask-2025.5.1                       548.52      27.9

Takeaways

The benchmark shows the huge leaps we have made with the new streaming engine. Compared to the in-memory engine, Polars' streaming engine can be up to 3-7x faster. Your mileage will of course vary, depending on your use case and hardware. To benefit from our latest improvements, opt in to the new streaming engine by calling pl.Config.set_engine_affinity(engine="streaming") at the start of your project. We expect the streaming engine to become the default engine for Polars' lazy API later this year.
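
In practice, opting in looks like the minimal sketch below; the scan path is a placeholder and the query is just an illustration:

  import polars as pl

  # Prefer the streaming engine for lazy query execution.
  pl.Config.set_engine_affinity(engine="streaming")

  lazy = (
      pl.scan_parquet("lineitem.parquet")  # placeholder path
      .group_by("l_returnflag", "l_linestatus")
      .agg(pl.sum("l_quantity").alias("sum_qty"))
  )

  result = lazy.collect()  # runs on the streaming engine
  # Individual queries can also select an engine explicitly, e.g. lazy.collect(engine="streaming").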
