Polars has been known to be amongst the fastest single node engines out there. Since last year we have been hard at work to build out our distributed engine. In this post we share our distributed benchmarks for the first time. On the PDS-H benchmark, Polars consistently outperforms Spark, delivering up to 7.7x faster performance and a 3.2x average speedup. On a single node, performance gains reach as high as 38x, with an average speedup of 6.4x.
Setup
Polars Decision Support (PDS-H) is an open implementation derived from the TPC-H benchmark. It uses the TPC-H data model, data generation methodology, and query workload to measure analytical query performance across a potential range of dataset sizes. While PDS-H closely follows the TPC-H specification, it is not an officially audited or certified TPC-H benchmark. Consequently, PDS-H results are intended for comparative evaluation within PDS-H and are not directly comparable to published TPC-H benchmark results.
Libraries
- Polars-cloud (0.8.0)
- Polars (1.41.1)
- PySpark (4.0.1)
Hardware specifications
- Single node: m8id.32xlarge (128 vCPUs, 512 GB RAM, 2.7 TB SSD)
- Distributed: 32 * m8id.xlarge (4 vCPUs, 16 GB RAM, 240 GB), the same total vCPUs, RAM and storage as single node, but spread over 32 workers.
The benchmarks were run on a scale factor of 1000, which means the dataset used is the size of roughly a terabyte of data if it were stored in uncompressed CSV.
To avoid any outliers (e.g. bad network connections), we ran the benchmarks three times and measured the fastest result. The code can be found in the following GitHub repository.
Single Node
Many of today’s workloads fit onto one large machine. In AWS (and other vendors too) a single machine can easily have more than a terabyte of RAM (for instance: m8i.96xlarge), making single node a cost-effective way to run workloads.
Below are the results:
Overall this leads to the following summary:
On total runtime, Polars is ~6.4x faster than PySpark on a single node, ranging from 3x to 38x faster per query.
Distributed
For the distributed benchmark, the data can’t be loaded onto the SSD first. The benchmark results below include network I/O to and from S3. Hence, we can’t compare the results from single and distributed runs. This will be in our next blog post.
On total runtime, Polars is ~3.2x faster than PySpark on a distributed cluster, ranging from 1.6x to 7.7x faster per query.
Takeaways
Polars has been known to be fast on a single node, but it is also very efficient when running distributed. This means you can write high performance code on your local machine and scale up to a distributed cluster when needed.
The distributed version of Polars is available on AWS or on-premise (bare-metal and Kubernetes) with a generous free tier. We encourage everyone to try it out. You can start by signing up for free on our website or contact us for more details.
What’s Next?
In this blog post we compared Polars to PySpark on a single node and a distributed cluster. Next up we want to help you understand when it is best to run distributed vs single node, by comparing Polars with itself (single node & distributed). You might think that single node is always faster than distributed given the same amount of CPU & RAM, but this is not always the case. For some of the queries above (e.g. Q1) Polars is limited by the I/O bandwith on the network and having a cluster of smaller machines is actually faster than a single node. Having the ability to choose is key to getting the best performance for your queries.