TLDR: We are launching our distributed engine on Kubernetes. Your own infrastructure, fast boot times, high performance, and ease of use from the same Polars API. Try it out for free at https://cloud.pola.rs/.
When we launched Polars Cloud on AWS last year we immediately got a lot of attention from Polars users to support other environments as well. Many Polars users love our API and performance, but are stuck in using multiple frameworks (e.g. Polars & Spark) to support queries that couldn’t fit on a single machine. This is a problem we want to solve with Polars. With this release teams can now deploy our distributed engine inside their own infrastructure and experience a single framework that works well on any scale.
When you run a Polars cluster in a kubernetes cluster. Users can decouple the hardware for the compute from the client that executes that query. You can run a large petabyte scale distributed query that is fired from your laptop, or run many parallel queries on the cluster. Scaling horizontally is easy.
A single binary and helm chart
Open source Polars is fast, lightweight, versatile and has very fast boot times. Getting started is easy and can be replicated in many different environments. Distributed engines have always felt the opposite. Deployments span multiple microservices often with heavy runtimes (JVM) and boot times in the minutes or even tens of minutes.
When developing our distributed engine we wanted to keep that lightweight and snappy experience. Starting a Polars cluster is easy and takes seconds, with a simple helm chart and helm install command. This makes it easy to spin up a dedicated cluster for each ETL job or have a single long running one. With this design, you can also deploy this in many other environments outside of kubernetes (including bare metal linux, SLURM). Contact us to discuss the options.
Beyond the compute
Besides getting access to the distributed engine, deploying to Kubernetes also provides several other new features:
Query Profiling
Early on our distributed engine was a black box. Until the query successfully finished you didn’t know what it was doing. To make the engine more transparant, we now ship a frontend and API on top of it to facilitate query profiling. This shows you exactly what operation is being performed, how many rows are being processed at each node in real time, and how much data is being shuffled between workers. This also works great for single node queries and gives you an unparalleled insight in Polars’ engine at runtime.
This profiling data is saved and available historically to debug old pipeline runs and find performance bottlenecks. You can view an example query below:
Data Lineage
With the use of LLMs, it is increasingly becoming more important to provide the right context. For instance about how and when this table was created and updated.
OpenLineage is an open framework for data lineage collection and analysis which creates a deeper understanding of how data is produced and used. With this release our engine is able to emit OpenLineage events to any collector in your environment (for example Marquez).
Getting Started
Deploying a Polars is best done through helm, a package manager. First create your account on https://cloud.pola.rs and export your service account credentials. Starting a cluster is done with a simple helm command:
helm repo add polars-inc https://polars-inc.github.io/helm-charts && helm repo update
helm upgrade --install polars polars-inc/polars \
--set license.onPrem.enabled=true \
--set license.onPrem.workspaceId=<WORKSPACE_ID> \
--set license.onPrem.clientId=<SERVICE_ACCOUNT_ID> \
--set license.onPrem.clientSecret=<SERVICE_ACCOUNT_SECRET>
– set ….
With the cluster up and running you can use your existing Polars code, but instead of executing locally you point it to the remote cluster by calling .remote(ctx) on the LazyFrame
import polars as pl
import polars_cloud as pc
ctx = pc.ClusterContext(compute_address="...")
query = (
pl.scan_parquet("s3://bucket/data.parquet")
.filter(pl.col("value") > 0)
.group_by("category")
.agg(pl.sum("value"))
)
result = query.remote(ctx).execute()
The full walkthrough, including setup guides for Amazon Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), and bare metal can be found in the getting started guide.
Next steps
Our team is busy with the next set of features, stay tuned for more updates the coming weeks:
- Benchmarks: We are already much faster than Apache Spark in a distributed setting. There is a big optimization we will land in a few weeks. We will share more on this soon.
- Expressions: For complex Polars expressions our distributed engine currently falls back to a single node. The next update contains improvements to the query planner enabling parallel execution of these expressions.
- Distributed Iceberg round trip: Reading from Iceberg occurs in a distributed manner, but writes do not yet. In the next update, we will include atomic writes and add distributed Iceberg sinks.
Closing remarks
We are providing a generous free tier for engineers trying out distributed Polars on Kubernetes. The first 10.000 vCPU hours per month are free1. Go to https://cloud.pola.rs/ and get up and running in less than 10 minutes with our detailed getting started guide.
Footnotes
-
With a maximum of one concurrent cluster and a cluster size up to 64 nodes and 1024 cores. ↩