
Polars in Aggregate: Streaming Expands, Lakehouse I/O, and Cloud Profiling

By Thijs Nieuwdorp on Thu, 16 Apr 2026

Since the last edition of Polars in Aggregate in December:

  • 12 releases have been published.
  • 778 PRs were merged to open source Polars.
  • 95 contributors submitted code to open source Polars. Thank you!

This post summarizes the biggest highlights of that work.

Highlighted Features

The Streaming Engine Keeps Expanding

PRs: #26398, #26563

This quarter extended streaming support to more query patterns: 1.38 added a streaming merge join, and 1.39 added a streaming AsOf join. This brings us closer to making the streaming engine the default for everyone. We already recommend enabling it by calling pl.Config.set_engine_affinity("streaming") at the beginning of your code.

Streaming Scan and Sink Across All Major Formats

PRs: #25948, #25900, #25746, #25910, #25964

All major formats, including CSV, now have a streaming scan implementation, making reads both faster and more memory-efficient. In 1.37 the new sink pipeline landed for NDJSON, CSV, and IPC, and partitioned sink_* usage now dispatches to the new streaming engine by default. 1.39 added support for scan_ndjson() and scan_lines() to stream files directly from cloud storage instead of downloading the whole file first.

Delta Lake Read/Write

PRs: #25994, #26190

Delta got sink_delta() in 1.37, which means lazy queries can now write directly back to Delta Lake without first materializing the whole result in memory. In 1.38, scan_delta() was refactored onto the Python dataset interface and gained file-statistics-based predicate pushdown for batch reads, making selective scans cheaper.
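A hedged sketch of the roundtrip; the table locations and the filter column are made up for illustration, and running it requires Delta Lake dependencies and reachable storage:

```python
import polars as pl

# Hypothetical Delta table locations.
src = "s3://my-bucket/sales_delta"
dst = "s3://my-bucket/sales_2025_delta"

# File-statistics-based predicate pushdown (1.38) lets this filter skip
# whole files whose min/max statistics cannot contain matching rows.
lf = pl.scan_delta(src).filter(pl.col("year") == 2025)

# Stream the result back to Delta Lake (1.37) without collecting it first.
lf.sink_delta(dst)
```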

Apache Iceberg Read/Write

PRs: #26242, #26799, #26826

1.39 completes the Iceberg read/write roundtrip with sink_iceberg(). Polars was already able to read Iceberg tables, but it can now scan, transform, and commit a new Iceberg snapshot from the streaming engine. scan_iceberg() now accepts table identifier strings, making it easier to integrate with catalog-backed workloads.
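A similar hedged sketch for Iceberg; the table identifiers are made up, catalog configuration is omitted, and running it requires Iceberg dependencies and a reachable catalog:

```python
import polars as pl

# scan_iceberg() now also accepts a catalog table identifier string.
lf = pl.scan_iceberg("warehouse.sales")

# Transform on the streaming engine and commit a new Iceberg snapshot.
(
    lf.group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .sink_iceberg("warehouse.sales_by_region")
)
```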

Polars Cloud Gets Better Operational Tooling

This quarter brought the first layer of serious observability to Polars Cloud. The new Query Profiler exposes per-stage resource metrics, physical-plan details, shuffle percentages, and node-level CPU, RAM, and network usage. In the accompanying post we used it to tune a TPC-H query from a single oversized machine to a better-fitting cluster in five runs, ending up 54% faster and 64% cheaper. This allows users to move beyond “does it run?” to “why is this query expensive?”.

On top of that, we are continuing our work to make the distributed query planner faster and have implemented a cost-based planner whose plans run, on average, 10% faster.

New Partitioning API: pl.PartitionBy

PR: #26004

As Polars adds more sink paths and more cloud-oriented execution, users need one consistent way to describe how output should be partitioned. PartitionBy gives that configuration a real object instead of scattering it across ad-hoc parameters. It expresses hive-style partitioning, row- or byte-based file size limits, and user-definable file path logic, and plugs directly into the sink_* APIs.

And Much More

This is just a glimpse of what we’ve built.

If you want to stay up to date, you can follow us on the following platforms:

[LinkedIn] - [Twitter/X] - [Bluesky] - [GitHub]

And you can learn more about Polars Cloud at: https://cloud.pola.rs/
