Polars in Aggregate: Improved I/O, Hugging Face support, and more

Thu, 25 Jul 2024

Polars in Aggregate is a summary of what we have been working on in recent weeks. This can be seen as a cherry-picked changelog with a bit more context. This Polars in Aggregate covers the releases of Polars 1.0 to Polars 1.2.1.

Cloud file cache manager (#16892)

The CSV, IPC and NDJSON file formats lack the extensive metadata available in Parquet files that allows for skipping parts of the files during downloads. As such, they often require much more time to download. To improve performance, a file caching mechanism has been introduced for the scan_(csv|ipc|ndjson) functions. Downloaded cloud data is now temporarily persisted to disk, allowing subsequent scans on the same file to potentially skip re-downloading the same data. Some key features include:

  • Persistence across process restarts: Cloud files are stored at a consistent location keyed by their path and last-modified time, allowing new Polars processes to re-use downloads from previous Polars processes.
  • Inter-process synchronization: As the cached files can be shared by multiple Python processes running Polars on the same machine, access to the cached files is performed with process-level synchronization.
  • Remote file updates: Because the remote file can change, the first collect() of a LazyFrame verifies that the cached file is still up to date with the remote file and downloads a newer version when necessary.
  • Cleaning up: By default, cached files are removed if they have not been accessed within the last hour. This is configurable using the file_cache_ttl argument in supported scan functions, as shown in the sketch below.
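
For example, a scan of a cloud-hosted CSV might look like the following sketch. The bucket and object path are placeholders, and we assume here that scan_csv accepts file_cache_ttl directly:

import polars as pl

# Hypothetical cloud path; the first collect() downloads the file and caches it on
# disk, so later scans of the same (unchanged) file can skip the download entirely.
lf = pl.scan_csv(
    "s3://my-bucket/data/events.csv",
    file_cache_ttl=3600,  # drop the cached copy after one hour without access
)
lf.collect()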

Improved Hive partition support

It is very common for large datasets to be stored using Hive partitioning. This allows predicates to be applied at the file level, significantly speeding up queries. In this update, we have added support for datetime partition keys, which are common for time series data and remove the need to convert dates to strings and back. Another big update is support for writing Hive partitioned datasets.

Datetime parsing for Hive partition keys (#17256)

Polars can now natively load datetime values from Hive paths, as well as perform partition pruning on datetime parts to skip loading unneeded data (#17545):

import polars as pl
from datetime import date

lf = pl.scan_parquet(
    "hive_dates/date1=2024-01-01/date2=2023-01-01%2000%3A00%3A00.000000/00000000.parquet",
    hive_partitioning=True,
)

lf.collect()
shape: (1, 3)
┌────────────┬─────────────────────┬─────┐
│ date1      ┆ date2               ┆ x   │
│ ---        ┆ ---                 ┆ --- │
│ date       ┆ datetime[μs]        ┆ i32 │
╞════════════╪═════════════════════╪═════╡
│ 2024-01-01 ┆ 2023-01-01 00:00:00 ┆ 1   │
└────────────┴─────────────────────┴─────┘
lf.filter(pl.col("date1") == date(1970, 1, 1)).collect()
# Parquet file can be skipped, the statistics were sufficient to apply the predicate.

Writing Hive partitioned Parquet datasets (#17324)

DataFrames can now be easily written in a Hive partitioned manner by using the partition_by argument in write_parquet:

from datetime import date, datetime

df = pl.DataFrame(
    {
        "date1" : [    date(2024, 1, 1),     date(2024, 1, 2)],
        "date2" : [datetime(2024, 1, 1), datetime(2024, 1, 2)],
        "x"     : [                   1,                    2],
    }
)

df.write_parquet("out/", partition_by=["date1", "date2"])
out/
├── date1=2024-01-01
│   └── date2=2024-01-01%2000%3A00%3A00.000000
│       └── 00000000.parquet
└── date1=2024-01-02
    └── date2=2024-01-02%2000%3A00%3A00.000000
        └── 00000000.parquet

Scanning Hive partitioned IPC/Feather datasets (#17434)

Hive partitioned IPC datasets can now be loaded using scan_ipc, and they enjoy the same partition pruning optimizations as scan_parquet.
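
As a minimal sketch (the dataset path and partition column are placeholders):

import polars as pl
from datetime import date

lf = pl.scan_ipc("dataset/date1=*/*.feather", hive_partitioning=True)

# Partitions whose date1 value cannot match the predicate are pruned and never read.
lf.filter(pl.col("date1") == date(2024, 1, 1)).collect()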

Reimplementation of the Struct data type (#17522)

We have completely reimplemented the Struct data type. This was needed because our previous implementation didn’t support outer nullability.

For example, given a Struct with fields "a" and "b", we did support inner nullability ([{"a": None, "b": 0}, {"a": 1, "b": None}]), but we did not support outer nullability ([None, {"a": 1, "b": 1}]), where the whole element is null rather than some or all of its inner fields.

With the merging of #17522 and follow-up PRs, we fixed a bunch of bugs regarding outer nullability and chunking.
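
A minimal sketch with made-up data shows the difference:

import polars as pl

# The first element is null as a whole (outer nullability); the second only has
# non-null inner fields.
s = pl.Series("s", [None, {"a": 1, "b": 1}])
print(s.dtype)      # a Struct with fields "a" and "b"
print(s.is_null())  # [true, false]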

Hugging Face Support (#17665)

Hugging Face datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

Polars already provided support through the external filesystem package fsspec. However, this approach did not allow pushing down predicates, which meant we always read the entire dataset.

With this update, we now provide native support for the Hugging Face file system, which greatly improves download times as Polars’ native pre-fetcher can do its magic. You can directly query datasets with both the lazy (scan_xxx) and eager (read_xxx) methods using a Hugging Face URI and get all the benefits of our query optimizer.

pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-*.parquet")

The URI can be constructed from the username and dataset name like this: hf://datasets/{username}/{dataset}/{path_to_file}. The path may include globbing patterns to query all the files matching the pattern. Additionally, for any non-supported file formats, you can use the auto-converted Parquet files that Hugging Face provides using the @~parquet branch.
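
As a sketch of how the lazy API comes into play (the "text" column name is an assumption about this particular dataset's schema):

lf = pl.scan_parquet("hf://datasets/roneneldan/TinyStories/data/train-*.parquet")

# Projection and predicate pushdown limit what actually needs to be downloaded.
lf.select("text").filter(pl.col("text").str.contains("dragon")).head(5).collect()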

Support for right joins (#17441)

This one was a long time coming. All the architecture was available, but we never gave it priority. Now we have, and Polars finally supports right joins:

df_customers = pl.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
    }
)

df_orders = pl.DataFrame(
    {
        "order_id": ["a", "b", "c"],
        "customer_id": [1, 2, 2],
        "amount": [100, 200, 300],
    }
)

df_customers.join(df_orders, on="customer_id", how="right")
shape: (3, 4)
┌───────┬──────────┬─────────────┬────────┐
│ name  ┆ order_id ┆ customer_id ┆ amount │
│ ---   ┆ ---      ┆ ---         ┆ ---    │
│ str   ┆ str      ┆ i64         ┆ i64    │
╞═══════╪══════════╪═════════════╪════════╡
│ Alice ┆ a        ┆ 1           ┆ 100    │
│ Bob   ┆ b        ┆ 2           ┆ 200    │
│ Bob   ┆ c        ┆ 2           ┆ 300    │
└───────┴──────────┴─────────────┴────────┘

As we plan to support more exotic join operations, such as non-equi joins, we started by fully supporting the “default” ones. Polars now supports the following join types: inner, left, right, full, cross, semi, anti.
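
For example, reusing the frames above, an anti join keeps only the customers without a matching order:

# Rows of the left frame that have no match in the right frame (here: Charlie).
df_customers.join(df_orders, on="customer_id", how="anti")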

Binary view memory mapping in IPC (#17084)

In an earlier post, we introduced Polars’ new string type implementation. Back then, we were one of the first Arrow-native projects to adopt the Arrow binary view type. This meant that we had to be conservative about exposing the internals to other Arrow consumers. As of today (pyarrow==17.0.0), Arrow is fully capable of reading and writing this new data type, which allows us to write this format to IPC/Arrow files directly.

This means that users can memory map IPC/Arrow files without triggering a full read. This can serve as a great caching mechanism for large datasets that don’t fit into RAM.
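
A minimal sketch with a hypothetical file path, using the memory_map flag of read_ipc:

import polars as pl

df = pl.DataFrame({"words": ["memory", "mapped", "strings"]})
df.write_ipc("cache.arrow")

# The string data is memory mapped instead of being copied into RAM up front.
pl.read_ipc("cache.arrow", memory_map=True)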

Join the Polars community

Did you know that Polars has an active community on Discord? It’s a great place to meet other Polars users, ask questions and help others by answering their questions. Join the Discord today!

We are hiring

Are you excited about the work we do at Polars? We are hiring Rust software engineers to work on realizing the new era of DataFrames.

Rust backend engineer

As a founding Backend Engineer, you’ll shape the design of Polars Cloud in order to create a seamless experience for our users. Together with our product manager and full stack engineers, you will transform requirements into technical designs and implementations.

Database engineer

As a database engineer, you actively work on our DataFrame query engine. Together with the core team and a vibrant open source community, you shape the future of our API and its implementation. In your work, you’ll identify important areas with the highest community impact and start designing and implementing.
