Hugging Face x Polars

Fri, 9 Aug 2024

Polars now natively supports loading datasets directly from Hugging Face, applying query optimizations like predicate and projection pushdown to speed up downloads and data processing.

Hugging Face Hub

Hugging Face is a collaboration platform for the machine learning (ML) community. The Hugging Face Hub functions as a central place where users can share and experiment with open-source ML. One core element of the Hub is datasets, which lets users host their data in a structured and accessible manner. Polars now natively supports reading Hugging Face datasets, so you can manipulate and transform the data for your ML model directly in Polars.

Figure: An example dataset in the Hugging Face Hub

Reading

In order to read a dataset, we need to construct a Hugging Face URI. This describes the location at which Polars can fetch the data. A Hugging Face URI can be constructed as follows:

hf://datasets/repository/path

repository is the location of the repository, usually in the format username/repo_name. A branch can optionally be specified by appending @branch.
path is a file or directory path, or a glob pattern, relative to the repository root.
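As a minimal sketch of how these parts fit together, the helper below assembles a URI from a repository, an optional branch, and a path. The function name hf_dataset_uri is hypothetical and not part of Polars; it only formats a string.

```python
from typing import Optional


def hf_dataset_uri(repository: str, path: str, branch: Optional[str] = None) -> str:
    """Build an hf://datasets/... URI from its parts (hypothetical helper)."""
    repo = repository.strip("/")
    if branch:
        # A branch is appended to the repository as @branch.
        repo = f"{repo}@{branch}"
    # The path is interpreted relative to the repository root.
    return f"hf://datasets/{repo}/{path.lstrip('/')}"


print(hf_dataset_uri("commoncrawl/statistics", "tlds.csv"))
print(hf_dataset_uri("username/repo_name", "data/*.csv", branch="branch"))
```

The resulting string can be passed to any of the Polars readers that accept a Hugging Face URI.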

import polars as pl

df = pl.read_csv(
    "hf://datasets/commoncrawl/statistics/tlds.csv",
    try_parse_dates=True,
)
df.head(3)

┌────────┬─────────────────┬────────────┬─────┬───────┬─────────┬──────────────────┐
│ suffix ┆ crawl           ┆ date       ┆ tld ┆ pages ┆ domains ┆ pages_per_domain │
│ ---    ┆ ---             ┆ ---        ┆ --- ┆ ---   ┆ ---     ┆ ---              │
│ str    ┆ str             ┆ date       ┆ str ┆ i64   ┆ f64     ┆ f64              │
╞════════╪═════════════════╪════════════╪═════╪═══════╪═════════╪══════════════════╡
│ net.bt ┆ CC-MAIN-2014-41 ┆ 2014-10-06 ┆ bt  ┆ 4     ┆ 1.0     ┆ 4.0              │
│ org.mk ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ mk  ┆ 1445  ┆ 430.0   ┆ 3.360465         │
│ com.lc ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ lc  ┆ 1     ┆ 1.0     ┆ 1.0              │
└────────┴─────────────────┴────────────┴─────┴───────┴─────────┴──────────────────┘

The path may include glob patterns such as **/*.parquet to query all files matching the pattern. For file formats that Polars does not support, you can fall back to the auto-converted Parquet files that Hugging Face provides on the @~parquet branch:

hf://datasets/repository@~parquet/path

Code snippets

Hugging Face provides code snippets to make reading datasets in Polars even easier. Under “Use this dataset” you can find Polars:

Figure: The Polars option under “Use this dataset”

Clicking on Polars will show you the code snippet you can use to read the dataset including a link to the documentation.

Figure: The Polars code snippet for a dataset

Transforming

Once read, you can use the full Polars API to transform the machine learning dataset in any way you like. For example, we could calculate the total number of scraped pages per top-level domain (TLD) like this:

df.group_by("tld").agg(
    pl.col("pages").sum()
).sort("pages", descending=True).head(5)

Alternatively, we could use the lazy API to express the read and the aggregation as a single query. This has the added benefit that Polars does not read the entire dataset into memory, only the columns and rows that are relevant to the query. The benefit is even larger for Parquet datasets, where Polars can skip downloading entire regions of a file, potentially saving a lot of time.

pl.scan_csv(
    "hf://datasets/commoncrawl/statistics/tlds.csv",
    try_parse_dates=True,
).group_by("tld").agg(
    pl.col("pages").sum()
).sort("pages", descending=True).head(5).collect()

shape: (5, 2)
┌─────┬──────────────┐
│ tld ┆ pages        │
│ --- ┆ ---          │
│ str ┆ i64          │
╞═════╪══════════════╡
│ com ┆ 149959983274 │
│ org ┆ 18417978084  │
│ ru  ┆ 11382068527  │
│ net ┆ 10774389510  │
│ de  ┆ 8821759179   │
└─────┴──────────────┘

Supported file formats

Polars supports the following file formats, through both the eager read_xxx functions and their lazy scan_xxx counterparts.

Parquet
CSV
JSON / Newline-delimited JSON

Authentication

Hugging Face supports private and gated datasets, which are not publicly accessible. To access such a dataset, you need to provide your Hugging Face access token to Polars. You can create an access token at Settings -> Access Tokens or go to the following guide. The access token can be provided to Polars in three different ways:

By environment variable

HF_TOKEN=hf_xxxxx

By parameter

ACCESS_TOKEN = "hf_xxxxx"  # placeholder; substitute your own access token

pl.read_parquet(
    "hf://datasets/roneneldan/TinyStories/data/train-*.parquet",
    storage_options={"token": ACCESS_TOKEN},
)
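To avoid hard-coding a token in source code, the storage_options dictionary can be built from an environment variable of your choosing. This is a minimal sketch: the helper name storage_options_from_env and the variable name MY_HF_TOKEN are hypothetical, and it is mainly useful when the token is stored under a name other than HF_TOKEN.

```python
import os


def storage_options_from_env(var_name: str) -> dict:
    """Read an access token from the named environment variable (hypothetical helper)."""
    token = os.environ.get(var_name)
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(f"environment variable {var_name} is not set")
    return {"token": token}
```

The result can then be passed as storage_options=storage_options_from_env("MY_HF_TOKEN") in the read_parquet call shown above.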

By CLI

Alternatively, you can use the Hugging Face CLI to authenticate. After you log in successfully with huggingface-cli login, an access token is stored in the HF_HOME directory, which defaults to ~/.cache/huggingface. Polars then uses this token for authentication.

Conclusion

In short, Polars can now read Hugging Face datasets directly through hf:// URIs, and with the lazy API it only downloads the columns and rows a query needs. Polars handles big datasets efficiently, while Hugging Face provides a platform for hosting and sharing them. Together, they make preparing data for machine learning smoother and faster.

For a more detailed explanation, check out the Hugging Face Polars documentation.