Breaking the rules with expression expansion

Thu, 5 Dec 2024

In a social media post, we asked you a question: doesn’t the expression .struct.unnest break one of Polars’ most fundamental principles, namely that one expression must produce exactly one column as output? This article gives a detailed answer to that question, but we can spoil the final conclusion for you: everything will be fine in the end.

TL;DR:

If you’re in a hurry, the short answer is that .struct.unnest uses expression expansion. Polars expands .struct.unnest into one expression per field of your struct, which is why it looks as if a single expression produces multiple columns as output.

The problem

Polars queries obey a simple rule: one expression in your query produces one, and only one, column in the output:

import polars as pl

df = pl.DataFrame(
    {
        "name": ["Anne", "Abe", "Anastasia", "Anton"],
        "age": [16, 23, 62, 8],
    }
)

print(
    df.select(
        pl.col("name").str.len_chars().alias("name_length"),             # 1
        (pl.col("age") > 18).alias("is_adult"),                          # 2
        (2024 - pl.col("age")).alias("birthyear"),                       # 3
        (pl.col("name").str.len_chars() * pl.col("age")).alias("hun?"),  # 4
    )
)

shape: (4, 4)
# 1           # 2        # 3         # 4
┌─────────────┬──────────┬───────────┬──────┐
│ name_length ┆ is_adult ┆ birthyear ┆ hun? │
│ ---         ┆ ---      ┆ ---       ┆ ---  │
│ u32         ┆ bool     ┆ i64       ┆ i64  │
╞═════════════╪══════════╪═══════════╪══════╡
│ 4           ┆ false    ┆ 2008      ┆ 64   │
│ 3           ┆ true     ┆ 2001      ┆ 69   │
│ 9           ┆ true     ┆ 1962      ┆ 558  │
│ 5           ┆ false    ┆ 2016      ┆ 40   │
└─────────────┴──────────┴───────────┴──────┘

The “one expression results in one column” rule is a design decision that makes it easier to reason about the schema of the resulting dataframe of a complex query. However, when using the expression .struct.unnest, it looks like a single expression is producing multiple columns:

df = pl.DataFrame(
    {
        "structs": [
            {"a": 1, "b": 2, "c": 3},
            {"a": 4, "b": 5, "c": 6},
            {"a": 7, "b": 8, "c": 9},
        ]
    }
)

print(
    df.select(
        pl.col("structs").struct.unnest(),  # 1
    )
)

shape: (3, 3)
# 1   # 2   # 3
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 7   ┆ 8   ┆ 9   │
└─────┴─────┴─────┘

The rule that one expression must produce one column seems to be at odds with the example above, but that’s not really the case. To understand why, you need to learn about expression expansion.

Expression expansion

Expression expansion is a Polars feature that lets you write short but powerful expressions whilst avoiding structural repetition. When you want to write the same expression for multiple columns, you can write a single expression that Polars will expand into multiple parallel expressions.

This snippet shows two expressions that are structurally identical:

df = pl.DataFrame(
    {
        "first_name": ["Anne", "Abe", "Anastasia", "Anton"],
        "last_name": ["Holmes", "Watson", "Kent", "Wayne"],
    }
)
df.select(
    pl.col("first_name").str.len_chars(),  # 1.
    pl.col("last_name").str.len_chars(),   # 2.
)

We can use expression expansion with the function pl.col to avoid the repetition:

df.select(
    pl.col("first_name", "last_name").str.len_chars(),
)

In this particular instance, we are using a type of expression expansion in which we know, beforehand, how many parallel expressions Polars will create. That’s because we explicitly listed, in the call to pl.col, all the columns for which we wanted to compute the string length.

Expression expansion can also be used in such a way that the number of parallel expressions that Polars executes will depend on the schema of the dataframe the expression is used on. As an example of this, consider the following expression:

import polars.selectors as cs

plus_one = cs.integer() + 1

The expression plus_one uses Polars’ column selectors to refer to “all” integer columns. However, since an expression is just a representation of a computation, plus_one needs a Polars context to determine exactly which columns are integer columns. In the context of the dataframe below, which only has string columns, the expression refers to zero columns:

df = pl.DataFrame(
    {
        "first_name": ["Anne", "Abe", "Anastasia", "Anton"],
        "last_name": ["Holmes", "Watson", "Kent", "Wayne"],
    }
)
print(df.select(plus_one))

shape: (0, 0)
┌┐
╞╡
└┘

If we go back to the first dataframe used in this article, then plus_one expands into a single column:

df = pl.DataFrame(
    {
        "name": ["Anne", "Abe", "Anastasia", "Anton"],
        "age": [16, 23, 62, 8],
    }
)
print(df.select(plus_one))

shape: (4, 1)
┌─────┐
│ age │
│ --- │
│ i64 │
╞═════╡
│ 17  │
│ 24  │
│ 63  │
│ 9   │
└─────┘

And using yet another dataframe, we can see that plus_one expands into five parallel expressions:

df = pl.DataFrame({"one": 1, "two": 2, "three": 3, "four": 4, "five": 5})
print(df.select(plus_one))

shape: (1, 5)
┌─────┬─────┬───────┬──────┬──────┐
│ one ┆ two ┆ three ┆ four ┆ five │
│ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  │
╞═════╪═════╪═══════╪══════╪══════╡
│ 2   ┆ 3   ┆ 4     ┆ 5    ┆ 6    │
└─────┴─────┴───────┴──────┴──────┘

Expression expansion with structs

So far, we’ve seen how expression expansion works with the function pl.col and with the module selectors. We only scratched the surface of the functionality that both these tools offer, but now we turn our attention to structs.

The struct datatype, which you can imagine as being similar to a Python dictionary, is composed of an arbitrary number of named fields. In the example below, the expression value_counts produces a struct with two fields:

df = pl.DataFrame(
    {
        "ballot_id": [6145, 176723, 345623, 77234, 75246],
        "best_movie": ["Cars", "Toy Story", "Toy Story", "Cars", "Toy Story"],
    }
)
votes = df.select(pl.col("best_movie").value_counts())
print(votes)

shape: (2, 1)
┌─────────────────┐
│ best_movie      │
│ ---             │
│ struct[2]       │
╞═════════════════╡
│ {"Cars",2}      │
│ {"Toy Story",3} │
└─────────────────┘

The expression value_counts produces structs because the vote counts and the movie names need to be paired together. Since value_counts is a single expression, it must produce a single column as a result, and a struct column lets it carry both values in that one column.

The fields of a struct can be accessed with the expression .struct.field, which expects the name of the field to extract. The vote count is in a field with the name "count", so we can extract it with .struct.field("count"):

print(votes.select(pl.col("best_movie").struct.field("count")))

shape: (2, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
│ 3     │
└───────┘

However, much like with the function pl.col, the expression .struct.field accepts an arbitrary number of field names to extract:

print(votes.select(pl.col("best_movie").struct.field("best_movie", "count")))

shape: (2, 2)
┌────────────┬───────┐
│ best_movie ┆ count │
│ ---        ┆ ---   │
│ str        ┆ u32   │
╞════════════╪═══════╡
│ Cars       ┆ 2     │
│ Toy Story  ┆ 3     │
└────────────┴───────┘

pl.col and .struct.field also support expression expansion through regular expression patterns. In addition to general regular expression syntax, both accept the special wildcard "*", which matches all names:

df = pl.DataFrame(
    {
        "ballot_id": [6145, 176723, 345623, 77234, 75246],
        "best_movie": ["Cars", "Toy Story", "Toy Story", "Cars", "Toy Story"],
    }
)
print(df.select(pl.col("*").name.to_uppercase()))

shape: (5, 2)
┌───────────┬────────────┐
│ BALLOT_ID ┆ BEST_MOVIE │
│ ---       ┆ ---        │
│ i64       ┆ str        │
╞═══════════╪════════════╡
│ 6145      ┆ Cars       │
│ 176723    ┆ Toy Story  │
│ 345623    ┆ Toy Story  │
│ 77234     ┆ Cars       │
│ 75246     ┆ Toy Story  │
└───────────┴────────────┘

print(votes.select(pl.col("best_movie").struct.field("*")))

shape: (2, 2)
┌────────────┬───────┐
│ best_movie ┆ count │
│ ---        ┆ ---   │
│ str        ┆ u32   │
╞════════════╪═══════╡
│ Cars       ┆ 2     │
│ Toy Story  ┆ 3     │
└────────────┴───────┘

The final piece of the puzzle is the fact that Polars provides aliases for pl.col("*") and .struct.field("*"). Instead of pl.col("*"), Polars prefers the more readable pl.all(). For .struct.field("*"), Polars prefers the alias .struct.unnest(). This is why the expression .struct.unnest seems to produce multiple columns from a single expression: it leverages expression expansion with a wildcard match. When applied to a struct column, Polars expands the expression into as many parallel expressions as there are fields, extracting each field into its own column.
