In a social media post, we asked you a question: doesn’t the expression .struct.unnest break one of
Polars’ most fundamental principles, namely that one expression must produce exactly one column as
output? This article gives a detailed answer to that question, but we can spoil the final conclusion
for you: everything will be fine in the end.
TL;DR:
If you’re in a hurry, the short answer is that .struct.unnest uses expression expansion. Through
this feature, .struct.unnest expands into one expression per field of your struct, which is why a
single expression appears to produce multiple columns as output.
The problem
Polars queries obey a simple rule: one expression in your query produces one, and only one, column in the output:
import polars as pl
df = pl.DataFrame(
    {
        "name": ["Anne", "Abe", "Anastasia", "Anton"],
        "age": [16, 23, 62, 8],
    }
)
print(
    df.select(
        pl.col("name").str.len_chars().alias("name_length"),  # 1
        (pl.col("age") > 18).alias("is_adult"),  # 2
        (2024 - pl.col("age")).alias("birthyear"),  # 3
        (pl.col("name").str.len_chars() * pl.col("age")).alias("hun?"),  # 4
    )
)
shape: (4, 4)
  # 1           # 2        # 3         # 4
┌─────────────┬──────────┬───────────┬──────┐
│ name_length ┆ is_adult ┆ birthyear ┆ hun? │
│ ---         ┆ ---      ┆ ---       ┆ ---  │
│ u32         ┆ bool     ┆ i64       ┆ i64  │
╞═════════════╪══════════╪═══════════╪══════╡
│ 4           ┆ false    ┆ 2008      ┆ 64   │
│ 3           ┆ true     ┆ 2001      ┆ 69   │
│ 9           ┆ true     ┆ 1962      ┆ 558  │
│ 5           ┆ false    ┆ 2016      ┆ 40   │
└─────────────┴──────────┴───────────┴──────┘
The “one expression results in one column” rule is a design decision that makes it easier to reason
about the schema of the dataframe that a complex query produces. However, when using the expression
.struct.unnest, it looks like a single expression is producing multiple columns:
df = pl.DataFrame(
    {
        "structs": [
            {"a": 1, "b": 2, "c": 3},
            {"a": 4, "b": 5, "c": 6},
            {"a": 7, "b": 8, "c": 9},
        ]
    }
)
print(
    df.select(
        pl.col("structs").struct.unnest(),  # 1
    )
)
shape: (3, 3)
  # 1   # 2   # 3
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 7   ┆ 8   ┆ 9   │
└─────┴─────┴─────┘
The rule that one expression must produce one column seems to be at odds with the example above, but that’s not really the case. To understand why, you need to learn about expression expansion.
Expression expansion
Expression expansion is a Polars feature that lets you write short but powerful expressions whilst avoiding structural repetition. When you want to write the same expression for multiple columns, you can write a single expression that Polars will expand into multiple parallel expressions.
This snippet shows two expressions that are structurally identical:
df = pl.DataFrame(
    {
        "first_name": ["Anne", "Abe", "Anastasia", "Anton"],
        "last_name": ["Holmes", "Watson", "Kent", "Wayne"],
    }
)
df.select(
    pl.col("first_name").str.len_chars(),  # 1.
    pl.col("last_name").str.len_chars(),  # 2.
)
We can use expression expansion with the function pl.col to avoid the repetition:
df.select(
    pl.col("first_name", "last_name").str.len_chars(),
)
In this particular instance, we are using a type of expression expansion in which we know,
beforehand, how many parallel expressions Polars will create. That’s because we explicitly listed,
in the call to pl.col, all the columns for which we wanted to compute the string length.
Expression expansion can also be used in such a way that the number of parallel expressions that Polars executes will depend on the schema of the dataframe the expression is used on. As an example of this, consider the following expression:
import polars.selectors as cs
plus_one = cs.integer() + 1
The expression plus_one uses Polars’ column selectors to refer to “all” integer columns. However,
since an expression is just a representation of a computation, the expression plus_one needs a
Polars context to determine exactly which columns are integer columns. In the context below, the
dataframe has no integer columns, so the expression refers to zero columns:
df = pl.DataFrame(
    {
        "first_name": ["Anne", "Abe", "Anastasia", "Anton"],
        "last_name": ["Holmes", "Watson", "Kent", "Wayne"],
    }
)
print(df.select(plus_one))
shape: (0, 0)
┌┐
╞╡
└┘
If we go back to the first dataframe used in this article, then plus_one expands into a single
column:
df = pl.DataFrame(
    {
        "name": ["Anne", "Abe", "Anastasia", "Anton"],
        "age": [16, 23, 62, 8],
    }
)
print(df.select(plus_one))
shape: (4, 1)
┌─────┐
│ age │
│ --- │
│ i64 │
╞═════╡
│ 17  │
│ 24  │
│ 63  │
│ 9   │
└─────┘
And using yet another dataframe, we can see that plus_one expands into five parallel expressions:
df = pl.DataFrame({"one": 1, "two": 2, "three": 3, "four": 4, "five": 5})
print(df.select(plus_one))
shape: (1, 5)
┌─────┬─────┬───────┬──────┬──────┐
│ one ┆ two ┆ three ┆ four ┆ five │
│ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  │
╞═════╪═════╪═══════╪══════╪══════╡
│ 2   ┆ 3   ┆ 4     ┆ 5    ┆ 6    │
└─────┴─────┴───────┴──────┴──────┘
Expression expansion with structs
So far, we’ve seen how expression expansion works with the function pl.col and with the module
selectors. We only scratched the surface of the functionality that both these tools offer, but now
we turn our attention to structs.
The struct datatype, which you can think of as being similar to a Python dictionary, is composed of
an arbitrary number of named fields. In the example below, the expression value_counts produces a
struct with two fields:
df = pl.DataFrame(
    {
        "ballot_id": [6145, 176723, 345623, 77234, 75246],
        "best_movie": ["Cars", "Toy Story", "Toy Story", "Cars", "Toy Story"],
    }
)
votes = df.select(pl.col("best_movie").value_counts())
print(votes)
shape: (2, 1)
┌─────────────────┐
│ best_movie      │
│ ---             │
│ struct[2]       │
╞═════════════════╡
│ {"Cars",2}      │
│ {"Toy Story",3} │
└─────────────────┘
The expression value_counts produces structs because the movie names and the vote counts need to be
paired together. Since value_counts is a single expression, it must produce a single column as a
result, and a struct column is how it packs both pieces of information into that one column.
The fields of a struct can be accessed with the expression .struct.field, which expects the name
of the field to extract. The vote count is in a field with the name "count", so we can extract it
with .struct.field("count"):
print(votes.select(pl.col("best_movie").struct.field("count")))
shape: (2, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
│ 3     │
└───────┘
However, much like with the function pl.col, the expression .struct.field accepts an arbitrary
number of field names to extract:
print(votes.select(pl.col("best_movie").struct.field("best_movie", "count")))
shape: (2, 2)
┌────────────┬───────┐
│ best_movie ┆ count │
│ ---        ┆ ---   │
│ str        ┆ u32   │
╞════════════╪═══════╡
│ Cars       ┆ 2     │
│ Toy Story  ┆ 3     │
└────────────┴───────┘
pl.col and .struct.field also support expression expansion through regex patterns. This pattern
matching supports regular expression syntax in general (pl.col treats a string as a regular
expression when it starts with ^ and ends with $), and it also supports a special wildcard argument
"*" that matches all names:
df = pl.DataFrame(
    {
        "ballot_id": [6145, 176723, 345623, 77234, 75246],
        "best_movie": ["Cars", "Toy Story", "Toy Story", "Cars", "Toy Story"],
    }
)
print(df.select(pl.col("*").name.to_uppercase()))
shape: (5, 2)
┌───────────┬────────────┐
│ BALLOT_ID ┆ BEST_MOVIE │
│ ---       ┆ ---        │
│ i64       ┆ str        │
╞═══════════╪════════════╡
│ 6145      ┆ Cars       │
│ 176723    ┆ Toy Story  │
│ 345623    ┆ Toy Story  │
│ 77234     ┆ Cars       │
│ 75246     ┆ Toy Story  │
└───────────┴────────────┘
print(votes.select(pl.col("best_movie").struct.field("*")))
shape: (2, 2)
┌────────────┬───────┐
│ best_movie ┆ count │
│ ---        ┆ ---   │
│ str        ┆ u32   │
╞════════════╪═══════╡
│ Cars       ┆ 2     │
│ Toy Story  ┆ 3     │
└────────────┴───────┘
The final piece of the puzzle is the fact that Polars provides aliases for pl.col("*") and
.struct.field("*"). Instead of pl.col("*"), Polars prefers the more readable pl.all(). For
.struct.field("*"), Polars prefers the alias .struct.unnest(). Hence, the expression .struct.unnest
seemingly produces multiple columns from a single expression because it leverages expression
expansion with a wildcard match. When applied to a struct column, Polars will expand the expression
into as many parallel expressions as there are fields, extracting each field into its own column.
Further reading
There is a lot more to expression expansion than this article covers, so we’d like to invite you to
head to the user guide and read the section on expression expansion. There, you will learn more
about expression expansion with the function pl.col, how to exclude columns from a pattern, how to
rename columns within expression expansion, the functionality that the module selectors offers, and
how to programmatically generate expressions.