What does “calendar-aware” mean?
Computers are notorious for doing exactly what they are told, even when that is not what we want. When working with dates and times it is surprisingly easy to run into this kind of issue, so the Polars API introduces a mini specification language that makes it easier to express time intervals in a way that is natural to humans. In this article we will explore this mini specification language and some of the functions that make use of it.
First, you should understand the problem that you might trip on. If this sentence is being written on the 11th of December of 2024, what day will it be one month from now?
You can ask Python:
```python
import datetime as dt

print(
    dt.date(2024, 12, 11) + dt.timedelta(days=31)
)
```

```
2025-01-11
```
Python tells you that the 11th of January comes one month after the 11th of December. With the same expression you can determine that the 11th of February is the date that comes one month after the 11th of January:
```python
print(
    dt.date(2025, 1, 11) + dt.timedelta(days=31)
)
```

```
2025-02-11
```
But if you use the same expression to compute the date that comes one month after the 11th of February, you get a funny result:
```python
print(
    dt.date(2025, 2, 11) + dt.timedelta(days=31)
)
```

```
2025-03-14  # ?!
```
For you and me, the date that comes “one month after” the 11th of February is the 11th of March. However, there is no way to express this concept in Python’s `datetime` module. Instead, you have been adding 31 days to the dates because December and January have 31 days, but for this last expression you would need to add 28 days instead of 31 (because 2025 is not a leap year; otherwise you would need to add 29 days).
Thankfully, in Polars you can do calculations with calendar-aware durations by using the expression `.dt.offset_by`:
```python
import polars as pl

dates = pl.Series([
    dt.date(2024, 12, 11),
    dt.date(2025, 1, 11),
    dt.date(2025, 2, 11),
])
next_month = dates.dt.offset_by("1mo")
print(next_month)
assert (dates.dt.day() == next_month.dt.day()).all()
```
```
shape: (3,)
Series: '' [date]
[
	2025-01-11
	2025-02-11
	2025-03-11
]
```
Calendar-aware duration specification
The string `"1mo"` was used to represent the calendar-aware duration of 1 month. Polars supports other calendar-aware durations:
| String | Calendar-aware unit |
|---|---|
| `"d"` | day |
| `"w"` | week |
| `"mo"` | month |
| `"q"` | quarter |
| `"y"` | year |
When you use these five specifiers in date/time operations, they take into account edge cases like daylight saving offsets and whether a year is a leap year or not. The specifiers for days and months are the only ones that are strictly necessary, since the other three could be derived from `"d"` and `"mo"`:

- `"1w"` is the same as `"7d"`;
- `"1q"` is the same as `"3mo"`; and
- `"1y"` is the same as `"12mo"`.
Even so, supporting `"w"`, `"q"`, and `"y"` increases the readability of expressions that specify calendar-aware durations, especially since different units can be combined. For example, the string `"2y3mo10d"` means “2 years, 3 months, and 10 days”:
```python
print(
    pl.Series([dt.date(2024, 1, 1)]).dt.offset_by("2y3mo10d")
)
```

```
shape: (1,)
Series: '' [date]
[
	2026-04-11
]
```
For the sake of completeness, you should note that the specification that `.dt.offset_by` and other functions use supports other duration specifiers:
| String | Unit |
|---|---|
| `"ns"` | nanosecond |
| `"us"` | microsecond |
| `"ms"` | millisecond |
| `"s"` | second |
| `"m"` | minute |
| `"h"` | hour |
However, the specifiers listed above are not calendar-aware. For example, using `"24h"` (24 hours) is not equivalent to using `"1d"`. In fact, on days when daylight saving changes occur, `"1d"` can be equal to `"23h"` or `"25h"`, depending on the direction in which the daylight saving shift happens[^1]:
```python
dates = pl.Series(
    [
        dt.datetime(2024, 10, 26, 10),
        dt.datetime(2024, 3, 30, 10),
    ]
).to_frame("dates").select(
    pl.col("dates").dt.replace_time_zone("Europe/London")
)

print(
    dates.with_columns(
        (pl.col("dates") + pl.duration(hours=24)).alias("+24h"),
        pl.col("dates").dt.offset_by("1d").alias("+1d"),
    )
)
```
Now that you are familiar with the way Polars uses these specifiers, let us explore some functions that make use of them.
Generating date(time) ranges
You can use the Polars functions `date_range` and `datetime_range` to generate ranges of dates or datetimes, respectively. The function `date_range` accepts the five calendar-aware specifiers listed above, while the function `datetime_range` also accepts the six specifiers for the units between nanoseconds and hours. For these two functions, the string specification is used as the argument `interval`, which specifies the increments in which the date(time) values are generated.
For example, some taxpayers in Portugal have certain quarterly tax obligations that have to be met by the date that is 1 month and 2 weeks after the end of the quarter. To determine these deadlines, you can start by computing the beginning of the upcoming quarters for 2025 using `date_range`:
```python
beginnings = pl.date_range(
    start=dt.date(2025, 1, 1),
    end=dt.date(2025, 12, 31),
    interval="1q",
    eager=True,  # Evaluate immediately to return a series.
)
print(beginnings)
```
```
shape: (4,)
Series: 'literal' [date]
[
	2025-01-01
	2025-04-01
	2025-07-01
	2025-10-01
]
```
Finally, you can compute the deadlines for the tax obligations by using the expression `.dt.offset_by`:
```python
print(
    beginnings.dt.offset_by("1q1mo2w")
)
```
```
shape: (4,)
Series: 'literal' [date]
[
	2025-05-15
	2025-08-15
	2025-11-15
	2026-02-15
]
```
Grouping data dynamically by date(time)s
Polars supports standard aggregation operations via the `group_by` context. Polars also supports a more flexible type of aggregation, for when you want to aggregate rows based on dates or datetimes, via the function `group_by_dynamic`.
As a quick example, you can load some Apple stock data and check how many data points you have for each year:
```python
apple_df = pl.read_csv(
    "https://raw.githubusercontent.com/pola-rs/polars-static/refs/heads/master/data/appleStock.csv",
    try_parse_dates=True,
)
print(apple_df.head())
```
```
shape: (5, 2)
┌────────────┬───────┐
│ Date       ┆ Close │
│ ---        ┆ ---   │
│ date       ┆ f64   │
╞════════════╪═══════╡
│ 1981-02-23 ┆ 24.62 │
│ 1981-05-06 ┆ 27.38 │
│ 1981-05-18 ┆ 28.0  │
│ 1981-09-25 ┆ 14.25 │
│ 1982-07-08 ┆ 11.0  │
└────────────┴───────┘
```
```python
print(
    apple_df
    .group_by_dynamic("Date", every="1y").agg(pl.len())
    .select(pl.col("Date").dt.year().alias("year"), pl.col("len"))
)
```
```
shape: (34, 2)
┌──────┬─────┐
│ year ┆ len │
│ ---  ┆ --- │
│ i32  ┆ u32 │
╞══════╪═════╡
│ 1981 ┆ 4   │
│ 1982 ┆ 1   │
│ 1983 ┆ 3   │
│ 1984 ┆ 3   │
│ 1985 ┆ 3   │
│ …    ┆ …   │
│ 2010 ┆ 2   │
│ 2011 ┆ 2   │
│ 2012 ┆ 2   │
│ 2013 ┆ 2   │
│ 2014 ┆ 1   │
└──────┴─────┘
```
By default, `group_by_dynamic` creates contiguous, non-overlapping windows of length given by the parameter `every`. In the example above, those windows had a length of one year, as specified by the string `"1y"`. The length of the window and the periodicity with which different windows are created can be specified separately if you provide values for the parameters `every` and `period`.
The snippet below creates 10-year windows starting every 5 years:
```python
decades = apple_df.group_by_dynamic("Date", every="5y", period="10y").agg(pl.len())
print(decades)
```
```
shape: (7, 2)
┌────────────┬─────┐
│ Date       ┆ len │
│ ---        ┆ --- │
│ date       ┆ u32 │
╞════════════╪═════╡
│ 1980-01-01 ┆ 28  │
│ 1985-01-01 ┆ 33  │
│ 1990-01-01 ┆ 35  │
│ 1995-01-01 ┆ 36  │
│ 2000-01-01 ┆ 28  │
│ 2005-01-01 ┆ 20  │
│ 2010-01-01 ┆ 9   │
└────────────┴─────┘
```
There are two important things to note here. First, note how each row shows the number of data points that fall within a given decade, but because the windows overlap, the total sum is larger than the total number of original data points:
```python
print(apple_df.shape)  # (100, 2)
print(decades["len"].sum())  # 189
```
Second, Polars sets the boundaries of the windows based on the earliest and latest data points it finds, but by default it aligns the boundaries neatly. In this example, the earliest data point is from 1981, but Polars aligns the boundary of the first window with the start of the decade, in 1980.
If you prefer to align the first window with the first data point, you can do so by specifying `start_by="datapoint"`:
In case the heuristic that Polars uses to align window boundaries doesn’t suit your needs and you want greater control over the start of the first window, you can specify the parameter `offset`:
There is more to the function `group_by_dynamic`, so if you want to learn more about it be sure to check the API reference page on `group_by_dynamic`.
Time interval-based computations
The final function that you will learn about in this article is `rolling`. The function `rolling` is somewhat similar to `group_by_dynamic`, in that you can use it to aggregate values based on time windows. The key difference is that the function `rolling` always creates one window per data point. In other words, you have no control over the periodicity of the windows.
Like with `group_by_dynamic`, a single data point may belong to multiple windows:
```python
decades = apple_df.rolling("Date", period="10y").agg(pl.len())
print(decades)
print(decades["len"].sum())  # 2793
```
```
shape: (100, 2)
┌────────────┬─────┐
│ Date       ┆ len │
│ ---        ┆ --- │
│ date       ┆ u32 │
╞════════════╪═════╡
│ 1981-02-23 ┆ 1   │
│ 1981-05-06 ┆ 2   │
│ 1981-05-18 ┆ 3   │
│ 1981-09-25 ┆ 4   │
│ 1982-07-08 ┆ 5   │
│ …          ┆ …   │
│ 2012-05-16 ┆ 22  │
│ 2012-12-04 ┆ 21  │
│ 2013-07-05 ┆ 22  │
│ 2013-11-07 ┆ 23  │
│ 2014-02-25 ┆ 23  │
└────────────┴─────┘
2793
```
In this case, the windows overlap more in the example with `rolling` than in the example with `group_by_dynamic`, where the total length of all windows was 189.
Remember that you cannot manipulate the total number of windows created by `rolling`. The only thing you can control is the length of the time window. If you fix the length of the time window, in `group_by_dynamic` you can influence the number of windows created by playing around with the value of the parameter `every`. If we set `every` to a value that is much smaller than what we used in our previous examples, we get more overlapping windows in `group_by_dynamic`:
```python
overlaps = apple_df.group_by_dynamic("Date", every="3d", period="10y").agg(pl.len())
#                                                  ^^^^ 3 DAYS
print(overlaps["len"].sum())  # 102893
```
You can also use `rolling` as an expression, as shown in the next example. Using the Apple stock data from before, you can use the expression `rolling` to compute by how much the stock price was above the mean price over the previous 3 years:
```python
print(
    apple_df.with_columns(
        (
            pl.col("Close").mean().rolling(index_column="Date", period="3y")
            - pl.col("Close")
        ).alias("Diff_to_3y_mean")
    )
)
```
```
shape: (100, 3)
┌────────────┬────────┬─────────────────┐
│ Date       ┆ Close  ┆ Diff_to_3y_mean │
│ ---        ┆ ---    ┆ ---             │
│ date       ┆ f64    ┆ f64             │
╞════════════╪════════╪═════════════════╡
│ 1981-02-23 ┆ 24.62  ┆ 0.0             │
│ 1981-05-06 ┆ 27.38  ┆ -1.38           │
│ 1981-05-18 ┆ 28.0   ┆ -1.333333       │
│ 1981-09-25 ┆ 14.25  ┆ 9.3125          │
│ 1982-07-08 ┆ 11.0   ┆ 10.05           │
│ …          ┆ …      ┆ …               │
│ 2012-05-16 ┆ 546.08 ┆ -211.831667     │
│ 2012-12-04 ┆ 575.85 ┆ -173.365        │
│ 2013-07-05 ┆ 417.42 ┆ 15.306667       │
│ 2013-11-07 ┆ 512.49 ┆ -68.368571      │
│ 2014-02-25 ┆ 522.06 ┆ -49.152857      │
└────────────┴────────┴─────────────────┘
```
The default behaviour of aligning the end of the window with the datapoint can be customised by passing a value to the parameter `offset`. This value is added to the datapoint to determine the beginning of the window, which means that the default value for the parameter `offset` is `-period`. As an example, if you wanted your windows to be centered around the datapoints, you would specify an `offset` equal to minus half the period:
By using sufficiently large values for the parameter `offset`, you can also create windows that do not contain the respective datapoint, either because the windows end before the respective datapoints, or because they start after them.
Conclusion
Several Polars functions and expressions support a specification for time durations that is flexible and that mimics the way humans reason about time periods. In this article you have taken a look at `date_range`, `offset_by`, `group_by_dynamic`, and `rolling`, but you can find more by exploring the API reference documentation.
Note: ⚠️ This blog article was written with Polars 1.17.1.
Footnotes

[^1]: It is up to you to determine in which row the addition of `"1d"` is equivalent to adding 23 hours and in which row it is equivalent to adding 25 hours.