
Exploring calendar-aware date operations in Polars

Mon, 23 Dec 2024

What does “calendar-aware” mean?

Computers are notorious for doing exactly what they are told, even when that is not what we want. When working with dates and times, it is surprisingly easy to run into an issue like this, so the Polars API introduces a mini specification language that makes it easier to express time intervals in a way that is natural to humans. In this article we will explore this mini specification language and some functions that make use of it.

First, you may want to understand the problem you might trip on. If this sentence is being written on the 11th of December of 2024, what day will it be one month from now?

You can ask Python:

import datetime as dt

print(
    dt.date(2024, 12, 11) + dt.timedelta(days=31)
)
2025-01-11

Python tells you that the 11th of January comes one month after the 11th of December. With the same expression you can determine that the 11th of February is the date that comes one month after the 11th of January:

print(
    dt.date(2025, 1, 11) + dt.timedelta(days=31)
)
2025-02-11

But if you use the same expression to compute the date that comes one month after the 11th of February, you get a funny result:

print(
    dt.date(2025, 2, 11) + dt.timedelta(days=31)
)
2025-03-14  # ?!

For you and me, the date that comes “one month after” the 11th of February is the 11th of March. However, there is no way to express this concept in Python’s datetime module. Instead, you have been adding 31 days to the dates because December and January each have 31 days, but for this last expression you would need to add 28 days instead of 31 (because 2025 is not a leap year, otherwise you would need to add 29 days).
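With only the standard library, you would have to compute the month length yourself, for example with calendar.monthrange. The helper add_one_month below is a hypothetical sketch, not something datetime provides, and it is naive: it breaks for days late in the month, such as the 31st of January.

```python
import calendar
import datetime as dt

def add_one_month(d: dt.date) -> dt.date:
    # The number of days in the month we start from determines the jump size.
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    return d + dt.timedelta(days=days_in_month)

print(add_one_month(dt.date(2024, 12, 11)))  # 2025-01-11
print(add_one_month(dt.date(2025, 2, 11)))   # 2025-03-11
```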

Thankfully, in Polars, you can do calculations with calendar-aware durations by using the expression .dt.offset_by:

import polars as pl

dates = pl.Series([
    dt.date(2024, 12, 11),
    dt.date(2025, 1, 11),
    dt.date(2025, 2, 11),
])

next_month = dates.dt.offset_by("1mo")
print(next_month)
assert (dates.dt.day() == next_month.dt.day()).all()
shape: (3,)
Series: '' [date]
[
    2025-01-11
    2025-02-11
    2025-03-11
]

Calendar-aware duration specification

The string "1mo" was used to represent the calendar-aware duration of 1 month. Polars supports other calendar-aware durations:

String  Calendar-aware unit
"d"     day
"w"     week
"mo"    month
"q"     quarter
"y"     year

When you use these five specifiers in date/time operations, they take into account edge cases like daylight saving time shifts and whether a year is a leap year. Strictly speaking, only the specifiers for days and months are necessary, since the other three can be derived from "d" and "mo":

  • "1w" is the same as "7d";
  • "1q" is the same as "3mo"; and
  • "1y" is the same as "12mo".

Even so, supporting "w", "q", and "y" increases the readability of expressions that specify calendar-aware durations, especially since different units can be combined. For example, the string "2y3mo10d" means “2 years, 3 months, and 10 days”:

print(
    pl.Series([dt.date(2024, 1, 1)]).dt.offset_by("2y3mo10d")
)
shape: (1,)
Series: '' [date]
[
    2026-04-11
]

For the sake of completeness, note that the specification that .dt.offset_by and other functions use supports further duration specifiers:

String  Unit
"ns"    nanosecond
"us"    microsecond
"ms"    millisecond
"s"     second
"m"     minute
"h"     hour

However, the specifiers listed above are not calendar-aware. For example, using "24h" (24 hours) is not equivalent to using "1d". In fact, on days when daylight saving time changes occur, "1d" can be equal to "23h" or "25h", depending on the direction in which the shift happens¹:

dates = pl.Series(
    [
        dt.datetime(2024, 10, 26, 10),
        dt.datetime(2024, 3, 30, 10),
    ]
).to_frame("dates").select(pl.col("dates").dt.replace_time_zone("Europe/London"))

print(
    dates.with_columns(
        (pl.col("dates") + pl.duration(hours=24)).alias("+24h"),
        pl.col("dates").dt.offset_by("1d").alias("+1d"),
    )
)

Now that you are familiar with the way Polars uses these specifiers, let us explore some functions that make use of them.

Generating date(time) ranges

You can use the Polars functions date_range and datetime_range to generate ranges of dates or datetimes, respectively. The function date_range accepts the five calendar-aware specifiers listed above, while the function datetime_range also accepts the six specifiers for the units between nanoseconds and hours. For these two functions, the string specification is used as the argument interval, which specifies the increments in which the date(time) values are generated.

For example, some taxpayers in Portugal have certain quarterly tax obligations that have to be met by the date that is 1 month and 2 weeks after the end of the quarter. To determine these deadlines, you can start by computing the beginning of the upcoming quarters for 2025 using date_range:

beginnings = pl.date_range(
    start=dt.date(2025, 1, 1),
    end=dt.date(2025, 12, 31),
    interval="1q",
    eager=True,  # Evaluate immediately to return a series.
)
print(beginnings)
shape: (4,)
Series: 'literal' [date]
[
    2025-01-01
    2025-04-01
    2025-07-01
    2025-10-01
]

Finally, you can compute the deadlines for the tax obligations by using the expression .dt.offset_by:

print(
    beginnings.dt.offset_by("1q1mo2w")
)
shape: (4,)
Series: 'literal' [date]
[
    2025-05-15
    2025-08-15
    2025-11-15
    2026-02-15
]

Grouping data dynamically by date(time)s

Polars supports a standard aggregation operation via the group_by context. Polars also supports a more flexible type of aggregation for when you want to aggregate rows based on dates or datetimes via the function group_by_dynamic.

As a quick example, you can load some Apple stock data and check how many data points you have for each year:

apple_df = pl.read_csv(
    "https://raw.githubusercontent.com/pola-rs/polars-static/refs/heads/master/data/appleStock.csv",
    try_parse_dates=True,
)
print(apple_df.head())
shape: (5, 2)
┌────────────┬───────┐
│ Date       ┆ Close │
│ ---        ┆ ---   │
│ date       ┆ f64   │
╞════════════╪═══════╡
│ 1981-02-23 ┆ 24.62 │
│ 1981-05-06 ┆ 27.38 │
│ 1981-05-18 ┆ 28.0  │
│ 1981-09-25 ┆ 14.25 │
│ 1982-07-08 ┆ 11.0  │
└────────────┴───────┘
print(
    apple_df
    .group_by_dynamic("Date", every="1y").agg(pl.len())
    .select(pl.col("Date").dt.year().alias("year"), pl.col("len"))
)
shape: (34, 2)
┌──────┬─────┐
│ year ┆ len │
│ ---  ┆ --- │
│ i32  ┆ u32 │
╞══════╪═════╡
│ 1981 ┆ 4   │
│ 1982 ┆ 1   │
│ 1983 ┆ 3   │
│ 1984 ┆ 3   │
│ 1985 ┆ 3   │
│ …    ┆ …   │
│ 2010 ┆ 2   │
│ 2011 ┆ 2   │
│ 2012 ┆ 2   │
│ 2013 ┆ 2   │
│ 2014 ┆ 1   │
└──────┴─────┘

By default, group_by_dynamic creates contiguous, non-overlapping windows of length given by the parameter every. In the example above, those windows had a length of one year, as specified by the string "1y". The length of the window and the periodicity with which different windows are created can be specified separately if you provide values for the parameters every and period.

The snippet below creates 10-year windows starting every 5 years:

decades = apple_df.group_by_dynamic("Date", every="5y", period="10y").agg(pl.len())
print(decades)
shape: (7, 2)
┌────────────┬─────┐
│ Date       ┆ len │
│ ---        ┆ --- │
│ date       ┆ u32 │
╞════════════╪═════╡
│ 1980-01-01 ┆ 28  │
│ 1985-01-01 ┆ 33  │
│ 1990-01-01 ┆ 35  │
│ 1995-01-01 ┆ 36  │
│ 2000-01-01 ┆ 28  │
│ 2005-01-01 ┆ 20  │
│ 2010-01-01 ┆ 9   │
└────────────┴─────┘

There are two important things to note here. First, note how each row shows the number of data points that fall within a given decade, but because the windows overlap, the total sum is larger than the total number of original data points:

print(apple_df.shape)  # (100, 2)
print(decades["len"].sum())  # 189

Second, Polars sets the boundaries of the windows based on the earliest and latest data points it finds, but by default it aligns the boundaries neatly. In this example, the earliest data point is from 1981, but Polars aligns the boundary of the first window with the start of the decade in 1980:

Diagram showing how the window boundaries align neatly with the decades and the decade halfpoints, 1980, 1985, 1990, etc, although the first datapoint is from 1981.

If you prefer to align the first window with the first data point, you can do so by specifying start_by="datapoint":

Diagram showing how the window boundaries shift when we set the parameter start_by to the value datapoint, making it so that the very first window starts exactly on the first datapoint.

In case the heuristic that Polars uses to align window boundaries doesn’t suit your needs and you want greater control over the start of the first window, you can specify the parameter offset:

Diagram showing how the window boundaries can be adjusted by specifying the parameter offset.

There is more to the function group_by_dynamic, so if you want to learn more about it be sure to check the API reference page on group_by_dynamic.

Time interval-based computations

The final function that you will learn about in this article is the function rolling. The function rolling is somewhat similar to group_by_dynamic, in that you can use it to aggregate values based on time windows. The key difference is that the function rolling always creates one window per data point:

Diagram showing how the window boundaries align with the data points and how each data point has a respective window.

In other words, you have no control over the periodicity of the windows.

Like with group_by_dynamic, a single data point may belong to multiple windows:

decades = apple_df.rolling("Date", period="10y").agg(pl.len())
print(decades)
shape: (100, 2)
┌────────────┬─────┐
│ Date       ┆ len │
│ ---        ┆ --- │
│ date       ┆ u32 │
╞════════════╪═════╡
│ 1981-02-23 ┆ 1   │
│ 1981-05-06 ┆ 2   │
│ 1981-05-18 ┆ 3   │
│ 1981-09-25 ┆ 4   │
│ 1982-07-08 ┆ 5   │
│ …          ┆ …   │
│ 2012-05-16 ┆ 22  │
│ 2012-12-04 ┆ 21  │
│ 2013-07-05 ┆ 22  │
│ 2013-11-07 ┆ 23  │
│ 2014-02-25 ┆ 23  │
└────────────┴─────┘
print(decades["len"].sum())  # 2793

In this case, the windows overlap much more in the example with rolling than in the example with group_by_dynamic, where the total length of all windows was 189.

Remember that you cannot manipulate the total number of windows created by rolling; the only thing you can control is the length of the time window. With group_by_dynamic, for a fixed window length, you can influence the number of windows created by playing around with the value of the parameter every. If you set every to a value much smaller than in the previous examples, you get even more overlapping windows:

overlaps = apple_df.group_by_dynamic("Date", every="3d", period="10y").agg(pl.len())
#                                                  ^^^^ 3 DAYS
print(overlaps["len"].sum())  # 102893

You can also use rolling as an expression, as shown in the next example. Using the Apple stock data from before, you can use the expression rolling to compute the difference between the mean price over the previous 3 years and the current stock price:

print(
    apple_df.with_columns(
        (
            pl.col("Close").mean().rolling(index_column="Date", period="3y")
            - pl.col("Close")
        ).alias("Diff_to_3y_mean")
    )
)
shape: (100, 3)
┌────────────┬────────┬─────────────────┐
│ Date       ┆ Close  ┆ Diff_to_3y_mean │
│ ---        ┆ ---    ┆ ---             │
│ date       ┆ f64    ┆ f64             │
╞════════════╪════════╪═════════════════╡
│ 1981-02-23 ┆ 24.62  ┆ 0.0             │
│ 1981-05-06 ┆ 27.38  ┆ -1.38           │
│ 1981-05-18 ┆ 28.0   ┆ -1.333333       │
│ 1981-09-25 ┆ 14.25  ┆ 9.3125          │
│ 1982-07-08 ┆ 11.0   ┆ 10.05           │
│ …          ┆ …      ┆ …               │
│ 2012-05-16 ┆ 546.08 ┆ -211.831667     │
│ 2012-12-04 ┆ 575.85 ┆ -173.365        │
│ 2013-07-05 ┆ 417.42 ┆ 15.306667       │
│ 2013-11-07 ┆ 512.49 ┆ -68.368571      │
│ 2014-02-25 ┆ 522.06 ┆ -49.152857      │
└────────────┴────────┴─────────────────┘

The default behaviour of aligning the end of the window with the data point can be customised by passing a value to the parameter offset. This value is added to the data point to determine the beginning of the window, which means that the default value of the parameter offset is -period.

As an example, if you wanted your windows to be centred around the data points, you would specify an offset of minus half the period:

Diagram showing how the window boundaries align with the data points when the offset is taken into account, allowing you to shift the windows.

By using sufficiently large values for the parameter offset, you can also create windows that do not contain the respective datapoint, either because the windows end before the respective datapoints, or because they start after.

Conclusion

Several Polars functions and expressions support a specification for time durations that is flexible and that mimics the way humans reason about time periods. In this article you have taken a look at date_range, offset_by, group_by_dynamic, and rolling, but you can find more by exploring the API reference documentation.

Note: ⚠️ This blog article was written with Polars 1.17.1.

Footnotes

  1. It is up to you to determine in what row the addition of "1d" is equivalent to adding 23 hours and which one is equivalent to adding 25 hours.
