Fluent, not native: agents translating pandas to Polars

By Thijs Nieuwdorp on Thu, 25 Jun 2026

For a growing number of developers, the first Polars they ever see was written by a language model. Some just ask a model for advice on how to tackle certain transformations, while others haven’t written a Polars query themselves in months.

One of the leading business cases is that LLMs have made migration a quick win. A developer feeds in a pandas script, gets back Polars, and verifies that the output matches. The result is typically a pipeline that runs an order of magnitude faster on smaller, cheaper machines.¹ With a lower cost of translation, the return on investment is achieved even sooner.

The success of this business case depends on the quality of the translation an LLM can produce. The Polars API is expressive, and a good translation should reach for the construct that states each intent most directly, not merely code that runs. We wanted to know whether current models do that, or whether they fall back on pandas-shaped habits that happen to be valid Polars.

We had Claude Opus 4.8 translate a pandas corpus to Polars and checked the result for idiomatic use of Polars. Each case was translated in a fresh session: one shot. Most translations ran, the outputs matched, and the code read as idiomatic Polars. Two structural patterns remained: cases where the model reproduces the existing pandas structure rather than reaching for the Polars construct that expresses the intent directly. We packaged fixes into a skill, re-ran the exercise with it loaded, and measured the difference. The skill nudges the model in the right direction on those patterns, but it is not a magic bullet. Fable 5, Anthropic’s newest model, is not available to us outside the United States, so we could not include it.

The setup

To measure where things stand today, we translated three bodies of pandas code:

the 22 PDS-H benchmark queries (a benchmark derived from TPC-H), using the pandas implementations from polars-benchmark. Because equivalent Polars queries already exist, providing only the pandas queries as input gives us an easily verifiable benchmark.
an 11-case EDA corpus drawn from three public pandas teaching repositories², covering window functions, resampling, as-of joins, reshaping, string processing, binning, and missing data. These advanced patterns are not found in the PDS-H benchmark.
two real-world ETL notebooks from Kaggle, used for initial manual exploration. These provide larger, real-life examples to compare against.

Each case was translated by Claude Opus 4.8 in a fresh session: one shot, with a prompt that explicitly asked it to translate idiomatically to Polars. We verified every translation by comparing its output against the output of the original pandas code (and for PDS-H against the reference answers) with polars.testing.assert_frame_equal. We ran the same corpus through Claude Sonnet 4.6 for comparison. Those results appear in the appendix.

Translation quality today

A year ago, LLMs frequently mixed the pandas and Polars APIs, reaching for deprecated methods like groupby and with_column, and producing code that did not run. Today, 31 of 33 Opus 4.8 translations ran correctly and the verified outputs matched. The two exceptions were minor: a missing import in one case and a column-reference error in another. Opus attempts to translate to Polars more idiomatically than earlier models did, restructuring rather than translating line by line, although it occasionally introduces small errors.

The patterns that were problematic in the past all translated cleanly:

Basics like group_by, with_columns, and is_in land correctly.
Semi and anti joins are used where they fit.
Conditional aggregations stay native. map_elements did not appear in any of the PDS-H or EDA translations.
The lazy/eager distinction is respected and collect() appears where it should. In the lazy API you build up a query and Polars only runs it, with optimizations applied, once you call collect().
List-wrapped arguments, Python-side scalar arithmetic, and map_elements fallbacks did not appear. These were common tells in earlier models; the Sonnet 4.6 results in the appendix show what that looked like.

The patterns that remain are subtler. The code runs and the output is correct, but in a small number of cases the solution is structured the way the original pandas code was structured. The model knows the Polars API well, but does not always recognize when a different construct expresses the intent more directly.

Pattern 1: Translating structure instead of intent

This is the clearest case we found, and the hardest to eliminate. In this example, the source notebook annotates per-minute stock prices with 5-minute window aggregates. The notebook uses a common pandas recipe: create a second frame that resamples to 5-minute windows, then merge it back with merge_asof (an as-of join matches each row to the nearest preceding row in the other frame by key, here by timestamp).

The LLM translated that structure faithfully, using group_by_dynamic (Polars’ resample-style time bucketing) and an as-of join:

# The LLM's Polars translation
five_min = minute.group_by_dynamic("datetime", every="5m").agg(
    pl.col("close").mean().alias("window_price"),
    pl.col("volume").sum().alias("window_volume"),
)

result = minute.join_asof(
    five_min,
    on="datetime",
    strategy="backward",
    tolerance=timedelta(minutes=5),
)

# What idiomatic Polars looks like
result = minute.with_columns(
    pl.col("close")
    .mean()
    .over(pl.col("datetime").dt.truncate("5m"))
    .alias("window_price"),
    pl.col("volume")
    .sum()
    .over(pl.col("datetime").dt.truncate("5m"))
    .alias("window_volume"),
)

dt.truncate("5m") rounds each timestamp down to the start of its 5-minute bucket, and .over() aggregates per bucket. The actual intent is “annotate each row with an aggregate over its 5-minute bucket”, and .over() expresses exactly that in a single context.

.over() is Polars’ window function

It is the equivalent of SQL’s OVER (PARTITION BY ...), which computes a group-level expression and broadcasts the result to each row in that group, all inside with_columns. Polars separates window functions from aggregations deliberately: group_by().agg() reduces the frame to one row per group, while .over() enriches each row of the original frame with a group-level value. See the window functions guide for the full set of mapping strategies.

Side-by-side comparison: .over() preserves the frame shape and broadcasts the group mean to every row, while group_by().agg() reduces the frame to one row per group

Interestingly, when the pandas source uses transform, the model maps it to .over() without trouble. pandas can express this very case that way, in one step with groupby(pd.Grouper(key="datetime", freq="5min")).transform(...), and a notebook written like that would most likely have been translated to .over() directly. It is when the source builds an aggregate frame and joins it back that the model translates the workaround instead of the intent. This pattern appeared in 3 of 33 cases.
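
For reference, the one-step pandas form mentioned above might look like the following sketch. The data is invented, and the column names mirror the earlier example rather than the actual notebook.

```python
import pandas as pd

# Hypothetical per-minute prices: 12 rows starting at 09:30.
minute = pd.DataFrame(
    {
        "datetime": pd.date_range("2024-01-02 09:30", periods=12, freq="min"),
        "close": [100.0 + i for i in range(12)],
        "volume": [10 * (i + 1) for i in range(12)],
    }
)

# Group rows into 5-minute buckets and broadcast each bucket's aggregate
# back to its rows, without building and merging a second frame.
bucket = pd.Grouper(key="datetime", freq="5min")
minute["window_price"] = minute.groupby(bucket)["close"].transform("mean")
minute["window_volume"] = minute.groupby(bucket)["volume"].transform("sum")

print(minute.head())
```

A source written this way already states the intent row by row, which is the shape that maps onto .over().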

Pattern 2: Extracting scalars mid-pipeline

One PDS-H query filters rows against a threshold derived from the data itself: rows below a fraction of a part’s average quantity. In that translation, the LLM resolved the threshold first, extracting it as a Python float before using it in the filter.

# What LLMs produce
threshold = lf.select(pl.col("revenue").quantile(0.9)).collect().item()
result = lf.filter(pl.col("revenue") > threshold).collect()

# What idiomatic Polars looks like
result = lf.filter(
    pl.col("revenue") > pl.col("revenue").quantile(0.9)
).collect()

The second version works because aggregations broadcast inside filter. Broadcasting means a scalar value is conceptually repeated to match the length of the column it is compared against, so the quantile is computed once and compared against every row.

.item() has its place: at the end of a pipeline, when you need a Python value for something outside Polars. But .item() only exists on materialized DataFrames, so using it mid-pipeline forces a premature collect(). What should be one query becomes two: the source is scanned twice, an intermediate result materializes, and the optimizer sees each half in isolation.

One other consistent observation: inside .agg(), the model uses the spelled-out form pl.col("x").sum() rather than the top-level shorthand pl.sum("x"). Both are valid and the choice is stylistic. It appeared in 48 places across the corpus and the skill reduced it to 40 occurrences.

Why the accent persists

The common thread across these patterns is that the model translates the pandas source one step at a time rather than taking a step back and reconsidering the intent behind the query. At each step, the safest move is to map a pandas operation to the Polars operation that does the same thing. That keeps the result anchored to the structure of the original, which is why a translation so often resembles the shape of the pandas script it came from, down to the intermediate frames and join-backs. Reformulating the whole query around a single .over() requires stepping back from the line in front of it and recognizing the intent, and the model does not consistently take that step. Non-idiomatic code that runs and produces the right answer generates no error or warning, no signal that anything is suboptimal that would push the model to translate differently. A faithful, statement-by-statement translation that passes is, by that measure, a success.

The accent is not unique to language models, either. Developers who come to Polars after years of pandas tend to carry their old patterns over until they learn the more idiomatic ways of writing Polars queries. Any migration from one tool to another starts from the source, and a line-by-line translation will carry the source’s structure into the result regardless of the target library. Library authors are left with a few options: publish idiomatic examples that land in the LLM’s training data (such as this post!), emit machine-actionable warnings the way PolarsInefficientMapWarning does, and ship guidance in a form agents can load. That last one is what we explored next.

Fixing the accent

The patterns mentioned above are fixable. These patterns, along with the broader translation notes from the PDS-H and EDA exercises, are packaged as a Polars skill. Loaded into an agent, it steers the model toward correct patterns without you having to prompt for them. The skill is freely available and hosted in a GitHub repo.

Does it help?

We re-ran all 33 translations with the skill loaded, using the same model and a fresh session per case. This is a small-scale experiment, one translation per case on a single model, so treat the numbers as a directional signal rather than a benchmark.

The skill is not a magic bullet that solves everything, but it nudges the LLM in the right direction. Structural patterns improved: join-back cases dropped from 3 to 1, and the one mid-pipeline .item() case resolved to inline broadcasting. Correctness moved from 31 to 32 correct, with one skill-induced regression: the skill’s lazy-API guidance caused .collect() to be called on an already-eager frame. The aggregation shorthand tell barely moved: 48 occurrences dropped to 40, a 17% reduction.

Here is what the structural improvement looks like in practice. PDS-H query 17 filters line items against a threshold derived per part: 20% of that part’s average quantity. Without the skill, the model built the per-part average as a separate frame and joined it back. With the skill loaded, the per-group threshold becomes an .over() window inside the filter, and the separate aggregate frame and the join-back collapse into a single pass:

# Without the skill
jn = filtered_part.join(
    lineitem_ds, left_on="p_partkey", right_on="l_partkey"
)

avg_qty = jn.group_by("p_partkey").agg(
    (0.2 * pl.col("l_quantity").mean()).alias("avg_quantity")
)

avg_yearly = (
    jn.join(avg_qty, on="p_partkey")
    .filter(pl.col("l_quantity") < pl.col("avg_quantity"))
    .select((pl.col("l_extendedprice").sum() / 7.0).round(2).alias("avg_yearly"))
)

# With the skill
avg_yearly = (
    filtered_part
    .join(lineitem_ds.lazy(), left_on="p_partkey", right_on="l_partkey")
    .filter(
        pl.col("l_quantity") < 0.2 * pl.col("l_quantity").mean().over("p_partkey")
    )
    .select((pl.col("l_extendedprice").sum() / 7.0).round(2).alias("avg_yearly"))
)

The structural patterns improved, but did not all disappear. Two of the three join-back cases resolved. The stock-price case still builds the bucket aggregate and joins it back. The spelled-out aggregation tell stayed largely in place. Spelling-level rules land more reliably than restructuring rules, and reducing the aggregation tell requires restructuring within .agg() calls. The full results are in the appendix.

How to use it

Installing the skill

The skill is a SKILL.md file with supporting reference files. We distribute it as a Claude Code plugin so assistants discover and load it on demand, with a manual clone as a fallback. The source and the issue tracker live in the skill repository.

Claude Code, via the plugin marketplace (recommended)

/plugin marketplace add polars-inc/skills
/plugin install polars@polars

Start a session. Claude Code loads the skill whenever a task matches its description. Type /polars:polars to invoke it explicitly.

Claude Code, manual install

Clone the repo and copy the polars/ directory into your skills folder:

# Project-level (this project only)
git clone https://github.com/polars-inc/skills
cp -r skills/polars .claude/skills/

Use ~/.claude/skills/ instead to enable it across all your projects. After a manual install the skill command is /polars (no plugin namespace).

Cursor, Codex, and Copilot

The skill is a directory (SKILL.md plus reference files), so these tools install it by copying the polars/ directory into their skills folder. Each tool reads from a different path. The skill README has the exact per-tool steps for Cursor, Codex, and Copilot.

Once loaded, ask for a translation or for a review of existing Polars code. No extra prompting is needed.

The skill is a snapshot of what current models get wrong, distilled from one set of experiments, and it will age as models improve and as Polars grows. If you try it on your own pandas code and the model still writes with an accent, open an issue with the pattern you found. That feedback flows directly into the next revision of the skill.

Conclusion

Claude Opus 4.8 translated three bodies of pandas code to Polars that ran and matched the original outputs in 31 of 33 cases, with two minor slips. Most translations read as idiomatic Polars. What remains is a faint accent: two structural patterns (joining aggregate results back where .over() fits, and extracting scalars mid-pipeline).

A skill reduces the structural patterns. The aggregation case is tough to completely eliminate, because that rule requires restructuring inside .agg() calls and restructuring rules land less reliably than spelling-level rules do. Accented code runs and returns correct results, but it can be harder to read and maintain, and the non-idiomatic forms could leave performance unrealized.

The skill is worth loading for the structural patterns. We recommend verifying translations with assert_frame_equal regardless of whether the skill is loaded. You can find the skill in this repo. It is public and we will maintain it going forward. You can get involved by opening issues for mistakes you run into, or by creating pull requests to help fix any flaws you encounter in your translations.

In a follow-up post we will look at pipeline migration strategies, both with and without LLMs.

Appendix

Skill contents

A sample of the rules in the skill:

Wrong	Correct	Notes
`.agg([expr, expr])` list form	`.agg(expr, expr)`	positional args everywhere
`pl.col("x").sum()` inside `.agg()`	`pl.sum("x")`	top-level shorthand
`.select(...).item()` mid-pipeline	cross join the one-row aggregate	keeps the plan lazy
`map_elements(lambda v: mapping[v])`	`.replace_strict(mapping)`	~19x faster at 5M rows in our test, stays in engine

The trajectory: Claude Sonnet 4.6

Sonnet 4.6 ran on the same corpus. Every translation ran and the outputs matched in all 33 cases. The accent was heavier: in addition to the structural patterns Opus still shows, Sonnet produced spelling-level tells that Opus has already shed.

List-wrapped arguments. select, group_by, agg, and with_columns all take positional arguments, but 21 of the 33 Sonnet translations (64%) passed lists instead.

# What Sonnet produces
result = df.group_by(["l_returnflag", "l_linestatus"]).agg(
    [
        pl.col("l_quantity").sum().alias("sum_qty"),
        pl.col("l_quantity").count().alias("count_order"),
    ]
)

# What idiomatic Polars looks like
result = df.group_by("l_returnflag", "l_linestatus").agg(
    pl.sum("l_quantity").alias("sum_qty"),
    pl.len().alias("count_order"),
)

The lists are valid, but the extra brackets are a reliable tell that the code was machine-translated from pandas. With the skill loaded, this disappeared entirely.

map_elements fallbacks. One of the ETL notebooks maps country names to ISO codes through a Python dictionary. Sonnet kept the dictionary lookup in Python:

# What Sonnet produces
df = df.with_columns(
    pl.col("country_name")
    .map_elements(lambda name: country_codes[name], return_dtype=pl.String)
    .alias("country_code")
)

# What idiomatic Polars looks like
df = df.with_columns(
    pl.col("country_name").replace_strict(country_codes).alias("country_code")
)

The map_elements version round-trips every value through the Python interpreter. replace_strict takes the same dictionary and performs the lookup inside the engine: in our measurement, about 19x faster at 5 million rows. Polars even detects this case at runtime and emits a PolarsInefficientMapWarning that names the exact replacement. But in an agent workflow, where the code runs once and the warning scrolls by, nobody acts on it. With the skill loaded, this case translated as replace_strict in a fresh session.

Python-side scalar arithmetic. Two Sonnet queries used Python’s round() on an extracted scalar where the .round() expression would keep the result inside the engine. With the skill, both cases resolved to native expressions.

The skill helped Sonnet far more. Total wrong-pattern instances dropped from 80 to 23 across all 33 cases, a 71% reduction. Most of that improvement came from the spelling-level tells: list-wrapping disappeared entirely, map_elements fallbacks disappeared, and Python-side round() disappeared. The structural patterns improved but persisted in some cases, matching the same behavior seen in the Opus run. The same skill, applied to the earlier model, had a much larger effect because most of Sonnet’s tells were spelling-level and spelling-level rules land reliably.

The fastest way to discover whether a fitting expression exists is to search the Polars API reference, which groups expressions by what they operate on. At runtime, PolarsInefficientMapWarning names the exact replacement when one exists, so a translated script is worth running once with warnings visible.

Results

We ran all 33 translations (the 22 PDS-H queries and the 11 EDA cases) on both models, in both arms: one shot per case, fresh session, skill the only difference between arms. This is a small corpus and a single run per case. The counts below are a directional signal, not a benchmark. The two ETL notebooks were left out of the count because they have no automated verification harness.

Pattern	Sonnet no skill	Sonnet with skill	Opus no skill	Opus with skill
Correct (33 total)	33/33	33/33	31/33	32/33
Aggregate + join-back where `.over()` fits	5 cases	2 cases	3 cases	1 case
Mid-pipeline `.item()`	3 cases	2 cases	1 case	0 cases
Python fallback where a native expression exists	1 case	0 cases	0 cases	0 cases
List-wrapped args to `select`/`group_by`/`agg`/`with_columns`	21 cases	0 cases	0 cases	0 cases
`pl.col("x").sum()` instead of `pl.sum("x")` inside `.agg()`	48 occurrences	19 occurrences	48 occurrences	40 occurrences
Python-side `round()` on extracted scalars	2 cases	0 cases	0 cases	0 cases

¹ The PDS-H benchmark (May 2025) measures Polars roughly two orders of magnitude faster than pandas at scale factor 10, the only scale factor at which the pandas run completes; at larger scales pandas hits out-of-memory failures. See also the Polars vs PySpark benchmark for single-node versus distributed comparisons.
² wesm/pydata-book (MIT, 4 cases); stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition (MIT, 3 cases); guipsamora/pandas_exercises (BSD-3-Clause, 2 cases).