Introduction
This case study was written by Isaac Robbins and Nelson Griffiths. Isaac is a data engineer at Double River Investments, where he oversees database management and data pipelines and develops tooling for interacting with data systems. Nelson is the Engineering and ML Lead at Double River and an active member of the Polars Discord. Over the past months, Nelson has inspired many users to explore Polars plugins as he and his team have achieved remarkable speed-ups and memory improvements. With their successes with Polars plugins, they are excited to share their experiences here.
About Double River
Double River is a start-up hedge fund focused on quantitative analysis. We have a small team of three engineers and support a research team of roughly ten analysts. As the systems team, we strive to provide robust, flexible infrastructure for the daily pipelines that process financial data, and to support the research team so they can focus on effective analysis that eventually feeds back into those pipelines. Over the past two years, Polars has become core to achieving our goals at Double River, giving us increased performance, a lower memory footprint, and a faster and more comprehensive development experience.
Learning to walk: Hello Polars
In the beginning, we had only heard whispers of the benefits of some mysterious, blazingly fast Python library that promised better everything when it came to working with data. Our systems team had significant experience with pandas and NumPy and used them to prototype the first pieces of our data pipelines and trading strategy. Things were simpler back then: simple calculations like returns from t-1 to t and missing value counts across different datasets. As we stumbled through the exploratory phase of building out our tech stack and figuring out how to manage our pipelines, we began to explore Polars. Initially, we may have started testing it out simply because it was written in Rust, which was good enough for us, but we soon realized that its claims were true, at least in our limited experience.
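For a sense of what those early calculations looked like, here is a minimal sketch in Polars; the tickers, prices, and column names are purely illustrative, not our real schema:

```python
import polars as pl

# Toy data; tickers, prices, and column names are illustrative only.
df = pl.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT"],
    "close": [185.0, 187.1, None, 402.3, 404.0, 401.5],
})

# Return from t-1 to t, computed independently per ticker.
returns = df.with_columns(pl.col("close").pct_change().over("ticker").alias("ret"))

# Missing-value counts, one number per column.
print(returns.null_count())
```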
It started with one of our engineers:

"Have you tried Polars yet? It's only been a week and I think I'm better at Polars than I am at pandas! I filtered a DataFrame without suppressing warnings!"

Then another. The same wonderful experience.
Within a month or so, it was clear that Polars was the way forward at Double River and that everyone, including our analysts, would need to adopt it. No one wanted to be seen writing pandas in the office anymore.
But while we were standing up the first pieces of our production pipelines and testing out Polars, our research team was busy working on the more complicated and compute-intensive stuff. They were implementing what we'll call model A and model B, both involving significant amounts of data (memory) and processing (compute). They were using pandas, NumPy, and statsmodels simply because that's what they were comfortable with. By the time we had everyone on board with Polars, these two pieces were too far along to warrant an immediate rewrite. As an organization, our focus was an MVP: something in production was more important than something better that wasn't in production. Over the next year, we released our MVP, started actually trading, and began iterating. We had transitioned most things to Polars, and everyone was now familiar with it. But in terms of runtime and the size of the infrastructure needed, two pieces of our pipelines still stuck out like sore thumbs: model A and model B.
Once we were comfortable with Polars and our other workflows had steadied, it was time to face our demons.
Learning to run: Hello Polars plugins
Model A
Our first implementation of model A used pandas for data processing and statsmodels for regressions. By the time it was ready for production, we had replaced pandas with Polars for the data processing, which resulted in some noticeable improvements, but these were outweighed by some bigger issues. The nature of model A is complex, but suffice it to say that the number of regressions required is high. Like, really high. They took up the vast majority of the model's runtime.

Our model needed to be run each day, in its entirety, from scratch; unfortunately, we can't simply store previous values and reuse them. Furthermore, due to data availability, we have a relatively short window in which to run our models so that all data is processed before the markets open. Using the statsmodels implementation of the regressions, our daily run took around 120 minutes and used around 75GB of memory. Being fully on board with the benefits of Rust and its ability to be built into a Python package, we wrote our own implementation of the regressions in Rust, imported it as a Python package, and swapped out the statsmodels implementation. Running on the same machine, we saw immediate improvements: each run now took around 80 minutes, again using around 75GB of memory. We did nothing special in the implementation; rather, with the calculations in Rust, we were able to take advantage of parallelization. While this was a huge win, one issue remained: the memory. To run the regressions, we had to convert all of the data from Polars to NumPy arrays to hand to Rust, then convert the results from Rust back through NumPy into Polars.
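The shape of that problem looks roughly like the sketch below. This is a simplified stand-in, not our actual code: `rust_fit` plays the role of our compiled Rust extension, and the column names are made up. The point is the copies; every boundary crossing materializes the data again.

```python
import numpy as np
import polars as pl

def rust_fit(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Stand-in for the hypothetical compiled Rust extension; in production
    # this was a PyO3 module that ran the regressions in parallel.
    return np.linalg.lstsq(x, y, rcond=None)[0]

def run_regressions(df: pl.DataFrame) -> pl.DataFrame:
    # Polars -> NumPy: materializes a second copy of the data.
    y = df["target"].to_numpy()
    x = df.select("feat_1", "feat_2").to_numpy()
    # NumPy -> Rust: the extension holds its own buffers while it works.
    betas = rust_fit(x, y)
    # ...and back into Polars again before the pipeline can continue.
    return pl.DataFrame({"beta": betas})
```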
Just then, like an angel from the heavens, the Polars team announced plugins. With a little more Rust and a bit of help from the Polars team, we were able to rewrite our regressions as a plugin. There are many benefits to using a plugin, but two really stood out to us (as the sketch after this list illustrates):

- The data stays in Polars, keeping memory usage down.
- We get all the advantages of Polars' parallelization for free.
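To make that concrete, here is a minimal sketch of what the Python-facing side of such a plugin can look like, using Polars' `register_plugin_function`. The `ols_beta` function name, the column names, and the data are all hypothetical; the Rust side would be compiled separately with pyo3-polars.

```python
from pathlib import Path

import polars as pl
from polars.plugins import register_plugin_function

def ols_beta(y: pl.Expr, x: pl.Expr) -> pl.Expr:
    # Binds the expression to a hypothetical `ols_beta` function exported by
    # the compiled Rust library sitting next to this file.
    return register_plugin_function(
        plugin_path=Path(__file__).parent,
        function_name="ols_beta",
        args=[y, x],
        returns_scalar=True,  # one regression coefficient per group
    )

df = pl.DataFrame({
    "symbol": ["A", "A", "B", "B"],
    "ret": [0.010, -0.020, 0.030, 0.000],
    "mkt_ret": [0.005, -0.010, 0.020, 0.001],
})

# The plugin behaves like any native expression: Polars runs it per group in
# parallel, and the data never leaves Polars' memory.
betas = df.group_by("symbol").agg(ols_beta(pl.col("ret"), pl.col("mkt_ret")))
```

On the Rust side, the kernel is an ordinary function annotated with pyo3-polars' `#[polars_expr]` macro, which receives the input Series and returns a new one.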
With our fancy new Polars expression for the regressions, we slashed our memory usage to 42GB and even saw the runtime drop to 9 minutes. Yes, you read that right: we went from nearly two hours to nine minutes. We were able to run our model in a fraction of the time and with a little over half the memory.
| Implementation | Time (minutes) | Memory (GB) | Relative Time | Relative Memory |
|---|---|---|---|---|
| statsmodels | 120 | 75 | 1.0 | 1.0 |
| Custom Rust + NumPy | 80 | 75 | 0.67 | 1.0 |
| Polars plugin | 9 | 42 | 0.075 | 0.56 |
Lest anyone think our initial implementation of the regressions was optimized, we can assure you it was not. We could certainly have done better on speed and memory. But we are a startup sprinting towards a product, and we wrote a simple, straightforward implementation. What we found, though, was that with Polars plugins we didn't have to spend time in the weeds of optimization; we could do a mediocre job of rewriting it in Rust (which is quickly becoming a favorite pastime of ours). Crisis averted, with amazing results. We ended up with a mission-critical pipeline that ran much faster and used far less memory on smaller, cheaper machines. It was a win-win-win. And it was just the beginning of our journey with Polars plugins.
Learning to fly: Complex Polars plugins
Model B
Model B is yet another example of us choosing to implement amazing algorithms that will never win awards for computational efficiency. In this case, it required iteratively making and updating predictions for each step of a time series. Our initial implementation was written in NumPy and was computationally efficient, but it ran on a single group at a time. Instead of throwing more compute at the problem and dealing with Python's multiprocessing, we chose to harness Polars plugins. Having successfully written our first plugin for the regressions in model A, all we wanted to do was write more. Fortunately for us, model B just sat there every morning, running for 75 minutes, begging to be reimplemented as a plugin. We soon obliged.
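To illustrate the before and after, here is a simplified sketch. The update rule, the data, and the `predict_series` plugin name are all hypothetical; what matters is that the serial per-group loop moves into a plugin expression that Polars can parallelize across groups.

```python
import numpy as np
import polars as pl

df = pl.DataFrame({
    "symbol": ["A", "A", "A", "B", "B", "B"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

def predict_one_group(values: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for model B's step-by-step predict-then-update loop.
    preds, state = np.empty_like(values), 0.0
    for t, v in enumerate(values):
        preds[t] = state               # predict from the current state...
        state = 0.9 * state + 0.1 * v  # ...then update it (illustrative rule)
    return preds

# Before: one group at a time, serially, copying out of Polars for each group.
serial = [predict_one_group(g["value"].to_numpy()) for _, g in df.group_by("symbol")]

# After: the same loop compiled into a hypothetical `predict_series` plugin,
# run per group in parallel without leaving Polars:
# df = df.with_columns(predict_series(pl.col("value")).over("symbol"))
```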
This implementation required quite a bit more care, as the number of steps and operations had grown significantly and there was a lot more to keep track of. But once we had figured it out, it was glorious: model B's runtime went from 75 minutes to 6 minutes, and its memory usage dropped from 41GB to 19GB.
| Implementation | Time (minutes) | Memory (GB) | Relative Time | Relative Memory |
|---|---|---|---|---|
| NumPy | 75 | 41 | 1.0 | 1.0 |
| Polars plugin | 6 | 19 | 0.08 | 0.46 |
While we did have to rewrite our implementation of model B in Rust as a Polars plugin, we did not change how it was implemented. There were no algorithmic tricks or improvements to eke out extra performance, just the same set of linear algebra instructions in a different language. We got a 12x improvement in speed and roughly a 50% reduction in memory simply by changing the context these calculations ran in, from Python and NumPy to Rust and Polars.
Conclusion
Polars is now one of the core beliefs at Double River. Whatever that means, we have fully embraced it. At first, it gave us some extra speed and a bit more memory to play with. Soon, this translated into being able to run our data pipelines on significantly smaller infrastructure. And if that wasn't enough, our development times have gone down, since Polars is simply easier to write. Eventually, the kind people at Polars released plugins, which meant we could take advantage of all of its amazing capabilities where it mattered most, by tailoring it to our very specific needs. In doing so, we cut our models' runtimes enough to reduce the time to run our entire pipeline by 60%, and the memory usage of our most complex models dropped by roughly 50%. All of this combined has led to huge cloud savings and a much better product for our research team.
It's now hard to imagine us without Polars. We are now able to do, and have done, things we previously didn't even think possible. So here's to Polars and its team for their product and continued support!
Standardizing around Polars has been a huge win for us. The speed and memory benefits have been obvious. The best part, though, is how easy it is to read and write. Collaboration between teams at Double River is better than ever. We are all speaking the same language, and it's Polars.
Nelson Griffiths
Engineering and ML Lead @ Double River
It's honestly insane how much of a difference using Polars can and does make. Developing with it is so intuitive. Speed- and memory-wise, it excels. And with the ability to extend it without limits through plugins, Polars is the definitive tool for working with data.
Isaac Robbins
Data Engineer @ Double River