
Processing hundreds of GBs of textual data on a daily basis at MDPI

Thu, 27 Jun 2024

Collection of journals

Introduction

After learning about the usage and impact of Polars at MDPI, we asked MDPI to write a post sharing their use case with our community. It is an interesting look at MDPI's data engineering approach and the role Polars plays in it. This post was written by Jean-Philippe Dubuc, Senior Data Engineer at MDPI.

About MDPI

MDPI is a pioneer in open access scientific publishing based in Basel, Switzerland, with offices all around the world. MDPI's mission is to foster open scientific exchange in all forms, across all disciplines. Since 1996, the company has published the research of more than 330,000 individual authors across 444 journals, and receives more than 25 million monthly webpage views.

MDPI AI

The journey of Polars at MDPI started two years ago with a data warehouse project to support the Artificial Intelligence and Analytics teams of the company. Most of our colleagues in these departments are Data Scientists or Machine Learning Engineers with little experience in pure data engineering. However, we see ourselves as a team of tech enthusiasts with a startup-like mindset, encapsulated within a large company.

Starting the project from scratch, we were not limited by an existing, outdated tech stack. Polars, which was in its early stages at the time, drew most of our attention.

Why we chose Polars

First, most of the tools that we use are open source, which aligns with the philosophy of MDPI being an exclusively open-access publisher.

The AI team at MDPI develops its applications exclusively in Python. Being able to run an optimized Rust library while keeping the benefits of the Python environment is truly a game changer. Anyone from the team can, and is encouraged to, step in and contribute to modifying transformation pipelines themselves.

Polars offers impressive speed & seamless integration with other tools. Massive fan of LazyFrames for efficient data manipulation!

Diogo Rodrigues, Senior Data Scientist @ MDPI

In the team, we also believe that moving to the cloud can be expensive and is not always necessary when the data size does not require it. Because we’re definitely not in the Big Data one percent, most of our architecture is on-premise, and most of our tools are open source.

Within our data warehouse platform, we process about 800GB of Parquet files daily, a good proportion of which is textual data. Some of our transformations include running inference with our AI language models, all orchestrated by Apache Airflow.

With Polars we have been able to redefine our data pipelines, quickly adapting to business and data changes without sacrificing performance. Polars is now an indispensable ally for our data pipelines and Data Hub.

Andrea Guzzo, AI Technical Leader @ MDPI

Because of that, Polars is a very good match for us, as we’re potentially avoiding the complexity and costs of moving to the cloud.

Polars syntax can have a learning curve for beginners or pandas enthusiasts, but taking the time to get familiar with it is definitely worth it. It leaves no room for ambiguity and produces very clean, readable code, especially when combined with a linter.

The pace at which the Polars team releases updates is very impressive; every week or two, new features, performance improvements and bug fixes are released. By virtue of its continuous growth and improvements, we see Polars as a strategic partner who will help us achieve our goals of innovation and efficiency.

Thanks to Polars, we are able to handle complex datasets with ease while maintaining high standards of security and compliance.

Andrea Perlato, Head of Data Analytics @ MDPI

What made the difference

The ability to implement custom Polars plugins in Rust is invaluable. Since we process a lot of textual data for our NLP applications, we can create optimized functions to clean text or detect a language, with data being processed efficiently in batches. This level of customization is rarely seen in other typical processing engines and is even impossible when relying solely on SQL for data engineering and transformations.

Finally, Polars' close relationship with the Apache Arrow ecosystem also eases its integration into a modern, efficient data engineering architecture. We are able to leverage Arrow's dataset partitioning features together with Polars' predicate pushdown for even more efficient batch processing and memory optimization.

By using the ADBC driver directly from Polars to connect to databases, and by storing Parquet files locally or in an on-premise MinIO S3 object store, we maintain a columnar architecture from one end to the other, moving data around at lightspeed.

Learn more about MDPI

If you are interested in what MDPI is working on, you can read more about an AI tool for researchers in scientific publishing that they recently developed, or have a look at their open positions.
