PyData Tel Aviv 2024

Using Row Groups for fast filtering of large Parquet files
11-04, 15:15–15:45 (Asia/Jerusalem), Blue Track

Deep learning algorithms often require multiple large datasets, and Parquet is one of the most prevalent formats for storing big data. Datasets are typically consumed by batch processing, where many tasks each read a small slice of the data, yet big data query engines like Trino and Spark are most efficient when working on large chunks. In this talk we will present a solution based on Parquet's Row Groups feature that allows both small and large amounts of data to be consumed quickly.


The Apache Parquet file format offers convenient columnar data access and is popular both in Python frameworks (e.g., Pandas) and in big data frameworks. In this talk, we will dive into how Parquet files are laid out on disk, the advantages and disadvantages of the format, and how the Row Groups feature can solve performance challenges and reduce cloud costs by striking a good balance between column-wise and row-wise access, which is essential when reading from cloud object storage such as AWS S3. Beyond the technical background, we will demonstrate how this can be implemented efficiently.
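
To give a flavor of the technique (a minimal sketch under assumptions, not the speakers' implementation): the snippet below uses pyarrow to prune row groups via the min/max statistics stored in the Parquet footer, so only row groups that could contain the target value are fetched from storage. The file name events.parquet, the timestamp column, and the target value are all hypothetical.

    import pyarrow.parquet as pq

    # Hypothetical file, column, and target value, for illustration only.
    PATH = "events.parquet"
    COLUMN = "timestamp"
    TARGET = 1_700_000_000

    pf = pq.ParquetFile(PATH)
    meta = pf.metadata
    col_idx = pf.schema_arrow.get_field_index(COLUMN)

    # Keep only row groups whose footer statistics say the target value
    # could be present; the rest are never read from storage.
    candidates = []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max:
            candidates.append(rg)  # no statistics: must read to be safe
        elif stats.min <= TARGET <= stats.max:
            candidates.append(rg)

    # Read just the candidate row groups; on object storage like S3 this
    # translates into one range request per group instead of a full scan.
    table = pf.read_row_groups(candidates, columns=[COLUMN])
    print(f"Read {len(candidates)}/{meta.num_row_groups} row groups, "
          f"{table.num_rows} rows")

This pruning only pays off when the data is sorted or clustered on the filter column, so that each row group covers a narrow min/max range; that trade-off is part of what the talk explores.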

Menachem Kluft - Senior Backend Developer at Mobileye.
Uri Mogilevsky - Technical Lead at Mobileye.
We're part of a group building data processing infrastructure that serves teams across the entire company, addressing varied needs and tackling challenging data processing tasks.