PyData Tel Aviv 2024

Optimizing Data-Driven Decisions: Introducing an Aggregation Engine for Efficient Feature Creation
11-04, 14:30–15:00 (Asia/Jerusalem), Blue Track

In the world of data-driven decision-making, creating features from aggregated data is a common practice. However, the naive approach of iterating over large historical datasets for each calculation can be inefficient and time-consuming. Enter our Aggregation Engine: a mechanism for optimizing this process, enabling the reuse of historical aggregative data and preventing redundant recalculations. Join us in this talk as we unveil our design for this Aggregation Engine, walk through our Python implementation, and discuss how it helped us reduce the amount of time and fetched data required for feature calculations.


One of the most common ways of using data to make informed, data-driven decisions is creating features based on aggregated data. For example, the number of transactions a client made in their bank account over the past 6 months can be aggregated into a feature that is later used when deciding whether to approve or decline a new transaction request. A naive way to implement these aggregative features would be to iterate over large amounts of historical data, on a periodic or on-demand basis, to calculate the relevant aggregations. The process that calculates the total transactions in the last 6 months, for example, would need to fetch the entire 6 months of transactions from scratch every time the calculation is performed.
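To make that cost concrete, here is a minimal, purely illustrative Python sketch of the naive approach (the transactions table, its amount and tx_date columns, and the SQLite-style connection are hypothetical stand-ins, not taken from the talk):

    import sqlite3
    from datetime import date, timedelta

    def naive_transaction_total(conn: sqlite3.Connection,
                                account_id: str,
                                as_of: date) -> float:
        """Sum 6 months of raw transactions, re-fetched from scratch each call."""
        start = as_of - timedelta(days=180)
        # Every invocation re-reads up to 180 days of raw rows from the deep table.
        rows = conn.execute(
            "SELECT amount FROM transactions"
            " WHERE account_id = ? AND tx_date >= ? AND tx_date < ?",
            (account_id, start.isoformat(), as_of.isoformat()),
        ).fetchall()
        return sum(amount for (amount,) in rows)

Every call pays the full cost of the 6-month scan, even when only a single new day of data has arrived since the previous calculation.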
Our Aggregation Engine was designed to enable a better process, reusing historical aggregative data and preventing unnecessary recalculations of metrics. The engine lets processes continuously calculate daily metric aggregations and store the values in dedicated storage. The idea is to use these stored daily aggregations to compute the final aggregation value, dramatically reducing the amount of fetched data required for the calculation, since there is no longer a need to fetch all the historical raw data. Taking the previous example of an account's total transactions in the last 6 months: with the new solution we no longer fetch 180 days of raw data every time the feature needs to be recalculated; instead, we fetch only the last 180 daily aggregations (at most 180 rows from a much shallower table) and sum them up to get the final result.
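For contrast, a sketch of the same feature built on stored daily aggregations; the daily_tx_totals table (assumed to have a unique key on account_id and day) and both function names are again hypothetical illustrations of the pattern, not the engine's actual implementation:

    import sqlite3
    from datetime import date, timedelta

    def store_daily_aggregation(conn: sqlite3.Connection,
                                account_id: str,
                                day: date) -> None:
        """Daily job: collapse one day's raw transactions into a single stored row."""
        conn.execute(
            "INSERT OR REPLACE INTO daily_tx_totals (account_id, day, total)"
            " SELECT account_id, tx_date, SUM(amount) FROM transactions"
            " WHERE account_id = ? AND tx_date = ?"
            " GROUP BY account_id, tx_date",
            (account_id, day.isoformat()),
        )
        conn.commit()

    def transaction_total(conn: sqlite3.Connection,
                          account_id: str,
                          as_of: date) -> float:
        """Sum at most 180 pre-aggregated daily rows instead of raw history."""
        start = as_of - timedelta(days=180)
        (total,) = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM daily_tx_totals"
            " WHERE account_id = ? AND day >= ? AND day < ?",
            (account_id, start.isoformat(), as_of.isoformat()),
        ).fetchone()
        return total

The feature read now touches at most 180 narrow rows regardless of how many raw transactions the account has, which is where the reduction in fetched data comes from.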

Aviv Vromen is an experienced ML and data infrastructure engineer with a strong background in Python. He is currently working at Bluevine, where he has played a key role in the company's success in the financial technology sector. Prior to that, Aviv contributed as an algorithm developer at Rafael, focusing on complex multi-agent systems.
In this talk, Aviv shares his approach to using aggregated data to improve feature calculation.