PyData Tel Aviv 2024

08:00
08:00
60min
Registration and Breakfast
Red Track
08:00
60min
Registration and Breakfast
Green Track
08:00
60min
Registration and Breakfast
Blue Track
09:00
09:00
15min
Opening Words
Red Track
09:00
15min
Opening Words
Green Track
09:00
15min
Opening Words
Blue Track
09:15
09:15
30min
Keynote: The Dangerous Data Anonymization
Ran Bar Zik

Sometimes we have to share our data with a third party, such as a government agency or a business partner. But we still have to protect the people in that data, and the usual solution is anonymization: a well-known technique for protecting individual privacy and complying with data protection regulations in various domains. However, this seemingly benign process of removing direct identifiers from datasets does not guarantee true anonymity. Through real-world examples and case studies, this session sheds light on the potential dangers; participants will gain an understanding of the vulnerabilities of anonymized data and learn best practices for achieving real anonymization.
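To make the risk concrete, here is a hypothetical linkage-attack sketch (my illustration, not material from the talk; all names and values are made up): even with names removed, quasi-identifiers such as ZIP code, birth year, and sex can re-identify people when joined against a public register.

```python
# Hypothetical illustration (not from the talk): names are removed, yet the
# quasi-identifier (zip, birth_year, sex) re-identifies every record once
# joined against a public register.
anonymized_records = [
    {"zip": "69100", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "69100", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
]
public_register = [
    {"name": "Dana Levi", "zip": "69100", "birth_year": 1984, "sex": "F"},
    {"name": "Yossi Cohen", "zip": "69100", "birth_year": 1990, "sex": "M"},
]

def reidentify(records, register):
    """Link 'anonymized' records back to names via shared quasi-identifiers."""
    index = {(p["zip"], p["birth_year"], p["sex"]): p["name"] for p in register}
    matches = {}
    for r in records:
        key = (r["zip"], r["birth_year"], r["sex"])
        if key in index:
            matches[index[key]] = r["diagnosis"]
    return matches

print(reidentify(anonymized_records, public_register))
# Both "anonymous" diagnoses are now attached to names.
```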

Language: Hebrew
Green Track
09:45
09:45
15min
Break
Red Track
09:45
15min
Break
Green Track
09:45
15min
Break
Blue Track
10:00
10:00
30min
APL-Inspired Techniques for Advanced NumPy
Eran Krakovsky

APL is renowned for its influence on array programming paradigms. Its design philosophy carries valuable insights that can aid Python developers in effectively using NumPy, the fundamental package for numerical computation in Python. This talk will explore the ideas derived from APL to optimize array manipulations in NumPy and to encourage the use of more efficient, expressive, and elegant programming patterns.
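For flavor, here is one APL-inspired idiom in NumPy (my illustration; the talk's own examples may differ): an outer product plus an axis reduction replaces nested loops, echoing APL's `∘.×` and `+/`.

```python
import numpy as np

# APL-flavored idiom (my illustration): an outer product plus an axis
# reduction replaces nested loops - roughly APL's ∘.× and +/ in NumPy.
n = np.arange(1, 6)
table = np.multiply.outer(n, n)  # 5x5 multiplication table, no loops
row_sums = table.sum(axis=1)     # +/ style reduction along each row
print(row_sums)  # [15 30 45 60 75]
```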

Language: Hebrew
Blue Track
10:00
30min
Empowering ML Developers with Self Serve Data Analytics
Roman Olshanskiy, David Katz

In this talk we will share how our SDK empowers ML developers to build powerful, customized data apps without needing deep expertise in cloud infrastructure or data engineering.

Language: Hebrew
Green Track
10:00
30min
Unveiling the Journey of Natural Language Processing (NLP): Milestones, Limitations, and Practical Applications
Ortal Ashkenazi

Join us in this introductory lecture on Natural Language Processing (NLP). Delve into a captivating exploration of the key milestones that have shaped the advancement of NLP, highlighting its relevance and practical applications outside the NLP field. By providing a comprehensive overview of NLP's evolution, including notable developments, breakthroughs, and existing limitations, this lecture equips data scientists with a fundamental understanding of NLP's dynamic world. Discover the profound impact of NLP on the field of research and gain valuable insights to enhance your data analysis endeavors.

Language: English
Red Track
10:30
10:30
15min
Break
Red Track
10:30
15min
Break
Green Track
10:30
15min
Break
Blue Track
10:45
10:45
30min
A Shallow Introduction to Self-Attention
Alon Oring

This talk navigates from Recurrent Neural Networks to Generative Pretrained Transformers (GPTs), with a primary focus on understanding attention mechanisms. We start with the building blocks: The Perceptron and RNN cells, and after identifying the issues that arise with RNNs, we delve into attention mechanisms, examining their role in translation tasks and leading into a detailed dissection of self-attention. The culmination is a comprehensive review of Transformer models and the GPT series, their performances, and their capacities in zero-shot, one-shot, and few-shot learning with prompts.
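The core computation the talk builds toward can be sketched in a few lines of NumPy (a single attention head, my simplification; real Transformers add multiple heads, masking, and positional encodings):

```python
import numpy as np

# Minimal single-head self-attention in NumPy (my simplification; real
# Transformers add multiple heads, masking, and positional encodings).
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product scores
    return softmax(scores) @ V               # each token mixes all values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```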

Language: English
Red Track
10:45
30min
Adding Your Own Data Apps to JupyterLab
Daniel Goldfarb

In this practical talk about how to extend JupyterLab, we focus on understanding the underlying extension support infrastructure. As we walk through a step-by-step example of creating an app in JupyterLab, we'll learn, among other things, how to launch that app from different places within JupyterLab, how to style our app, and how to pass parameters to our app to modify its behavior. This talk is for anyone who finds themselves doing complex or repetitive tasks and thinks that they, and others, may benefit from integrating those tasks into JupyterLab.

Language: English
Blue Track
10:45
30min
The TL;DR of EDA
Mor Hananovitz

If you're drowning in data but short on time, this talk is for you. We'll explore EDA methods for the 'lazy engineer,' showcasing how to accelerate insights by automating your data exploration: from leveraging automated reporting libraries like ydata-profiling, to using ML clustering algorithms to go from raw data to distinct clusters, to enhancing insights with the power of LLMs. Join me for a practical, end-to-end guide to optimizing EDA, agnostic to data types, and boosting productivity through smart automation.
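As a taste of the 'lazy engineer' approach, a minimal automated first pass might look like this (my sketch, not the speaker's pipeline):

```python
import pandas as pd

# A minimal "lazy" first pass (my sketch, not the speaker's pipeline):
# one function surfacing shape, dtypes, missing values, and numeric stats.
def quick_eda(df: pd.DataFrame) -> dict:
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "numeric_summary": df.describe().to_dict(),
    }

df = pd.DataFrame({"age": [25, 31, None, 47], "city": ["TLV", "TLV", "Haifa", None]})
report = quick_eda(df)
print(report["missing"])  # {'age': 1, 'city': 1}
```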

Language: Hebrew
Green Track
11:15
11:15
15min
Break
Red Track
11:15
15min
Break
Green Track
11:15
15min
Break
Blue Track
11:30
11:30
30min
AI, SQL, and GraphQL Walk into a Fertility Clinic… LLM-based Medical feature development
Shirli Di-Castro Shashua

In the ever-evolving landscape of healthcare, doctors face an ongoing challenge: how to access vital medical information about their patients buried deep within databases. Traditional methods have proven time-consuming and often fall short of providing the comprehensive answers doctors need. But what if I told you that AI, SQL, and GraphQL have walked into fertility clinics, offering a groundbreaking solution?

In my presentation I explore the innovative use of Large Language Models (LLMs) in medical feature development. I introduce a novel approach that leverages LLMs to translate doctors' intricate questions into SQL and GraphQL queries, enabling prompt and accurate retrieval of patient data. The result? A revolution in the way doctors access and utilize critical information to make informed decisions.

Join me at the development table as we uncover the objectives behind crafting the "chatting with my medical database" feature. Together, we'll unravel how LLM-based Python chains became integral to this feature and how GraphQL emerged as the superhero, leaving SQL in the dust. We will dive deep into the key development considerations that influenced our choices, encompassing security, flexibility to handle diverse inputs, and reliability in providing doctors with answers to their questions.

Language: English
Green Track
11:30
30min
Securing Language Models Against Prompt Injection with the Powerful LangChain Framework
Michael Ethan Levinger

This lecture on Tackling Prompt Injection focuses on addressing the challenges posed by biased, misleading, or unethical prompts in language models, and the utilization of the LangChain framework to tackle this effort. Prompt injection has emerged as a critical concern affecting the reliability, fairness, and ethical use of language models. In this lecture, we explore innovative methodologies, techniques, and strategies to detect, mitigate, and prevent prompt injection. We delve into the quantitative and qualitative evaluation of prompt injection vulnerabilities, the resilience of language models against adversarial attacks, and the ethical considerations in prompt design and usage. Moreover, we showcase the LangChain framework as a powerful tool to secure language models against prompt injection, ensuring trustworthiness and integrity in data-driven decision-making. Join us for this enlightening lecture and discover how to combat prompt injection and leverage LangChain to enhance data science applications.

Language: Hebrew
Red Track
11:30
30min
Times and Dates in Pandas
Reuven M. Lerner

You probably know how to work with integers, floats, and even strings in Pandas. But Pandas also offers a huge amount of functionality for dates and times — much of which, I've found, is unknown even to veteran Pandas users. In this talk, I'll demonstrate what Pandas can do to display, retrieve, sort, group, and format your date-related records.
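A small example of the kind of functionality the talk covers (my illustration): parsing dates, using the `.dt` accessor, and resampling a datetime index by month.

```python
import pandas as pd

# A small taste of pandas' datetime support (my example): parse strings,
# use the .dt accessor, and resample a datetime index by month.
df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11"]),
    "amount": [100, 50, 75],
})
df["weekday"] = df["when"].dt.day_name()  # e.g. "Friday" for 2024-01-05
monthly = df.set_index("when")["amount"].resample("MS").sum()
print(monthly.tolist())  # [150, 75]
```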

Language: English
Blue Track
12:00
12:00
60min
Lunch
Red Track
12:00
60min
Lunch
Green Track
12:00
60min
Lunch
Blue Track
13:00
13:00
30min
AIOps for Security: SaaS Compliance Automation with a Python Stack
Tomer Doitshman

This presentation will showcase a replicable architecture for how to build your own AI-automation stack using Python, with a real-world example as a guide.

Language: Hebrew
Blue Track
13:00
30min
Causal inference with Causallib
Ehud Karavani

Imagine the newest medical prediction algorithm claims you are at high risk for some health condition. I bet the first thing going through your mind is "well, what can I do to reduce it?"
Regular prediction is not always enough: we often care about predicting the consequences of several possible courses of action - the causal effect of those actions.
In this talk I will briefly present causal inference - the science of estimating the causal effect of actions from observational data - and how it differs from regular prediction. I will overview models for estimating causal effects and show how to apply them with causallib, a one-stop-shop open-source Python package for flexible causal inference modeling.

Language: English
Green Track
13:00
30min
Ibis framework - Making data science work at any scale.
Omri Fima

Pandas and SQL are both great tools for data manipulation and analysis, yet they each come with their unique challenges. Pandas is handy and easy to use, especially for iterative research, but can slow down or crash with anything larger than a few gigabytes. SQL is efficient with huge datasets, but can be tricky and rigid when your analysis becomes more complex.

In our talk, we'll explore how these tools can complement rather than compete with each other in your data science work. We'll show you how Ibis brings the Python experience together with the power of a SQL engine, helping you get the best of both worlds.

Language: Hebrew
Red Track
13:30
13:30
15min
Break
Red Track
13:30
15min
Break
Green Track
13:30
15min
Break
Blue Track
13:45
13:45
30min
Identifying Repetitive Songs using LZ Compression
Geva Kipper

There are claims that pop music has poorly written lyrics. To put these claims to the test, we set out to automatically find the most repetitive songs published in Hebrew.
By combining scraping, data analysis, visualization, and web development, all in Python, we identify the most repetitive song of all time and analyze pop-chart trends over the years, by genre and by artist.
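The underlying idea can be sketched with the standard library's zlib (a sketch of the general approach, not necessarily the speaker's exact metric): repetitive text compresses much better, so a lower compressed-to-raw size ratio signals a more repetitive song.

```python
import zlib

# Repetitiveness via LZ compression (a sketch of the general idea, not
# necessarily the speaker's exact metric): repetitive text compresses well,
# so a lower compressed-to-raw size ratio means a more repetitive song.
def repetitiveness_score(text: str) -> float:
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

chorus_heavy = "na na na na hey hey hey goodbye " * 50
verse_heavy = "every line in this song says something completely different"
print(repetitiveness_score(chorus_heavy) < repetitiveness_score(verse_heavy))  # True
```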

Language: Hebrew
Green Track
13:45
30min
Live Coding: ChatGPT Goes Beyond Its Knowledge Cut-Off With External Database Integration
Shuki Cohen, Yoel Zeldes

In this live coding session you'll learn how to integrate the data you care about into ChatGPT. As a result, you'll be able to ask ChatGPT questions and get answers based on your data!

Language: Hebrew
Red Track
13:45
30min
Processing Biggish Data with DuckDB and Python
Yoav Nordmann

In this session, I will define the term "biggish" data and explain why it matters. I will then discuss and show how using DuckDB from Python opens up a whole new set of possibilities when working with data. There are many ways to use DuckDB in Python, and I want to share some of them with you.

Language: Hebrew
Blue Track
14:15
14:15
15min
Break
Red Track
14:15
15min
Break
Green Track
14:15
15min
Break
Blue Track
14:30
14:30
30min
BertTopic: From Free-Text feedbacks to Calls for Action.
Moran Reznik

As data analysts, we are often called on to derive insights and action items from the feedback our users provide. Traditional analysis tools are great when feedback is given in categorical formats - for example, yes/no or multiple-choice responses - but they often fall short when it comes to free text. I will show how I used the BERTopic Python package to leverage the power of deep learning and language models to embed, cluster, and visualize feedback texts in a way that tells a meaningful story, one that sheds light on the pain points and desires of our end users.

Language: English
Green Track
14:30
30min
Building a Reproducible RAG Pipeline for a Q&A ChatBot with LangChain and Ollama
Isan Rivkin

In this talk we’ll dive into how LLM fine-tuning and RAG approaches differ, what you need to know when employing each of these methods, and why reproducibility is important for both. This will be demonstrated through a real code example using the popular Python LangChain tool, Hugging Face embeddings, and Ollama’s LLMs, covering the critical pieces that impact reproducibility: code and environment, as well as data and model.

Language: English
Red Track
14:30
30min
Optimizing Data-Driven Decisions: Introducing an Aggregation Engine for Efficient Feature Creation
Aviv Vromen

In the world of data-driven decision-making, creating features from aggregated data is a common practice. However, the naive approach of iterating over large historical datasets for each calculation can be inefficient and time-consuming. Enter our Aggregation Engine: a mechanism for optimizing this process, enabling the reuse of historical aggregative data and preventing redundant recalculations. Join us in this talk as we unveil our design for this Aggregation Engine, walk through our Python implementation, and discuss how it helped us reduce the amount of time and fetched data required for feature calculations.
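The general idea of reusing aggregates rather than re-scanning history can be sketched as follows (a toy illustration of incremental aggregation, not the authors' engine):

```python
from collections import defaultdict

# Toy sketch of incremental aggregation (not the authors' engine): keep
# running aggregates per key so each new event updates features in O(1)
# instead of re-scanning the full history on every calculation.
class RunningAggregates:
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, key, value):
        self.count[key] += 1
        self.total[key] += value

    def features(self, key):
        n = self.count[key]
        return {"count": n, "sum": self.total[key],
                "mean": self.total[key] / n if n else 0.0}

agg = RunningAggregates()
for amount in (10.0, 20.0, 30.0):
    agg.update("user-42", amount)
print(agg.features("user-42"))  # {'count': 3, 'sum': 60.0, 'mean': 20.0}
```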

Language: Hebrew
Blue Track
15:00
15:00
15min
Break
Red Track
15:00
15min
Break
Green Track
15:00
15min
Break
Blue Track
15:15
15:15
30min
2D ARIMA: Capturing New Trends for Distant Time Horizons in Cohort Revenue Forecasting
Yonathan Guttel

Introducing our novel "Two-Dimensional ARIMA", a transformative approach that captures emerging trends influencing distant time horizons in cohort revenue forecasting. By integrating cohort attributes, recent data, and seasonal patterns, we've pioneered a method ensuring unparalleled forecasting adaptability and accuracy.

Language: Hebrew
Green Track
15:15
30min
Let our optima combine!
Eyal Gruss

An introduction to solving combinatorial optimization and constraint satisfaction problems in Python. I will review the most popular libraries for SAT/CSP. We will then dive into a crash course on using Google's award-winning OR-Tools library to efficiently solve some non-trivial, real-world constrained combinatorial optimization problems.
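To show the shape of a constraint satisfaction problem before reaching for a solver, here is a brute-force sketch (my illustration; OR-Tools' CP-SAT expresses the same model declaratively and scales far beyond enumeration):

```python
from itertools import product

# Brute-force CSP sketch (my illustration): color a tiny map with 3 colors so
# that no two adjacent regions share a color. A CP-SAT model in OR-Tools would
# state the same constraints declaratively and prune instead of enumerating.
regions = ["A", "B", "C", "D"]
adjacent = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
colors = ["red", "green", "blue"]

def valid(coloring):
    return all(coloring[x] != coloring[y] for x, y in adjacent)

all_assignments = (dict(zip(regions, a)) for a in product(colors, repeat=len(regions)))
solutions = [c for c in all_assignments if valid(c)]
print(len(solutions), "valid colorings; e.g.", solutions[0])
```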

Language: English
Red Track
15:15
30min
Using Row Groups for fast filtering of large parquet files
Uri Mogilevsky-Schay, Menachem Kluft

Deep learning algorithms often require the use of multiple large datasets, with the parquet format being one of the most prevalent formats to store big data. While datasets are typically consumed by batch processing where several tasks process small slices of the data, big data query engines like Trino and Spark are more efficient when working on large chunks of data. In this talk we will present a solution based on the Row Groups feature that allows both small and large amounts of data to be consumed quickly.

Language: Hebrew
Blue Track
15:45
15:45
15min
Break
Red Track
15:45
15min
Break
Green Track
15:45
15min
Break
Blue Track
16:00
16:00
15min
Standup Comedy
Red Track
16:00
15min
Standup Comedy
Blue Track
16:00
15min
Standup Comedy
Jonathan Harel

Comic Relief

Language: Hebrew
Green Track
16:15
16:15
15min
Closing Words
Red Track
16:15
15min
Closing Words
Green Track
16:15
15min
Closing Words
Blue Track