PyData Tel Aviv 2024
Sometimes we have to share our data with a third party, such as a government agency or a business partner, while still protecting the people in it. The standard solution is anonymization, a well-known technique for protecting individual privacy and complying with data protection regulations across many domains. However, this seemingly benign process of removing direct identifiers from datasets may not guarantee true anonymity. Through real-world examples and case studies, this session sheds light on the potential dangers: participants will gain an understanding of the vulnerabilities of anonymized data and learn best practices for achieving real anonymization.
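To make the risk concrete, here is a minimal sketch of a linkage attack on invented data: the quasi-identifiers left behind in an "anonymized" release are enough to re-attach names from a public auxiliary dataset (all column names and records below are made up for illustration).

```python
import pandas as pd

# "Anonymized" release: names and IDs removed, but quasi-identifiers kept.
medical = pd.DataFrame({
    "zip_code":   ["69104", "69104", "52521"],
    "birth_year": [1985, 1990, 1985],
    "sex":        ["F", "M", "F"],
    "diagnosis":  ["diabetes", "asthma", "hypertension"],
})

# Public auxiliary data (e.g. a voter roll) that does include names.
voters = pd.DataFrame({
    "name":       ["Dana Levi", "Yossi Cohen"],
    "zip_code":   ["69104", "52521"],
    "birth_year": [1985, 1985],
    "sex":        ["F", "F"],
})

# Joining on the quasi-identifiers re-attaches names to diagnoses.
reidentified = medical.merge(voters, on=["zip_code", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```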
APL is renowned for its influence on array programming paradigms. Its design philosophy carries valuable insights that can aid Python developers in effectively using NumPy, the fundamental package for numerical computation in Python. This talk will explore the ideas derived from APL to optimize array manipulations in NumPy and to encourage the use of more efficient, expressive, and elegant programming patterns.
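As a taste of the APL-style "whole-array thinking" the talk advocates, here is a small illustrative comparison (the talk's own examples may differ):

```python
import numpy as np

prices = np.array([100.0, 102.5, 101.0, 105.0, 104.0])

# Scalar thinking: compute daily returns one element at a time.
returns_loop = []
for i in range(1, len(prices)):
    returns_loop.append((prices[i] - prices[i - 1]) / prices[i - 1])

# Array thinking, APL-style: express the whole computation at once.
returns = np.diff(prices) / prices[:-1]

# Boolean masks replace conditional loops: count the up days.
up_days = (returns > 0).sum()
print(returns, up_days)
```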
In this talk we will share how our SDK empowers ML developers to build powerful, customized data apps, without needing deep expertise in cloud infrastructure or data engineering.
Join us in this introductory lecture on Natural Language Processing (NLP). Delve into a captivating exploration of the key milestones that have shaped the advancement of NLP, highlighting its relevance and practical applications outside the NLP field. By providing a comprehensive overview of NLP's evolution, including notable developments, breakthroughs, and existing limitations, this lecture equips data scientists with a fundamental understanding of NLP's dynamic world. Discover the profound impact of NLP on the field of research and gain valuable insights to enhance your data analysis endeavors.
This talk navigates from Recurrent Neural Networks to Generative Pretrained Transformers (GPTs), with a primary focus on understanding attention mechanisms. We start with the building blocks: The Perceptron and RNN cells, and after identifying the issues that arise with RNNs, we delve into attention mechanisms, examining their role in translation tasks and leading into a detailed dissection of self-attention. The culmination is a comprehensive review of Transformer models and the GPT series, their performances, and their capacities in zero-shot, one-shot, and few-shot learning with prompts.
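For orientation, here is a bare-bones NumPy rendition of the scaled dot-product self-attention the talk dissects, with toy dimensions and random matrices standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                  # 4 tokens, 8-dim embeddings
x = rng.normal(size=(seq_len, d_model))  # token embeddings

# Learned projections (random here) map tokens to queries/keys/values.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
scores = Q @ K.T / np.sqrt(d_model)                  # pairwise token affinities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
output = weights @ V                                 # each token: weighted mix of values
print(output.shape)                                  # (4, 8)
```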
In this practical talk about how to extend JupyterLab, we focus on understanding the underlying extension support infrastructure. As we walk through a step-by-step example of creating an app in JupyterLab, we'll learn, among other things, how to launch that app from different places within JupyterLab, how to style our app, and how to pass parameters to our app to modify its behavior. This talk is for anyone who finds themselves doing complex or repetitive tasks and thinks that they, and others, may benefit from integrating those tasks into JupyterLab.
If you're drowning in data but short on time, this talk is for you. We'll explore EDA methods for the 'lazy engineer,' showcasing how to accelerate insights by automating your data exploration: from leveraging automated reporting libraries like ydata-profiling, to using ML clustering algorithms to go from raw data to distinct clusters, and finally enhancing insights with the power of LLMs. Join me for a practical, end-to-end guide to optimizing EDA, agnostic to data types, and boosting productivity through smart automation.
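A sketch of the first two stages of such a pipeline, assuming `pip install ydata-profiling scikit-learn` (the file path and cluster count are placeholders, and the LLM step is omitted):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from ydata_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder path

# One line for a full automated EDA report.
ProfileReport(df, title="Quick EDA").to_file("eda_report.html")

# From raw data to distinct clusters.
X = StandardScaler().fit_transform(df.select_dtypes("number").dropna())
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
```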
In the ever-evolving landscape of healthcare, doctors face an ongoing challenge: how to access vital medical information about their patients buried deep within databases. Traditional methods have proven time-consuming and often fall short of providing the comprehensive answers doctors need. But what if I told you that AI, SQL, and GraphQL have walked into fertility clinics, offering a groundbreaking solution?
In my presentation I explore the innovative use of Large Language Models (LLMs) in medical feature development. I introduce a novel approach that leverages LLMs to translate doctors' intricate questions into SQL and GraphQL queries, enabling prompt and accurate retrieval of patient data. The result? A revolution in the way doctors access and utilize critical information to make informed decisions.
Join me at the development table as we uncover the objectives behind crafting the "chatting with my medical database" feature. Together, we'll unravel how LLM-based Python chains became integral to this feature and how GraphQL emerged as the superhero, leaving SQL in the dust. We will dive deep into the key development considerations that influenced our choices, encompassing security, flexibility to handle diverse inputs, and reliability in providing doctors with answers to their questions.
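As a generic illustration of the text-to-SQL pattern, not the speakers' actual chain, here is a hypothetical sketch using the OpenAI client (the schema, model name, and helper function are invented), including the kind of read-only guard that the talk's security considerations call for:

```python
from openai import OpenAI  # assumes `pip install openai` and an API key

SCHEMA = """
patients(id, name, birth_date)
treatments(id, patient_id, protocol, started_at)
"""

def question_to_sql(question: str) -> str:
    prompt = (
        "Given this schema:\n" + SCHEMA +
        "\nWrite a single SQL SELECT answering: " + question +
        "\nReturn only the SQL."
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    sql = resp.choices[0].message.content.strip()
    # Security guard: never execute anything but a read.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Refusing non-SELECT query: " + sql)
    return sql
```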
This lecture on tackling prompt injection focuses on the challenges posed by biased, misleading, or unethical prompts in language models, and on how the LangChain framework can be used to address them. Prompt injection has emerged as a critical concern affecting the reliability, fairness, and ethical use of language models. In this lecture, we explore innovative methodologies, techniques, and strategies to detect, mitigate, and prevent prompt injection. We delve into the quantitative and qualitative evaluation of prompt injection vulnerabilities, the resilience of language models against adversarial attacks, and the ethical considerations in prompt design and usage. Moreover, we showcase the LangChain framework as a powerful tool to secure language models against prompt injection, ensuring trustworthiness and integrity in data-driven decision-making. Join us for this enlightening lecture and discover how to combat prompt injection and leverage LangChain to enhance data science applications.
You probably know how to work with integers, floats, and even strings in Pandas. But Pandas offers a huge amount of functionality for dates and times — much of which, I've found, is unknown even to veteran Pandas users. In this talk, I'll demonstrate what Pandas can do to display, retrieve, sort, group, and format your date-related records.
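A few of those capabilities in a quick sketch (toy data; the talk goes much further):

```python
import pandas as pd

df = pd.DataFrame({
    "when":   pd.to_datetime(["2024-01-05", "2024-01-19", "2024-02-02"]),
    "amount": [120, 80, 200],
})

df["weekday"] = df["when"].dt.day_name()                      # display date parts
january = df[df["when"].dt.month == 1]                        # retrieve by component
df = df.sort_values("when")                                   # sort chronologically
monthly = df.groupby(df["when"].dt.to_period("M"))["amount"].sum()  # group by month
labels = df["when"].dt.strftime("%d %b %Y")                   # format for output
print(monthly, labels, january, sep="\n")
```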
This presentation will showcase a replicable architecture for how to build your own AI-automation stack using Python, with a real-world example as a guide.
Imagine the newest medical prediction algorithm claims you are at high risk for some health condition. I bet the first thing going through your mind is "Well, what can I do to reduce it?"
Regular prediction is not always enough; we often care about predicting the consequences of several possible courses of action - the causal effect of those actions.
In this talk I will briefly present causal inference - the science of estimating the causal effects of actions from observational data - and how it differs from regular prediction. I will give an overview of models for estimating causal effects and show how to apply them with causallib, a one-stop-shop open-source Python package for flexible causal inference modeling.
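A minimal causallib sketch, assuming `pip install causallib`, using the package's bundled NHEFS data to estimate the effect of quitting smoking on weight change with inverse probability weighting:

```python
from sklearn.linear_model import LogisticRegression
from causallib.datasets import load_nhefs
from causallib.estimation import IPW

data = load_nhefs()  # X: covariates, a: treatment, y: outcome
ipw = IPW(LogisticRegression(max_iter=1000))  # model treatment assignment
ipw.fit(data.X, data.a)

# Weighted mean outcome under treatment vs. no treatment.
outcomes = ipw.estimate_population_outcome(data.X, data.a, data.y)
effect = ipw.estimate_effect(outcomes[1], outcomes[0])
print(effect)
```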
Pandas and SQL are both great tools for data manipulation and analysis, yet they each come with their unique challenges. Pandas is handy and easy to use, especially for iterative research, but can slow down or crash with anything larger than a few gigabytes. SQL is efficient with huge datasets, but can be tricky and rigid when your analysis becomes more complex.
In our talk, we'll explore how these tools, instead of working against each other, can complement each other in your data science work. We'll show how ibis combines the familiar Python experience with the power of a SQL engine, helping you get the best of both worlds.
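A small ibis sketch of that idea, assuming `pip install 'ibis-framework[duckdb]'` and a placeholder Parquet file: pandas-like expressions that compile to SQL and execute on the engine.

```python
import ibis

con = ibis.duckdb.connect()                # in-memory DuckDB backend
trips = con.read_parquet("trips.parquet")  # placeholder file

expr = (
    trips.filter(trips.fare > 0)
         .group_by("pickup_zone")
         .aggregate(avg_fare=trips.fare.mean(), n=trips.count())
         .order_by(ibis.desc("avg_fare"))
)
print(ibis.to_sql(expr))  # inspect the generated SQL
df = expr.execute()       # run it and pull results into pandas
```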
There are rumors that pop music has poorly written lyrics. To put these claims to the test, we set out to automatically find the most repetitive songs published in Hebrew.
By combining scraping, data analysis, visualization, and web development, all in Python, we identify the most repetitive song of all time and analyze pop-chart trends over the years, by genre and by artist.
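One simple way to score repetitiveness, shown here as a toy sketch (the talk's actual metric may differ): the share of lines that repeat an earlier line.

```python
def repetitiveness(lyrics: str) -> float:
    # Normalize lines, then measure how many are duplicates.
    lines = [l.strip().lower() for l in lyrics.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return 1 - len(set(lines)) / len(lines)  # 0 = no repeats, -> 1 = all repeats

chorus_heavy = "la la la\nla la la\nhello\nla la la\n"
print(repetitiveness(chorus_heavy))  # 0.5: 2 unique lines out of 4
```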
In this live coding session you'll learn how to integrate the data you care about into ChatGPT. As a result, you'll be able to ask ChatGPT questions and get answers based on your data!
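The shape of what we'll build, as a heavily simplified sketch assuming `pip install openai` and an API key (the model name is an assumption, and retrieval here is naive keyword overlap rather than a real vector search):

```python
from openai import OpenAI

documents = [
    "Our refund policy allows returns within 30 days.",
    "Support hours are Sunday-Thursday, 9:00-18:00.",
]

def answer(question: str) -> str:
    # Naive retrieval: pick the document sharing the most words with the question.
    q_words = set(question.lower().split())
    context = max(documents, key=lambda d: len(set(d.lower().split()) & q_words))
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system", "content": "Answer only from this context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("When can I get a refund?"))
```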
In this session, I will define the term "biggish" data. After understanding the definition of this new term and why it is so important, I will discuss and show how utilizing DuckDB in Python creates a whole new set of possibilities when working with data. There are many ways to use DuckDB in Python and I want to share some of those with you.
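A quick taste, assuming `pip install duckdb` and a placeholder Parquet file: DuckDB queries files directly with SQL and can also see local pandas DataFrames by name.

```python
import duckdb
import pandas as pd

lookup = pd.DataFrame({"code": [1, 2], "label": ["a", "b"]})

# DuckDB scans the file lazily and joins it against the local DataFrame.
result = duckdb.sql("""
    SELECT t.code, l.label, COUNT(*) AS n
    FROM 'events.parquet' AS t          -- placeholder file
    JOIN lookup AS l ON l.code = t.code
    GROUP BY t.code, l.label
""").df()
print(result)
```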
As data analysts, we are often called on to derive insights and action items from the feedback our users provide. Traditional analysis tools are great when the feedback comes in categorical formats - for example, yes/no or multiple-choice responses - but often fall short when it comes to free text. I will show how I used the BERTopic Python package to leverage the power of deep learning and language models to embed, cluster, and visualize feedback texts in a way that tells a meaningful story, a story that sheds light on the pain points and desires of our end users.
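A minimal BERTopic sketch, assuming `pip install bertopic`; the corpus below is a toy for readability, and a real run needs at least a few hundred documents:

```python
from bertopic import BERTopic

feedback = [
    "The app crashes when I upload a photo",
    "Uploading images makes it freeze",
    "I love the new dark mode",
    "Dark theme looks great",
]  # in practice: thousands of survey responses

topic_model = BERTopic(min_topic_size=2)
topics, probs = topic_model.fit_transform(feedback)

print(topic_model.get_topic_info())  # one row per discovered topic
topic_model.visualize_topics()       # interactive map of the topic space
```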
In this talk we'll dive into how LLM fine-tuning and RAG approaches differ, what you need to know when employing each of these methods, and why reproducibility is important for both. This will be demoed through a real code example using the popular Python LangChain tool, Hugging Face embeddings, and an Ollama-served LLM, covering the critical pieces that impact reproducibility: code and environment, as well as data and model.
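On the RAG side, a hedged sketch assuming langchain, langchain-community, sentence-transformers, faiss-cpu, and a running Ollama server (APIs as of recent langchain-community releases; model names are examples, not the talk's exact setup). Pinning the embedding model and the LLM tag is precisely the reproducibility point: swap either one and the answers change.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS

docs = ["PyData Tel Aviv 2024 took place in Tel Aviv."]  # your corpus

emb = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # pin the embedder
index = FAISS.from_texts(docs, emb)
llm = Ollama(model="llama3")  # pin the local model tag

hits = index.similarity_search("Where was PyData Tel Aviv held?", k=1)
prompt = ("Context: " + hits[0].page_content +
          "\nQuestion: Where was PyData Tel Aviv held?")
print(llm.invoke(prompt))
```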
In the world of data-driven decision-making, creating features from aggregated data is a common practice. However, the naive approach of iterating over large historical datasets for each calculation can be inefficient and time-consuming. Enter our Aggregation Engine: a mechanism for optimizing this process, enabling the reuse of historical aggregative data and preventing redundant recalculations. Join us in this talk as we unveil our design for this Aggregation Engine, walk through our Python implementation, and discuss how it helped us reduce the amount of time and fetched data required for feature calculations.
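A hypothetical sketch of the core idea, not the speakers' actual implementation: aggregate each day of raw events exactly once into a cached partial, then answer rolling-window features by combining partials instead of rescanning raw history. This works for any decomposable aggregate (here, a mean as sum over count).

```python
import pandas as pd

partials: dict[pd.Timestamp, dict] = {}  # day -> {"sum": ..., "count": ...}

def update_day(day: pd.Timestamp, events: pd.DataFrame) -> None:
    """Aggregate one day of raw events exactly once."""
    partials[day] = {"sum": events["amount"].sum(), "count": len(events)}

def rolling_mean(end_day: pd.Timestamp, window_days: int) -> float:
    """Reuse cached partials; no raw data is re-fetched."""
    days = [end_day - pd.Timedelta(days=i) for i in range(window_days)]
    parts = [partials[d] for d in days if d in partials]
    total = sum(p["sum"] for p in parts)
    count = sum(p["count"] for p in parts)
    return total / count if count else float("nan")
```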
Introducing our novel "Two-Dimensional ARIMA", a transformative approach that captures emerging trends influencing distant time horizons in cohort revenue forecasting. By integrating cohort attributes, recent data, and seasonal patterns, we've pioneered a method ensuring unparalleled forecasting adaptability and accuracy.
An introduction to solving combinatorial optimization and constraint satisfaction problems in Python. I will review the most popular libraries for SAT/CSP. We will then take a deep dive into a crash course on using Google's award-winning OR-Tools library to efficiently solve some non-trivial, real-world constrained combinatorial optimization problems.
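As a warm-up for the kind of modeling the crash course covers, a tiny CP-SAT example assuming `pip install ortools`: maximize a linear objective over integers under two constraints.

```python
from ortools.sat.python import cp_model

model = cp_model.CpModel()
x = model.NewIntVar(0, 10, "x")
y = model.NewIntVar(0, 10, "y")

model.Add(x + y <= 12)
model.Add(x - y >= 2)
model.Maximize(x + 2 * y)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    # Optimum here is x=7, y=5, objective 17.
    print(solver.Value(x), solver.Value(y), solver.ObjectiveValue())
```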
Deep learning algorithms often require multiple large datasets, with Parquet being one of the most prevalent formats for storing big data. While datasets are typically consumed by batch processing, where several tasks each process a small slice of the data, big data query engines like Trino and Spark are more efficient when working on large chunks of data. In this talk we will present a solution based on Parquet's row groups feature that allows both small and large amounts of data to be consumed quickly.
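The gist of row-group-level access with pyarrow, assuming `pip install pyarrow` and a placeholder file: a batch task can read just its own row group instead of the whole file.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("big_dataset.parquet")  # placeholder file
print(pf.num_row_groups, pf.metadata.num_rows)

# Each worker i consumes just row group i: a small, independent chunk.
table = pf.read_row_group(0, columns=["features", "label"])
print(table.num_rows)
```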
Comic Relief