PyData Tel Aviv 2023

Building a Reproducible LLM Fine-Tuning Pipeline with Hugging Face Transformers
11-14, 15:00–15:30 (Asia/Jerusalem), Red Track

In this talk we’ll dive into how classic ML and LLM approaches differ (model training vs. fine-tuning), what you need to know when employing each of these methods, and why reproducibility is important for fine-tuning, both for LLMs and for machine learning in general. We’ll demo this through a real code example using the popular Hugging Face Transformers Python library, covering the critical pieces that impact reproducibility: code and environment as well as data and model.


Anyone in the data engineering space has been watching the developments around LLMs (large language models) nearly as closely as those around generative AI more broadly. While LLMs represent a huge leap in AI capabilities, they require different handling: because base models are so large, most ML practitioners adapt an existing foundation model rather than training a new one from scratch.

In this talk we’ll dive into how these two approaches differ (model training vs. fine-tuning), what you need to know when employing each of these methods, and why reproducibility is important for fine-tuning, both for LLMs and for machine learning in general. We’ll demo this through a real code example using the popular Hugging Face Transformers Python library, covering the critical pieces that impact reproducibility: git (code and environment) and lakeFS (data and model).
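
For a taste of what the demo covers, here is a minimal sketch of a seeded fine-tuning run with Hugging Face Transformers. The checkpoint ("distilbert-base-uncased"), dataset ("imdb"), sample size, and hyperparameters are illustrative stand-ins, not the talk's actual demo code:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    set_seed,
)

set_seed(42)  # pins Python, NumPy, and PyTorch RNG state for repeatable runs

# Illustrative checkpoint; pinning the exact model version is part of
# making the run reproducible.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Illustrative dataset; in practice the data version must be pinned too.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",
    seed=42,  # recorded alongside the other training arguments
    per_device_train_batch_size=16,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()

Seeding alone isn't enough, of course: the talk's point is that the surrounding code and environment are versioned with git, while the training data and the resulting model artifacts are versioned with lakeFS, so the whole run can be reproduced later.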

Oz Katz is the CTO and Co-founder of Treeverse, the company behind lakeFS, an open source platform that delivers resilience and manageability to object-storage-based data lakes. Oz engineered and maintained petabyte-scale data infrastructure at analytics giant SimilarWeb, which he joined after the acquisition of Swayy.