Details

Welcome to a new live get-together of the global MLOps.community in Amsterdam. Together with our hosts, Nebius, we will enjoy a series of talks and ample time to socialize with others in the community!

Get your free ticket through lu.ma: https://lu.ma/rzl1hpfc

Schedule
18:00-18:30: Walk in + drinks and bites
18:30-19:00: Fail fast & recover faster: infrastructure resilience of multi-node LLM training - Filipp Fisin (Senior MLE @ Nebius)
19:00-19:30: Realtime Standby Energy Waste Prediction - Luka Sturtewagen (Principal DE @ Sensorfact)
19:30-21:30: Networking + drinks and bites

Sign-up instructions:

  • Sign up on the Meetup page
  • *NEW*: Get your free ticket via Lu.ma: https://lu.ma/rzl1hpfc
  • Let us know if you have any strict dietary restrictions (e.g. vegan🌱)
  • We are looking for speakers for upcoming events. If you would like to give a talk, let us know the topic and your contact information.

🎤 Talks

Talk 1: Fail fast & recover faster: infrastructure resilience of multi-node LLM training
Speaker: Filipp Fisin, Senior MLE @ Nebius
Training an LLM in a multi-node setup is a complex and expensive process. Training failures can't be eliminated, but the downtime they cause can be reduced.
In this talk, we provide an overview of techniques for more resilient training that we've found useful in our JAX-based multi-node training setup, namely:

  • multi-node training orchestration in Kubernetes via Argo with automatic failure recovery
  • a special type of Kubernetes health check that detects whether a training process is stuck
  • techniques to efficiently save and load terabyte-scale checkpoints
  • XLA compilation cache
  • GPU node monitoring and auto-cordoning
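
The abstract does not say how the stuck-process health check works. As a rough illustration only, one common pattern is a heartbeat file: the training loop touches a file after every step, and a probe reports failure when the file has not been updated for too long. The sketch below assumes that pattern; the function names, the file path, and the timeout are hypothetical and not taken from the talk.

```python
import os
import tempfile
import time
from typing import Optional


def touch_heartbeat(path: str) -> None:
    """Hypothetical helper: the training loop calls this after each completed step."""
    # Create the file if it does not exist, then set its mtime to "now".
    with open(path, "a"):
        os.utime(path, None)


def is_stuck(path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """Return True if the heartbeat file is missing or older than max_age_s seconds."""
    if now is None:
        now = time.time()
    try:
        age = now - os.path.getmtime(path)
    except FileNotFoundError:
        # No heartbeat has ever been written: treat the process as not making progress.
        return True
    return age > max_age_s
```

A Kubernetes exec liveness probe could run a small script that exits with a non-zero status when `is_stuck(...)` returns True, so the kubelet restarts a container that is alive but no longer stepping. The timeout would need to exceed the slowest legitimate step, including checkpoint saves and compilation.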

Talk 2: Realtime Standby Energy Waste Prediction
Speaker: Luka Sturtewagen, Principal DE @ Sensorfact
At Sensorfact, our mission is to minimize industrial waste, particularly in energy consumption, and in doing so help our customers raise the bar on their sustainability KPIs. For example, we measure our customers' energy usage at the individual machine level. Armed with this detailed, high-volume data, we give them tailored advice on reducing energy waste, covering areas such as energy use outside production hours, compressed air leakages, and suboptimal machine usage. We have ML models that detect standby energy waste in batch, and we recently reworked our pipeline so that it can also predict in real time. This allows us to provide our customers with immediate insights and alerts through our app, ultimately enabling proactive waste reduction strategies.
