Details

You voted, we listened! Announcing our 1st community lightning talks session! Each speaker will give a 13-minute talk plus 2 minutes of Q&A with the audience. Join in, ask questions!

Talk 1: Simplify Data Conversion from Spark to Deep Learning by Liang Zhang (https://www.linkedin.com/in/liangz1/)
In this talk, I will introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to TensorFlow & PyTorch.

Imagine you have a large dataset, say 20 GB, & you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean & preprocess it using Spark. But then you hit a problem: how do I convert my Spark DataFrame into a format my TensorFlow model recognizes? The existing data conversion process is tedious, & the engineering friction greatly reduces data scientists' productivity.

The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model & how simple it is to go from single-node training to distributed training on Databricks.
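For context, a minimal sketch of what the Converter API looks like in Python (the cache path, source data & toy model are illustrative assumptions, not material from the talk):

from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter
import tensorflow as tf

spark = SparkSession.builder.getOrCreate()

# Directory where the converter caches the DataFrame as Parquet (assumed path).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache")

# Assume a preprocessed Spark DataFrame with a numeric feature column x & a label column y.
df = spark.read.parquet("/data/preprocessed")  # hypothetical input

converter = make_spark_converter(df)

# TensorFlow path: the converter yields namedtuple batches named after the DataFrame columns.
with converter.make_tf_dataset(batch_size=64) as dataset:
    dataset = dataset.map(lambda batch: (tf.reshape(batch.x, [-1, 1]), batch.y))
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer="adam", loss="mse")
    model.fit(dataset, steps_per_epoch=100, epochs=1)

# PyTorch path: converter.make_torch_dataloader(batch_size=64) returns a DataLoader analogously.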

Talk 2: The critical missing component in the Production ML stack by Alessya Visnjic (https://www.linkedin.com/in/alessya/)
Abstract: The day the ML application is deployed to production & begins facing the real world is both the best & the worst day in the life of the model builder. The joy of seeing accurate predictions is quickly overshadowed by a myriad of operational challenges. Debugging, troubleshooting & monitoring take over the majority of their day, leaving little time for model building. In DevOps, software operations have been elevated to an art: sophisticated tools enable engineers to quickly identify & resolve issues, continuously improving software stability & robustness. In the ML world, operations are still largely a manual process that involves Jupyter notebooks & shell scripts. One of the cornerstones of the DevOps toolchain is logging. Traces & metrics are built on top of logs, enabling monitoring & feedback loops. What does logging look like in an ML system?

In this talk I will show you how to enable data logging for an AI application using PyTorch & MLflow in a matter of minutes. Attendees will leave the talk equipped with tools & best practices to supercharge MLOps in their team.
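As a taste of the MLflow side, here is a minimal sketch of logging metrics & a model around a PyTorch training loop (the toy model, data & hyperparameters are assumptions; the talk's own data-logging tooling may go further):

import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                              # toy model (assumed)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)      # synthetic data (assumed)

with mlflow.start_run():
    mlflow.log_param("lr", 0.01)
    for epoch in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        # Per-epoch metrics become the logs that monitoring & feedback loops build on.
        mlflow.log_metric("train_loss", loss.item(), step=epoch)
    # Store the trained model as an MLflow artifact alongside the logged metrics.
    mlflow.pytorch.log_model(model, "model")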

Talk 3: Pipeline branching optimization in Apache Spark by Shivangi Srivastava (https://www.linkedin.com/in/shivangi-srivastava1/)
Abstract: Pipeline branching is a common data engineering use case: data is sent to multiple downstream pipelines to apply different transformations or to load to different targets. A construct like a router is often used to control which downstream pipelines a row should be sent to; the router directs data to different pipelines based on an expression or filter condition. Spark is not designed to handle such use cases very efficiently because of its action-driven, lazy evaluation model. For example, out of the box, Spark executes a job that has multiple branches by generating multiple Spark jobs (one for each loading target), with each job repeating the logic of the shared upstream pipeline, & it runs these jobs sequentially. This results in poor performance when the computation cost of the common pipeline is high. Data consistency can also be a problem if some transformation in the common pipeline doesn't produce deterministic output, such as random data or UUID generation per row. I'll show how we solved both the performance & functional issues at Informatica by parallelizing the pipelines & adding advanced filtered persistence.
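To make the branching problem concrete, here is a minimal PySpark sketch of a router-style pipeline that persists the shared upstream result & writes the branches concurrently (paths, filter conditions & storage level are illustrative assumptions, not Informatica's implementation):

from concurrent.futures import ThreadPoolExecutor
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Shared upstream pipeline: expensive transforms plus a non-deterministic column.
common = (spark.read.parquet("/data/events")              # hypothetical source
          .withColumn("row_id", F.expr("uuid()"))         # non-deterministic per row
          .withColumn("amount_usd", F.col("amount") * F.col("fx_rate")))

# Persist once so every branch reuses the same computed rows (& the same UUIDs)
# instead of re-running the upstream logic for each loading target.
common.persist(StorageLevel.MEMORY_AND_DISK)

# Router-style branches: each filter condition feeds a different target.
branches = {
    "/out/high_value": common.filter(F.col("amount_usd") >= 1000),
    "/out/low_value": common.filter(F.col("amount_usd") < 1000),
}

# Submit the branch writes from separate threads so Spark can run them in parallel.
with ThreadPoolExecutor(max_workers=len(branches)) as pool:
    futures = [pool.submit(df.write.mode("overwrite").parquet, path)
               for path, df in branches.items()]
    for f in futures:
        f.result()

common.unpersist()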

Talk 4: Excel on Endpoints by Franco Patano (https://www.linkedin.com/in/francopatano/)
Abstract: more details coming soon!
