Large scale weak supervision & machine translation (tentative)

DataKRK (formerly Cracow Hadoop User Group)

Community Hub Kraków

Podwale 3 · Kraków

How to find us

Please select "6" on the intercom and head to the 3rd floor (there's no elevator in the building).




18:00 Networking
18:30 Large Scale Weak Supervision with Snorkel and Apache Beam by Suneel Marthi
19:30 Break
19:45 Machine Translation (tentative)

Note: "Scalable recommendations in a hybrid environment" by Mikolaj had to be postponed at the last minute due to illness. We will probably have another talk, on machine translation, but this will be confirmed just before the event.

1. Large Scale Weak Supervision with Snorkel and Apache Beam

The advent of Deep Learning models has led to a massive growth of real-world machine learning. These models rely on massive hand-labeled training datasets, which are a bottleneck in developing and modifying machine learning models.

Most large-scale Machine Learning systems today, like Google’s DryBell, use some form of Weak Supervision to construct lower-quality, large-scale training datasets that can be used to continuously retrain and deploy models in real-world scenarios.

The challenge with continuous retraining is that one needs to maintain prior state (e.g., the labeling functions in the case of Weak Supervision, or a pre-trained model like BERT or Word2Vec for Transfer Learning) that is shared across multiple streams, while continuously updating the model. Apache Beam’s Stateful Stream processing capabilities are a perfect match here, including support for scalable Weak Supervision.

The audience will come away with a better understanding of how Weak Supervision with Apache Beam’s stateful stream processing can be used to accelerate the labeling of training data and to train and update machine learning models in real time.
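For readers new to the idea, the weak supervision approach mentioned in this abstract can be sketched in a few lines of plain Python. This is only an illustration, not the speaker's code and not the Snorkel or Apache Beam API: the labeling functions, the `weak_label` helper, and the spam/ham task are all hypothetical, and a simple majority vote stands in for the statistical label model that a tool like Snorkel would normally fit.

```python
from collections import Counter

# Label values for a toy spam-detection task (hypothetical example).
ABSTAIN, HAM, SPAM = -1, 0, 1


# Labeling functions: cheap, noisy heuristics that either vote or abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def weak_label(text):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between labels: no confident weak label
    return ranked[0][0]


def build_training_set(texts):
    """Keep only the examples that received a (noisy) weak label."""
    labeled = ((text, weak_label(text)) for text in texts)
    return [(text, label) for text, label in labeled if label != ABSTAIN]
```

Calling `build_training_set` on a list of unlabeled messages yields a noisy but cheaply obtained training set. In a streaming setting, the same labeling step would be applied to each incoming record.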


Suneel is a Member of the Apache Software Foundation and a Committer and PMC member on Apache Mahout, Apache OpenNLP, and Apache Streams. He presently works as a Principal Technologist – AI/ML at Amazon Web Services. He has previously presented at Flink Forward, Hadoop Summit Europe, Berlin Buzzwords, Machine Learning Conference, and Apache Big Data. He is based in Dulles, Virginia, in the Washington DC Metro area.


2. Machine translation (tentative)