Yotpo Engineering #3 - Advanced Solutions for Big Data Orchestration


Details
Overview:
We are happy to invite you to our 3rd meetup, which will be hosted by our Data Group. In this meetup, we will focus on technologies for building and managing a full data orchestration stack.
We are also happy to host a talk by Yuval Carmel from Singular, who will share how Singular adapted and incorporated Celery into their big data operation.
Please note: the lectures will be given in Hebrew.
Agenda:
18:00 - 18:30: Gathering
18:30 - 18:55: Talk #1: Managing Your Big Data Operation Using Airflow
18:55 - 19:20: Talk #2: The Benefits of Running Spark on Nomad
19:20 - 19:35: Break
19:35 - 20:00: Talk #3: Advanced Celery Tricks
20:00+: Mingling
Abstracts:
18:30-18:55 - Managing Your Big Data Operation Using Airflow
Apache Airflow is an open-source platform that allows developers to easily create and manage their data pipelines. Airflow's infrastructure provides great flexibility in how these pipelines are built. However, there are a few fundamental concepts - such as scheduling and scaling - that must be understood in order to realize Airflow's full potential.
In this talk, we will explain how Airflow works, describe the potential pitfalls of working with it, and present our best-practice recommendations.
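For readers unfamiliar with Airflow, here is a minimal sketch of a DAG with two dependent tasks on a daily schedule (an illustrative example, not material from the talk; it uses the Airflow 1.x Python API):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract():
        print("extracting...")

    def load():
        print("loading...")

    # The schedule_interval below is the kind of scheduling concept the talk covers.
    dag = DAG(
        dag_id="example_pipeline",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
    load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

    extract_task >> load_task  # load runs only after extract succeeds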
Nadav Bar-Uryan,
Data Engineer at Yotpo
18:55-19:20 - The Benefits of Running Spark on Nomad
Nowadays, many of an organization’s main applications rely on Spark pipelines. As these applications become more significant to businesses, so does the need to quickly deploy, test and monitor them.
The standard way of running Spark jobs is to deploy them on a dedicated managed cluster. However, this solution is relatively expensive and can involve significant setup time. Therefore, we developed a way to run Spark on any container orchestration platform. This allows us to run Spark in a simple, custom and testable way.
In this talk, we will present our open-source Docker images for running Spark on Nomad servers. We will cover:
- The issues we had running Spark on managed clusters and the solution we developed.
- How to build a Spark Docker image.
- And finally, what you can achieve by running Spark on Nomad (a minimal job sketch follows below).
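As a taste, this is the kind of self-contained PySpark job one might package into such a Docker image (a hedged sketch; the job itself is illustrative and not taken from the talk):

    from pyspark.sql import SparkSession

    # Build a SparkSession; when containerized, the master URL and other
    # settings would typically be injected by the orchestration platform.
    spark = (
        SparkSession.builder
        .appName("example-job")
        .getOrCreate()
    )

    df = spark.range(1000)  # toy dataset standing in for real input
    print(df.selectExpr("sum(id) AS total").collect())

    spark.stop()

Once baked into an image, a job like this can be scheduled by Nomad (or any other container orchestrator) as an ordinary containerized workload.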
Shir Bromberg,
Big Data Developer at Yotpo
19:35-20:00 - Advanced Celery Tricks
At Singular, we have a data pipeline that consists of hundreds of thousands of daily tasks of varying lengths (from less than a second to hours per task), with complex dependencies between them. In addition, we integrate with hundreds of third-party providers, which means that tasks are not always reliable or predictable, so we need to be robust to failures and delays and be able to monitor them easily.
We found Celery to be well suited to our needs as a task infrastructure, especially due to its distributed nature, its support for various workflows and its modular design. In particular, the fact that it is compatible with multiple technologies for conveying messages ("brokers") and storing results ("backends") greatly appealed to us.
It wasn't an immediate fit, however. We needed to extend Celery to fit our use cases:
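To make the workflow primitives concrete, here is a minimal sketch of a Celery group feeding a chord (illustrative only; the broker and backend URLs are placeholders):

    from celery import Celery, chord, group

    app = Celery(
        "tasks",
        broker="redis://localhost:6379/0",   # placeholder broker
        backend="redis://localhost:6379/1",  # placeholder result backend
    )

    @app.task
    def fetch(provider_id):
        # Stand-in for pulling data from a third-party provider.
        return [provider_id, provider_id + 1]

    @app.task
    def total(results):
        # Chord body: runs once every fetch in the group has finished.
        return sum(sum(r) for r in results)

    # group runs the fetches in parallel; the chord aggregates their results.
    # (A chain would instead run tasks one after another.)
    workflow = chord(group(fetch.s(i) for i in range(10)), total.s())
    result = workflow.apply_async()  # result.get() would return the aggregated total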
(1) We implemented a custom backend and a custom serialization method (see the sketch after this list).
(2) We tweaked the behavior of Celery's workflows (chains, groups and chords).
(3) We needed to be able to update code easily without restarting workers.
(4) And more.
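In the spirit of point (1), this is roughly what registering a custom serializer with Celery looks like (a hedged sketch; the compressed-JSON encoding here is purely illustrative, not Singular's actual method):

    import json
    import zlib

    from kombu.serialization import register

    def compressed_encode(obj):
        # Serialize to JSON, then compress to shrink message payloads.
        return zlib.compress(json.dumps(obj).encode("utf-8"))

    def compressed_decode(data):
        return json.loads(zlib.decompress(data).decode("utf-8"))

    register(
        "compressed-json",  # name referenced from the Celery settings
        compressed_encode,
        compressed_decode,
        content_type="application/x-compressed-json",
        content_encoding="binary",
    )

    # A Celery app could then opt in with:
    #   app.conf.task_serializer = "compressed-json"
    #   app.conf.accept_content = ["compressed-json"]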
In this session, we will discuss how we adapted Celery to our needs, the tools we developed to work with it more effectively, and various advanced tips and tricks.
Yuval Carmel,
R&D Team Leader at Singular
