Spark on Mesos - "The Road Less Travelled" & Profiling users using Spark
Details
18:00 - 18:30 - Mingling
18:30 - 19:30 - Spark on Mesos - "The Road Less Travelled" (Eng) - Morri Feldman @ AppsFlyer
As a startup, we've had the luxury of developing our batch processing infrastructure from scratch, which allowed us to incorporate some unconventional combinations of technologies. I will outline our current infrastructure: Spark running over Mesos, with data stored exclusively on S3 as a mixture of raw data in Hadoop sequence files and Parquet files. I will also explain the advantages it offers us over a more typical setup with Spark running on top of YARN backed by HDFS. However, running Spark in this way has not been without challenges and a few setbacks. I will highlight a few of the larger problems we encountered and what we did to solve them. Despite the challenges, choosing Spark has opened up many possibilities for us. To highlight the performance and flexibility we gained by using Spark, I will dive into one process, Retention, that we originally implemented using Cascalog (Datalog translated into Cascading/Hadoop) and then rewrote as a Spark job.
Morri's bio -
Morri studied epigenetics as a post-doc at the Weizmann Institute and has a
PhD in Biophysics from the University of California, San Francisco. He left the world of academia to crack Big Data problems, which is why he joined the AppsFlyer Dev team.
19:30 - 19:40 - Coffee Break
19:40 - 20:40 - Profiling users with Spark and Elasticsearch (Heb) - Itai Yaffe, a Big Data Infrastructure team member @ eXelate
In this session, we'll present how eXelate uses Spark and Elasticsearch to profile users and
answer questions such as: How many internet users are men who live in the US and were
interested in travel this month?
As both of these engines are currently "hot trends" in the Big Data world, we'll review how we
combine them, including:
- Processing the data using Spark
- Indexing the processed data directly into Elasticsearch using the
 elasticsearch-hadoop plug-in for Spark
- Managing the flow using some of the services provided by AWS (EMR,
 Data Pipeline, etc.)
We'll provide some tips and discuss some of the pitfalls we encountered while setting up this process.

