Spark on Mesos - "The Road Less Travelled" & Profiling users using Spark

Hosted By
Demi B. and Shlomi H.

Details

18:00 - 18:30 - Mingling

18:30 - 19:30 - Spark on Mesos - "The Road Less Travelled" (Eng) - Morri Feldman @ AppsFlyer

As a startup, we've had the luxury of developing our batch processing infrastructure from scratch, which allowed us to incorporate some unconventional combinations of technologies. I will outline our current infrastructure: Spark running over Mesos, with data stored exclusively on S3 as a mixture of raw data in Hadoop sequence files and Parquet files. I will also explain the advantages it offers us over a more typical setup of Spark running on top of YARN backed by HDFS. However, running Spark in this way has not been without challenges and a few setbacks. I will highlight a few of the larger problems we encountered and what we did to solve them. Despite the challenges, choosing Spark has opened up many possibilities for us. To highlight the performance and flexibility we gained by using Spark, I will dive into one process, Retention, that we originally implemented using Cascalog (Datalog translated into Cascading/Hadoop) and then rewrote as a Spark job.

Morri's bio -
Morri studied epigenetics as a post-doc at the Weizmann Institute and has a
PhD in Biophysics from the University of California, San Francisco. He left the world of academia to crack Big Data problems. That's why he joined the AppsFlyer Dev team.

19:30 - 19:40 - Coffee Break

19:40 - 20:40 - Profiling users with Spark and Elasticsearch (Heb) - Itai Yaffe, a Big Data Infrastructure team member @ eXelate

In this session, we'll present how eXelate uses Spark and Elasticsearch to profile users and
answer questions such as: How many internet users are men who live in the US and were
interested in traveling this month?
As both of these engines are a "hot trend" in the Big Data world, we'll review how we
combine them, including:

  • Processing the data using Spark
  • Indexing the processed data directly into Elasticsearch using the
    elasticsearch-hadoop plug-in for Spark
  • Managing the flow using some of the services provided by AWS (EMR,
    Data Pipeline, etc.)
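
To make the kind of profiling question mentioned above concrete, here is a minimal sketch in plain Python. It is not eXelate's implementation: in the setup described, Spark would do the processing and Elasticsearch would answer the query. The `Profile` fields, the `count_matching` helper, and the sample records are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Profile:
    """Hypothetical user profile; field names are assumptions, not a real schema."""
    user_id: str
    gender: str
    country: str
    interests: frozenset
    last_seen: date


def count_matching(profiles, gender, country, interest, year, month):
    """Count profiles that match every filter and were seen in the given month."""
    return sum(
        1
        for p in profiles
        if p.gender == gender
        and p.country == country
        and interest in p.interests
        and (p.last_seen.year, p.last_seen.month) == (year, month)
    )


# Made-up sample data, for illustration only.
PROFILES = [
    Profile("u1", "male", "US", frozenset({"travel", "sports"}), date(2015, 6, 3)),
    Profile("u2", "male", "US", frozenset({"travel"}), date(2015, 6, 20)),
    Profile("u3", "female", "US", frozenset({"travel"}), date(2015, 6, 11)),
    Profile("u4", "male", "US", frozenset({"travel"}), date(2015, 5, 30)),
    Profile("u5", "male", "IL", frozenset({"travel"}), date(2015, 6, 9)),
]

if __name__ == "__main__":
    # "How many users are men who live in the US and were interested in travel this month?"
    print(count_matching(PROFILES, "male", "US", "travel", 2015, 6))
```

A linear scan like this is only reasonable for a handful of records; the point of indexing the processed data into a search engine is to answer the same filtered count without scanning every profile.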

We'll provide some tips and discuss some of the pitfalls we encountered while setting up this process.

Big Things
Google Campus TLV (26th floor)
Electra Tower, 98 Yigal Alon · Tel Aviv-Yafo