Past Meetup

Productionizing Spark Streaming, Tableau Spatial Queries, Spark Search Indexing

156 people went


Talks start at 7:10pm

Productionizing a 24/7 Spark Streaming service on YARN

Issac Buenrostro, Arup Malakar (Ooyala)

At Ooyala we must process over two billion video events a day and provide rich, near real-time, always-available analytics to thousands of customers. Spark Streaming is core to our state-of-the-art ingestion pipeline. In developing this system we encountered and resolved many undocumented challenges, which we would like to share: What are the challenges and lessons from productionizing a Spark Streaming pipeline on YARN? How do you ensure 24/7 availability and fault tolerance? What are the best practices for Spark Streaming and its integration with Kafka and YARN? How do you monitor and instrument the various stages of the pipeline? We will dive into all these topics and more.

Issac Buenrostro is a software engineer at Ooyala creating a new ingestion system for video analytics events using Spark, YARN, Thrift, and Parquet. Before Ooyala he obtained a bachelor's degree from MIT and a master's from Stanford in applied mathematics, working on high-performance scientific computing.

Arup Malakar works on the next-gen ETL pipeline for analytics at Ooyala, using Spark Streaming, YARN, and Kafka. Before Ooyala he contributed to Apache Hive and HCatalog and helped build the hosted platform for processing feeds at Yahoo! Arup holds a Bachelor's in Computer Science from IIT Guwahati.

Speaker #2: Spatial Analytics Demo

Spatial Analytics demo using Tableau as a follow-on to the Hive Spatial Queries posted here:

The goal is to get people started on Tableau visualizations for individual projects.

Demo and short tutorial on how to clean and format data and produce visualizations, presented by Elaine Chen, PM for the Tableau online product.

Speaker #3: Streamlining Search Indexing using Elastic Search and Spark

Holden Karau (Databricks)

Everyone who has maintained a search cluster knows the pain of keeping online update code and offline reindexing pipelines in sync. Subtle bugs can pop up when data is indexed differently depending on the context. By using Spark and Spark Streaming we can reuse the same indexing code in both contexts and even reduce overhead by talking directly to the correct indexing node.

Sometimes we need to use search data as part of our distributed MapReduce jobs. We will illustrate how to use Elastic Search as a side data source with Spark.

We will also illustrate both of these tasks in two real examples using the Twitter firehose. In the first we will index tweets in a geospatial context and in the second we will use the same index to determine the top hashtags per region.

Holden Karau is a software development engineer at Databricks and is active in open source. She is the author of Fast Data Processing With Spark. Prior to Databricks she worked on a variety of search and classification problems at Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, and hula hooping.