Past Meetup

Databricks comes to Barcelona

61 people went

Price: €2.00 per person

This fourth meeting will feature Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) and Paco Nathan (Community Evangelism Director at Databricks), speaking about 'Building a Unified Data Pipeline in Spark' (talk in English). The talk will take place next Thursday, 20 November, at 18:30 in the sala de actos of the FIB (Campus Nord, UPC). We look forward to seeing you all. Don't miss it!

[ abstract ]

One of the promises of Apache Spark is to let users build unified data analytic pipelines that combine diverse processing types. In this talk, we’ll demo this live by building a machine learning pipeline with 3 stages: ingesting JSON data from Hive; training a k-means clustering model; and applying the model to a live stream of tweets. Typically this pipeline might require a separate processing framework for each stage, but we can leverage the versatility of the Spark runtime to combine Shark, MLlib, and Spark Streaming and do all of the data processing in a single, short program. This allows us to reuse code and memory between the components, improving both development time and runtime efficiency. Spark as a platform integrates seamlessly with Hadoop components, running natively in YARN and supporting arbitrary Hadoop InputFormats, so it brings the power to build these types of unified pipelines to any existing Hadoop user.

This talk will be a fully live demo and code walkthrough where we’ll build up the application throughout the session, explain the libraries used at each step, and finally classify raw tweets in real time.
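
[ sketch ]

For readers who want a feel for the three stages in the abstract before the talk, here is a minimal plain-Python sketch. It does not use Spark, Shark, MLlib, or Spark Streaming, and it is not the speakers' demo code: the sample records, the hashed bag-of-words features, the `DIM` constant, and every function name below are assumptions made only for illustration. A list of JSON strings stands in for the Hive table, and a Python iterable stands in for the live tweet stream.

```python
import json
import math
import random
import zlib

DIM = 16  # size of the hashed feature vector (arbitrary choice for this sketch)


def featurize(text):
    """Hash each word into a fixed-size bag-of-words vector, then L2-normalize."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, point):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], point))


def train_kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means: assign points to centers, then recompute means."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def classify_stream(centers, tweets):
    """Lazily label each incoming tweet with the index of its nearest cluster."""
    for text in tweets:
        yield text, nearest(centers, featurize(text))


# Stage 1: ingest JSON records (hypothetical sample data, standing in for Hive).
raw_records = [
    '{"text": "spark streaming makes pipelines simple"}',
    '{"text": "unified pipelines with spark"}',
    '{"text": "great football match tonight"}',
    '{"text": "what a goal in the football match"}',
]
history = [json.loads(line)["text"] for line in raw_records]

# Stage 2: train a k-means model on the historical data.
centers = train_kmeans([featurize(t) for t in history], k=2)

# Stage 3: apply the model to new tweets as they arrive.
incoming = ["spark pipelines demo", "football tonight"]
labelled = list(classify_stream(centers, incoming))
for text, cluster in labelled:
    print(cluster, text)
```

The point the abstract makes is that all three stages can share code and data within one program; in this sketch that shows up as the same `featurize` function being reused for both training and stream classification.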