Introduction to Spark

255 people went

Overview

We are pleased to have Gwen Shapira from Cloudera (http://www.cloudera.com/) speak about Apache Spark (http://spark.apache.org/), and Jared Poelman and Mark Nelson speak about the practical implementation of Spark at Trueffect (http://www.trueffect.com/). We expect this to be an excellent session and look forward to seeing you on April 23rd! Many thanks also go to Oracle (http://www.oracle.com/) for providing the facilities for this event. Oracle is currently hiring; if you are interested, please find Kevin Markey at the meetup, or reach him at [masked].

Technical Presentation: Introduction to Spark

It is our great pleasure to have Cloudera and Gwen Shapira present on Spark. Apache Spark is an open source cluster computing system that aims to make data analytics fast: both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce. To make programming faster, Spark provides clean, concise APIs in Python, Scala and Java. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

The presentation will explain the Spark programming model and the reasons to use Spark. We will dive into specific use cases and code examples, and finish by discussing one of the most exciting projects using Spark: Spark Streaming.

Use Case Presentation: Spark at Trueffect

Trueffect uses Spark (along with Hadoop) to analyze our ad server logs. We take advantage of Spark’s ability to cache data in memory to run iterative algorithms in hours that would take days to run on Hadoop. Spark’s improved performance also lets us run empirical performance analyses that would be prohibitively slow and expensive in Hadoop. We are also exploring the use of Spark Streaming to reduce processing latency.
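To give a feel for why in-memory caching matters for iterative algorithms, here is a minimal sketch in plain Python. It is a toy model only: it does not use Spark's API and is not Trueffect's code. The names ToyDataset, iterative_mean and read_log_values are hypothetical, invented for illustration. The sketch counts how often the source data is re-read when every iteration makes a full pass over it, with and without a cached in-memory copy.

```python
"""Toy model of in-memory caching for an iterative job (plain Python, not Spark)."""


class ToyDataset:
    """Hypothetical stand-in for a distributed dataset; not a Spark class."""

    def __init__(self, load):
        self._load = load        # function that simulates reading from disk
        self._cached = None      # in-memory copy, filled on first read after cache()
        self._use_cache = False
        self.loads = 0           # how many times the source was actually read

    def cache(self):
        """Ask for the data to be kept in memory after the first read."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the records, re-reading the source unless a cached copy exists."""
        if self._cached is not None:
            return self._cached
        self.loads += 1
        data = list(self._load())
        if self._use_cache:
            self._cached = data
        return data


def iterative_mean(dataset, iterations=10, step=0.5):
    """Estimate the mean by repeatedly nudging a guess toward the data.

    Each iteration makes one full pass over the dataset.
    """
    guess = 0.0
    for _ in range(iterations):
        values = dataset.collect()
        gradient = sum(guess - v for v in values) / len(values)
        guess -= step * gradient
    return guess


def read_log_values():
    """Pretend source read; a real job would parse log files here."""
    return [2.0, 4.0, 6.0, 8.0]


uncached = ToyDataset(read_log_values)
cached = ToyDataset(read_log_values).cache()

mean_uncached = iterative_mean(uncached)
mean_cached = iterative_mean(cached)

# Same answer either way, but the cached dataset reads its source only once.
print(uncached.loads, cached.loads)  # prints "10 1"
```

In this toy, ten iterations cost ten source reads without caching and one with it. On a real cluster each avoided read is a pass over disk-resident data, which is the kind of saving the paragraph above describes.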

Agenda:

6:00 – 6:30 – Socialize over food and drink

6:30 – 6:45 – Announcements, Upcoming Events

6:45 – 7:45 – Spark – Gwen Shapira

7:45 – 8:15 – Use Case: Spark – Jared Poelman and Mark Nelson

8:15 – 8:30 – ??? - Continued socializing

About the Presenters:

Gwen Shapira, Cloudera

Gwen is a Solutions Architect at Cloudera. She has 15 years of experience working with customers to design scalable data architectures. Having worked as a data warehouse DBA, an ETL developer and a senior consultant, she specializes in migrating data warehouses to Hadoop, integrating Hadoop with relational databases, building scalable data processing pipelines, and scaling complex data analysis algorithms. In addition, Gwen is a frequent speaker at industry conferences and maintains a popular blog.

Jared Poelman, Trueffect

Jared is a software engineer at Trueffect, specializing in data processing applications within the Spark and Hadoop ecosystems. Prior to Trueffect, he built software for Volkswagen Group’s experimental research vehicles in Germany and California. He is a native of San Jose, California.

Mark Nelson, Trueffect

Mark is a senior software engineer at Trueffect, working primarily with Spark and Hadoop. He has worked with parallel systems since his first real job, at Cray Research in 1987. He has moved too many times to be a native of anywhere, but he currently lives in Fort Collins.

See you then!

Brett and Andy