Realtime Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL/DataFrames

229 people went

PagerDuty

501 2nd St., Suite 100 · San Francisco, CA

How to find us

Please be sure to enter the PagerDuty office through the side door on 2nd Street. The metal gate will be open. Signs will be posted from there. Bathrooms require a fob, but PagerDuty staff will be standing nearby to let you in.

Details

The inaugural session of the Advanced Apache Spark Meetup is starting out with a bang!

We'll present a real-world, open source advanced analytics and machine learning pipeline using *all 20* open source technologies listed below.

This Meetup is based on my recent "Top-5" Hadoop Summit/Data Science talk called "Spark After Dark". Spark After Dark is a mock online dating site that uses Spark, Spark SQL, DataFrames, MLlib, GraphX, Cassandra, and ElasticSearch - among many other technologies listed below - to generate quality, real-time dating recommendations for its users.

Here are the Spark After Dark slides: http://www.slideshare.net/cfregly/spark-after-dark-real-time-advanced-analytics-and-machine-learning-with-spark

All code - and the entire pipeline runtime - will be Dockerized and made publicly available on GitHub and the Docker Hub Registry.

Technologies to be demo'd:
1) Apache Zeppelin (notebook-based development)

2) Apache Spark SQL/DataFrames (Data Analysis and ETL)

3) Apache Spark Streaming + Apache Kafka (Real-time Collection of Live Data from Interactive Demo)

4) Spark Streaming + Real-time Machine Learning (K-Means Clustering, Logistic/Linear Regression)

5) Apache Spark MLlib + GraphX (Generate personalized and non-personalized recommendations using various algorithms and feature engineering techniques, including one-hot encoding)

6) MLlib + PMML Integration (Open Standard Markup Language for Predictive Models)

7) Highly scalable, NetflixOSS-based Machine Learning Prediction Serving Layer, including Service Discovery (Eureka) and Circuit Breakers (Hystrix) for Fault Tolerance

8) Zeppelin + Python-based scikit-learn Machine Learning

9) Spark + Neo4j = MazeRunner (Real-time Neo4j Graph Updates Beyond GraphX Batch Analytics)

10) SparkR (Distributed R algorithms)

11) Apache Spark JDBC/ODBC Thrift Server (Beeline and Tableau Analytics Explorer Integration)

12) Tachyon (Off-heap storage)

13) Spark Job Server (REST API for managing Spark jobs)

14) Spark + Cassandra (NoSQL, Lambda Arch Speed Layer)

15) Spark + ElasticSearch (Distributed Search Engine)

16) Spark + Redis (Distributed, Persistent Key-Value Store Similar to Memcached)

17) Logstash (Log Agent + Collection)

18) Kibana (ElasticSearch-based Analytics Explorer UI)

19) HDFS + Parquet (Columnar Storage Format, Tight Compression, Lightning Fast Columnar Aggregations)

20) Advanced visualizations within Zeppelin using Python-based matplotlib and ggplot

Reminder: we'll be Dockerizing *everything* for you to reuse.

Keep an eye on the GitHub and Docker Hub Registry links under the project name "fluxcapacitor":

1) https://github.com/fluxcapacitor

2) https://registry.hub.docker.com/repos/fluxcapacitor/

Bonus: Free 30-day Trial @ www.databricks.com

Databricks Cloud Notebook-based Development and Cluster Management.

Thanks, Databricks!

See everyone soon!