Realtime Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL/DataFrames

This is a past event

306 people went

Details

[Replay for the Peninsula/South Bay]

We'll present a real-world, open source advanced analytics and machine learning pipeline using all 15 open source technologies listed below.

This Meetup is based on my recent "Top-5" Hadoop Summit/Data Science talk called "Spark After Dark". Spark After Dark is a mock online dating site that uses Spark, Spark SQL, DataFrames, MLlib, GraphX, Cassandra, and ElasticSearch - among many other technologies listed below - to generate quality, real-time dating recommendations for its users.

Here are the Spark After Dark slides: http://www.slideshare.net/cfregly/spark-after-dark-real-time-advanced-analytics-and-machine-learning-with-spark

All code - and the entire pipeline runtime - will be dockerized and made publicly available on GitHub and the Docker Hub Registry.

Technologies to be demo'd:

1) Apache Zeppelin and Spark-Notebook (notebook-based development)

2) Apache Spark SQL/DataFrames (Ad hoc Data Analysis and ETL)

3) Apache Spark Streaming + Apache Kafka (Real-time Collection of Live Data from Interactive Demo)

4) Spark Streaming + Real-time Machine Learning (K-Means Clustering, Log/Lin Regression)

5) Apache Spark MLlib + GraphX (Generate personalized and non-personalized recommendations using various algorithms and feature engineering techniques including one hot encoding)

6) MLlib + PMML Integration (Open Standard Markup Language for Predictive Models)

7) Zeppelin + Python-based scikit-learn Machine Learning + Advanced Visualizations with matplotlib and ggplot

8) SparkR (Distributed R algorithms)

9) Apache Spark JDBC/ODBC Thrift Server (Beeline and Tableau Analytics Explorer Integration)

10) Tachyon (Off-heap storage)

11) Spark + Cassandra

12) Spark + ElasticSearch (Distributed Search Engine)

13) Logstash (Log Agent + Collection)

14) Kibana (ElasticSearch-based Analytics Explorer UI)

15) HDFS + Parquet (Columnar Storage Format, Tight Compression, Lightning Fast Columnar Aggregations)
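
Item 5 above mentions one hot encoding as a feature engineering technique. As a rough illustration only, and not the code from the Spark After Dark pipeline, here is a minimal plain-Python sketch of the idea. The `one_hot_encode` helper and the `interests` values are hypothetical names made up for this example; Spark MLlib ships its own encoder, which this sketch does not use so that it stays self-contained.

```python
def one_hot_encode(values):
    """Turn each categorical value into a 0/1 vector containing a single 1."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical feature taken from user profiles.
interests = ["hiking", "movies", "hiking", "cooking"]
categories, encoded = one_hot_encode(interests)

print(categories)
for row in encoded:
    print(row)
```

Each distinct value gets its own column, so a recommendation algorithm can treat a categorical field as numeric input without implying any ordering between the categories.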

Reminder that we've Docker-ized *everything* for you to take home.

Here are the GitHub repo and Docker Hub Registry links:

1) https://github.com/fluxcapacitor

2) https://registry.hub.docker.com/repos/fluxcapacitor/

Feel free to clone and contribute! Every contributor will be made a committer.

Bonus: Free 30-day Trial @ www.databricks.com

Databricks Cloud Notebook-based Development and Cluster Management.

Thanks, Databricks!

See everyone soon!