
Realtime Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL/DataFrames

Hosted By
Chris F.

Details

[Replay for the Peninsula/South Bay]

We'll present a real-world, open source advanced analytics and machine learning pipeline that uses all 15 of the open source technologies listed below.

This Meetup is based on my recent "Top-5" Hadoop Summit/Data Science talk, "Spark After Dark". Spark After Dark is a mock online dating site that uses Spark, Spark SQL, DataFrames, MLlib, GraphX, Cassandra, and ElasticSearch, along with the other technologies listed below, to generate quality, real-time dating recommendations for its users.

Here are the Spark After Dark slides: http://www.slideshare.net/cfregly/spark-after-dark-real-time-advanced-analytics-and-machine-learning-with-spark

All code, along with the entire pipeline runtime, will be dockerized and made publicly available on GitHub and the Docker Hub Registry.

Technologies to be demo'd:

  1. Apache Zeppelin and Spark-Notebook (notebook-based development)

  2. Apache Spark SQL/DataFrames (Ad hoc Data Analysis and ETL)

  3. Apache Spark Streaming + Apache Kafka (Real-time Collection of Live Data from Interactive Demo)

  4. Spark Streaming + Real-time Machine Learning (K-Means Clustering, Logistic/Linear Regression)

  5. Apache Spark MLlib + GraphX (Generate personalized and non-personalized recommendations using various algorithms and feature engineering techniques including one hot encoding)

  6. MLlib + PMML Integration (Open Standard Markup Language for Predictive Models)

  7. Zeppelin + Python-based scikit-learn Machine Learning + Advanced Visualizations with matplotlib and ggplot

  8. SparkR (Distributed R algorithms)

  9. Apache Spark JDBC/ODBC Thrift Server (Beeline and Tableau Analytics Explorer Integration)

  10. Tachyon (Off-heap storage)

  11. Spark + Cassandra

  12. Spark + ElasticSearch (Distributed Search Engine)

  13. Logstash (Log Agent + Collection)

  14. Kibana (ElasticSearch-based Analytics Explorer UI)

  15. HDFS + Parquet (Columnar Storage Format, Tight Compression, Lightning Fast Columnar Aggregations)
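
As a rough illustration of the one hot encoding mentioned in item 5, here is a minimal sketch in plain Python rather than MLlib. The function name `one_hot_encode` and the sample interests are hypothetical and are not taken from the Spark After Dark code; the idea is only that each categorical value becomes a 0/1 vector with a single 1.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector containing a single 1."""
    # Sort the distinct values so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # mark the column for this value
        vectors.append(vector)
    return categories, vectors


# Hypothetical "interest" attribute from four user profiles.
categories, vectors = one_hot_encode(["hiking", "cooking", "hiking", "movies"])
for row in vectors:
    print(row)
```

Encoding categories this way avoids implying an ordering between them, which matters for distance-based or linear models such as the clustering and regression algorithms listed above.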

Reminder that we've Docker-ized everything for you to take home.

Here are the GitHub repo and Docker Hub Registry links:

  1. https://github.com/fluxcapacitor

  2. https://registry.hub.docker.com/repos/fluxcapacitor/

Feel free to clone and contribute! Every contributor will be made a committer.

Bonus: Free 30-day Trial @ www.databricks.com

Databricks Cloud Notebook-based Development and Cluster Management.

Thanks, Databricks!

See everyone soon!

AI Performance Engineering Meetup (San Francisco, Global)
Base CRM
850 N. Shoreline · Mountain View, CA