
Details

Chris Fregly, who recently left Databricks (the company founded by Spark's creators) to join the IBM Spark Technology Center in San Francisco, will present a real-world, open source, advanced analytics and machine learning pipeline using all 20 open source technologies listed below.

This Meetup is based on Chris's recent "Top-5" Hadoop Summit/Data Science talk, "Spark After Dark". Spark After Dark is a mock online dating site that uses Spark, Spark SQL, DataFrames, MLlib, GraphX, Cassandra, and ElasticSearch, among many other technologies listed below, to generate quality, real-time dating recommendations for its users.

Here are the Spark After Dark slides: http://www.slideshare.net/cfregly/spark-after-dark-real-time-advanced-analytics-and-machine-learning-with-spark

All code, along with the entire pipeline runtime, will be Dockerized and made publicly available on GitHub and the Docker Hub Registry.

Technologies to be demo'd:

  1. Apache Zeppelin (notebook-based development)
  2. Apache Spark SQL/DataFrames (Data Analysis and ETL)
  3. Apache Spark Streaming + Apache Kafka (Real-time Collection of Live Data from Interactive Demo)
  4. Spark Streaming + Real-time Machine Learning (K-Means Clustering, Logistic/Linear Regression)
  5. Apache Spark MLlib + GraphX (Generate personalized and non-personalized recommendations using various algorithms and feature engineering techniques, including one-hot encoding)
  6. MLlib + PMML Integration (Open Standard Markup Language for Predictive Models)
  7. Highly scalable, NetflixOSS-based Machine Learning Prediction Serving Layer, including Service Discovery (Eureka) and Circuit Breakers (Hystrix) for Fault Tolerance
  8. Zeppelin + Python-based scikit-learn Machine Learning
  9. Spark + Neo4j = MazeRunner (Real-time Neo4j Graph Updates Beyond GraphX Batch Analytics)
  10. SparkR (Distributed R algorithms)
  11. Apache Spark JDBC/ODBC Thrift Server (Beeline and Tableau Analytics Explorer Integration)
  12. Tachyon (Off-heap storage)
  13. Spark Job Server (REST API for managing Spark jobs)
  14. Spark + Cassandra (NoSQL, Lambda Arch Speed Layer)
  15. Spark + ElasticSearch (Distributed Search Engine)
  16. Spark + Redis (Distributed, Persistent Key-Value Store Similar to Memcached)
  17. Logstash (Log Agent + Collection)
  18. Kibana (ElasticSearch-based Analytics Explorer UI)
  19. HDFS + Parquet (Columnar Storage Format, Tight Compression, Lightning Fast Columnar Aggregations)
  20. Advanced visualizations within Zeppelin using Python-based matplotlib and ggplot
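
For anyone unfamiliar with the one-hot encoding mentioned in item 5, here is a minimal sketch in plain Python. It is not taken from the Spark After Dark code and does not use MLlib; the function name `one_hot_encode` and the sample "interests" values are made up purely for illustration.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector containing a single 1."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # flag the column for this value's category
        encoded.append(row)
    return categories, encoded


# Hypothetical profile attribute for a mock dating site.
interests = ["hiking", "movies", "hiking", "cooking"]
categories, vectors = one_hot_encode(interests)

print(categories)
for interest, vector in zip(interests, vectors):
    print(interest, vector)
```

Each distinct value gets its own column, so a recommendation model can treat categories as numeric features without implying any ordering among them.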

Reminder: we'll be Dockerizing everything for you to reuse.

Keep an eye on the GitHub and Docker Hub Registry links under the project name "fluxcapacitor":

  1. https://github.com/fluxcapacitor
  2. https://registry.hub.docker.com/repos/fluxcapacitor/
