Real-time Aggregation, Approximation, Similarities, and Recommendations at Scale


Details
Agenda
Live, Interactive Recommendations Demo - NiFi, Kafka, Stanford CoreNLP, Docker, Word2Vec, LDA, Twitter Algebird, Spark Streaming, SQL, ML, GraphX.
Deep Dive (advancedspark.com)
Types of Similarity - Euclidean vs. Non-Euclidean Similarity, Jaccard Similarity, Cosine Similarity, LogLikelihood Similarity, Edit Distance
Text-based Similarities and Analytics - Word2Vec, LDA Topic Extraction, TextRank
Similarity-based Recommendations - User-to-User, Content-based, Item-to-Item (Amazon), Collaborative-based, User-to-Item (Netflix), Graph-based, Item-to-Item "Pathways" (Spotify)
Aggregations, Approximations, and Similarities at Scale - Twitter Algebird, MinHash and Bucketing, Locality Sensitive Hashing (LSH), BloomFilters, CountMin Sketch, HyperLogLog
Q & A
Bio
Chris Fregly is a Principal Data Solutions Engineer for the newly formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark (http://advancedspark.com/).
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Related Links
https://github.com/fluxcapacitor/pipeline/wiki
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
