
Real-time Aggregations, Approximations, Similarities, and Recommendations

Hosted By
Donna F.
Details

Hi. This event follows the IBM Spark PoT on the same day. You must RSVP here separately for the evening Meetup, but you do not need to register for the PoT if you are ONLY attending the Meetup. See you there! -d

Title

Real-time Aggregations, Approximations, Similarities, and Recommendations at Scale using Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird

Agenda

Intro

Live, Interactive Recommendations Demo

Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird

(http://advancedspark.com/)

Types of Similarity

Euclidean vs. Non-Euclidean Similarity

Jaccard Similarity

Cosine Similarity

Log-Likelihood Similarity

Edit Distance

Text-based Similarities and Analytics

Word2Vec

LDA Topic Extraction

TextRank

Similarity-based Recommendations

User-to-User

Content-based, Item-to-Item (Amazon)

Collaborative-based, User-to-Item (Netflix)

Graph-based, Item-to-Item "Pathways" (Spotify)

Aggregations, Approximations, and Similarities at Scale

Twitter Algebird

MinHash and Bucketing

Locality Sensitive Hashing (LSH)

Bloom Filters

Count-Min Sketch

HyperLogLog

Q & A

Bio

Chris Fregly is a Principal Data Solutions Engineer for the newly formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.

Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book Advanced Spark (http://advancedspark.com/).

Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.

When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.

Related Links

https://github.com/fluxcapacitor/pipeline/wiki

http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf

http://static.echonest.com/BoilTheFrog/

http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf

http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/

http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf

Washington DC Area Apache Spark Interactive
IBM-TEC
8401 Greensboro Drive, 1st Floor · McLean, VA