Druid NYC Meetup @ Tumblr (w/ Datacouncil.ai & AI / ML Streaming Data Platforms)
This is a collaboration event with:
datacouncil.ai NYC Meetup Group &
AI / ML Streaming Data Platforms NYC
*** Presentations ***
Talk 1: Personalized Related Blog Recommendation with User Feedback
Speaker/Bio: Zhisheng Li, Principal Tech Lead. Zhisheng Li currently leads the Recommendations team, which is responsible for Tumblr's near-real-time and offline recommendation systems.
Tumblr, a popular microblogging platform, hosts hundreds of millions of blogs. As on other social networking sites, a major user behavior on Tumblr is following interesting blogs so that their content appears directly on the user's dashboard. Blog recommendation has proven to be an effective way to help users find blogs to follow, and it contributes 50% of our daily blog follows. Related blog recommendation, which recommends the most relevant blogs immediately after a user follows a specific blog, is the best performer among all Tumblr recommendation techniques. However, the old related blog recommendation system was not personalized and could not provide the best user experience. In this work, we propose innovative approaches that leverage user feedback to adjust the rankings of related blogs, improving their relevance and freshness for each user. A/B test results show that this personalized approach significantly increased both the daily follow count and the follow rate for related blogs. We have launched the personalized related blog recommendation system in production. To date, it brings 2 million daily blog follows at a 17% blog follow rate.
Talk 2: Schema Management and Real-time Enrichment with Kafka
Speaker/Bio: Max McKittrick is a data engineer at Capital One, where he works on the company's enterprise clickstream application, applying DevOps best practices to real-time stream processing. Prior to joining Capital One, he completed his MS in information science at the University of Illinois, where he worked as an NLP researcher and consultant and was later selected as an Insight Data Engineering fellow in summer 2017. In his spare time, he enjoys analyzing data in R and playing modular synthesizers.
At Capital One, engineers on the Enterprise Customer Intelligence team maintain a clickstream application that serves the entire company. Kafka is an important part of this application, and messages must be enriched before they are consumed by other internal teams. In this talk, I will discuss the challenges and lessons learned in developing real-time enrichments.
Talk 3: Inside Apache Druid: Designed for Performance
Speaker/Bio: Gian Merlino is a co-founder of Imply, a San Francisco-based technology company, and a committer on Apache Druid. Previously, Gian led the data ingestion team at Metamarkets (now a part of Snapchat) and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.
A technical talk: Apache Druid is a modern analytical database that implements a memory-mappable storage format, indexes, compression, late tuple materialization, and a query engine that can operate directly on compressed data. A patch that adds vectorized processing is also out, and we can expect it to show up in a future release. This talk goes into detail on how Druid's query processing layer works and how each component contributes to achieving top performance for analytical queries.
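For readers new to these ideas, the toy sketch below illustrates three of the techniques named in the abstract: dictionary encoding as a simple form of compression, a bitmap index per distinct value, and late materialization, where the metric column is read only for rows that pass the filter. It is not Druid's code or storage format. The function names, the use of plain Python integers as bitmaps, and the sample data are all assumptions made for illustration.

```python
def build_column(values):
    """Dictionary-encode a string column and build one bitmap per distinct value."""
    dictionary = sorted(set(values))
    id_of = {value: i for i, value in enumerate(dictionary)}
    encoded = [id_of[value] for value in values]
    bitmaps = [0] * len(dictionary)
    for row, value_id in enumerate(encoded):
        bitmaps[value_id] |= 1 << row  # bit `row` is set if the row holds this value
    return dictionary, encoded, bitmaps


def rows_matching(dictionary, bitmaps, wanted):
    """OR together the bitmaps of the wanted values; no row data is scanned."""
    result = 0
    for value_id, value in enumerate(dictionary):
        if value in wanted:
            result |= bitmaps[value_id]
    return result


def sum_where(metric, row_bitmap):
    """Late materialization: read the metric column only for selected rows."""
    total = 0
    row = 0
    while row_bitmap:
        if row_bitmap & 1:
            total += metric[row]
        row_bitmap >>= 1
        row += 1
    return total


# Hypothetical sample data: one dimension column and one metric column.
country = ["US", "FR", "US", "DE", "FR", "US"]
clicks = [3, 5, 2, 7, 1, 4]

dictionary, encoded, bitmaps = build_column(country)
selected = rows_matching(dictionary, bitmaps, {"US", "DE"})
total = sum_where(clicks, selected)
print(total)
```

The filter here is resolved entirely against the dictionary and the bitmaps, so the string values are never decoded row by row. A production system would use compressed bitmaps and columnar buffers rather than Python integers and lists.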