Big Data Application Meetup 04/27


Details
Shoutout to Ampool (http://www.ampool.io/) for sponsoring this meetup!
AGENDA
6:00 - 6:30 - Socialize over food and beer(s)
6:30 - 7:30 - Talks
TALKS
Talk #1 Introducing Pachyderm, by Joe Doliner from Pachyderm
Talk #2 Leveraging Big Data at TubeMogul to convert Events -> Insights -> Actions, by Murtaza Doctor and John Trenkle from TubeMogul
Please note: Shivram Mani's talk on "Unified access framework for distributed data system on HDFS" has been postponed to the next Big Data App meetup (https://www.meetup.com/BigDataApps/events/229659115/).
ABSTRACTS
Talk #1 Introducing Pachyderm, by Joe Doliner from Pachyderm
Pachyderm is a big data analytics platform deployed with Kubernetes and Docker. Pachyderm is inspired by the Hadoop ecosystem but shares no code with it. Instead, we leverage the container ecosystem to provide the broad functionality of Hadoop with the ease of use of Docker.
In this talk, we’ll show you how you can build streaming data workflows.
There are two bold new ideas in Pachyderm:
• Containers as the core primitive for computation -- which means each stage in your workflow can be written using any languages or libraries you want.
• Version Control for data -- view diffs of your data and incrementally process only the new data as it streams in.
These ideas lead directly to a system that's much more powerful, flexible, and easy to use. Pachyderm is open source, so check it out on GitHub.
Talk #2 Leveraging Big Data at TubeMogul to convert Events -> Insights -> Actions, by Murtaza Doctor and John Trenkle from TubeMogul
TubeMogul is a leader in digital advertising, delivering our clients' creative content to desktops, mobile phones, programmatic TV and, ultimately, any device that can show engaging ads to users. Over the course of 10 years, the scale of data flowing through our RTB (Real-Time Bidding) system has increased exponentially. As this flow has increased, our data ecosystem has evolved to handle the collection and ETL of this data for billing clients and for fueling Optimization, Machine Learning, and Analytics. In this talk we'll discuss the path we've followed, which has employed Hadoop, Hive, Spark and Presto, as well as Cascading and other variations, to fulfill specific functions of our system. We'll talk about specific use cases in our platform and will end with a hint of the directions in which this trajectory is taking us.
SPEAKER BIOS
• Joe Doliner is the founder and CEO of Pachyderm and an open source aficionado who has been building and running data infrastructure his entire career. Before Pachyderm, he was the first employee and lead engineer at RethinkDB and also did a stint running the Hadoop cluster at Airbnb. There he gained an appreciation for the vast collaboration and dependency management problems that still plague modern data-driven enterprises. He founded Pachyderm in 2014 to solve these issues.
• Murtaza Doctor is currently the Director of Engineering at TubeMogul, working on the RTB platform. His interests include large-scale distributed systems in the advertising and personalization domains.
• John Trenkle, as Chief Scientist, is responsible for leading TubeMogul's Machine Learning and Data Science teams. Under John's leadership, these teams streamline the company's big data pipeline by developing and implementing algorithms that power Campaign Optimization, Audience Segmentation, Auction-level Fraud Detection and Mitigation, Inventory Forecasting and Cross-Device Syncing.
ARRIVAL AND PARKING
Cask HQ is a few minutes' walk from the California Avenue Caltrain Station.
Also, Cask HQ has its own parking lot, but it will certainly not accommodate all guests. Please use parking lots available nearby:
http://photos2.meetupstatic.com/photos/event/5/b/2/f/600_438983343.jpeg
