Our May meetup is kindly hosted by MindCandy.com in their London office and sponsored by Hortonworks. We'll have refreshments and food provided as usual and will be recording the talks to put online afterwards.
We've got three great talks for the evening:
Fast > Perfect: Practical real-time approximations using Spark Streaming
By Kevin Schmidt (Head of Data Science at Mind Candy) and Luis Vicente (Senior Data Engineer at Mind Candy)
For mobile games, constant tweaks are the difference between success and failure. Data and analytics have to be available in real time, but calculating, for example, the uniqueness or newness of a data point requires a list of all data points seen so far, which is both memory intensive and tricky with real-time stream processing such as Spark Streaming. Probabilistic data structures allow these properties to be approximated with a fixed memory representation, and are very well suited to this kind of stream processing. Getting from the theory of approximation to a useful metric at a low error rate, even for many millions of users, is another story. In our talk we will look at practical ways of achieving this: which approximations we used for a selection of useful metrics, why we picked a specific probabilistic data structure, how we stored it in Cassandra as a time series and how we implemented it in Spark Streaming.
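To give a feel for the idea before the talk, here is a minimal sketch of one such structure. The abstract does not say which probabilistic data structure the speakers chose, so HyperLogLog is used purely as an assumption, and the class name, the parameter p and the SHA-1 based hashing are illustrative choices rather than anything taken from the Mind Candy implementation. It estimates the number of distinct items seen using a fixed number of small registers, whatever the size of the stream.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative fixed-memory estimator of the number of distinct items."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers; the alpha formula below
        # is only valid for m >= 128, hence the lower bound on p.
        assert 7 <= p <= 16
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take a 64-bit hash of the item (first 8 bytes of SHA-1).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                    # top p bits select a register
        rest = h & ((1 << width) - 1)       # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits
        # (width + 1 if they are all zero).
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise maximum yields the sketch of the union of both streams.
        assert self.p == other.p
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    sketch = HyperLogLog()
    for i in range(100000):
        sketch.add("user-%d" % (i % 25000))   # 25,000 distinct ids, repeated
    print(round(sketch.count()))
```

With p = 12 the sketch keeps 4,096 registers and has a typical relative error of around 1.6%. Because two sketches merge by taking the register-wise maximum, per-batch or per-time-bucket sketches can be stored separately and combined later, which is one plausible reason such a structure fits stream processing and time-series storage.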
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation, from multi-purpose YARN to interactive SQL with Hive/Tez to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive, effective use of SSDs and memory in HDFS, or metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
Rolling updates with Hadoop
By Sanjay Radia (Founder of Hortonworks)
Customers use Hadoop for critical enterprise applications where Hadoop services and the cluster must continue to function while the cluster software is upgraded in a rolling fashion. This talk describes our work towards an enterprise-quality rolling upgrade solution where the cluster software is upgraded with actively running applications and services. Hadoop is a complex distributed system with interdependencies and we had to enhance several components.