Here's an agenda that should get you warmed up for Hadoop Summit! Note that all NL-HUG members receive a 10% discount on Hadoop Summit registration using our promo code.
The location is meeting room St. John in Hotel Krasnapolsky.
Afternoon Apache Drill training (separate registration)
17.30 - 18.20 Drinks, food, socialising
18.20 - 18.30 Welcome - Evert Lammerts
18.30 - 19.00 A Hadoop ecosystem outlook - Arun Murthy
19.10 - 19.40 Graph Processing at Twitter - Jimmy Lin
19.50 - 20.20 Large-scale Forensic Science - Erwin van Eijk
20.30 - 21.00 Practical Real-time Learning - Ted Dunning
About the talks & the speakers
Arun Murthy with a Hadoop ecosystem outlook
This talk will kick off the evening with an overview of the developments around Apache Hadoop. Arun will discuss Tez (faster MapReduce), Stinger (100x faster Hive), and Knox (the Hadoop security gateway).
Arun is a founder of Hortonworks
Jimmy Lin on Graph Processing at Twitter
A Twitter user's local neighborhood in the interest graph provides a rich source of information for applications such as link prediction, interest modeling, and personalization. As with other social networks, the Twitter graph is simultaneously sparse and dominated by short paths between arbitrary vertices. In this talk, I'll discuss some of the systems we use for working with the Twitter graph: MySQL for real-time manipulation on the front end and an open source, in-memory compact graph storage and analysis engine we built called Cassovary. At Twitter, Cassovary forms the bottom layer of a stack that we use to power many of our graph-based features, including "Who to Follow" and "Similar to", and also personalization for relevance ranking in Twitter search.
Jimmy is an associate professor in the iSchool at the University of Maryland
Erwin van Eijk on Large-scale Forensic Science
The Netherlands Forensic Institute deals with data extracted from devices that have been seized during a criminal procedure. The number of devices and the amount of data they hold can be quite high - on average some 4TB, with some being as high as 140TB. Still the investigators need to be able to access the data as soon as possible, which means a delivery time measured in minutes, not hours.
Using an amalgam of distributed systems, we aim to analyse more than 1TB/hour. We will dive into some of the data particularities, as well as the impact they have on performance and on suitability for the general Hadoop way of doing things.
In summary, this talk will present some of the challenges encountered when dealing with a heterogeneous dataset, and will also present measurements comparing two different cluster setups to show the pros and cons of each.
Erwin is a forensic scientist at the Netherlands Forensic Institute
Ted Dunning on Practical Real-time Learning
This talk will describe how real-time learning can be used for advanced A/B testing as well as a variety of advertising and document targeting problems. The crux of these applications is the Bayesian Bandit algorithm. This algorithm is simple but provides state-of-the-art performance. This talk will be intuitive and practical, but not simple-minded. It will include interactive demonstrations of state-of-the-art algorithms for real-time learning and will describe the architecture required to implement these algorithms.
Ted is Chief Application Architect at MapR
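For readers who want a feel for the idea before the talk, here is a minimal sketch of a Bayesian bandit. It assumes the term refers to Thompson sampling with a Beta prior over each option's success rate, which is a common reading but not something the abstract confirms. The class name, the `simulate` helper, and the example conversion rates are all hypothetical and are not taken from the talk.

```python
import random


class BayesianBandit:
    """Beta-Bernoulli Thompson sampling over a fixed set of arms (illustrative only)."""

    def __init__(self, n_arms, rng=None):
        self.rng = rng if rng is not None else random.Random()
        # Start every arm with a uniform Beta(1, 1) prior on its success rate.
        self.successes = [1] * n_arms
        self.failures = [1] * n_arms

    def choose(self):
        # Draw one plausible success rate per arm from its posterior,
        # then play the arm whose draw is highest.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # A success or failure simply increments the matching Beta parameter.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against made-up arms and count how often each is played."""
    rng = random.Random(seed)
    bandit = BayesianBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.choose()
        reward = rng.random() < true_rates[arm]
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical A/B/C test with three variants of differing quality.
    print(simulate([0.2, 0.5, 0.8], rounds=2000))
```

In an A/B testing setting each arm would be a page variant and each reward a conversion. Because arms are chosen by sampling from the posterior, traffic shifts toward the better variants as evidence accumulates, while weaker variants still receive occasional exploratory plays.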