Pre-Hadoop Summit meetup, Brussels


58 people went

Details

NOTE: This meetup is in Brussels!

For the occasion of the Hadoop Summit in Brussels, the NL-HUG will host a one-off event in Belgium. Since this is a different setting from the usual Amsterdam meetups, talks may overlap somewhat with previous events.

Agenda:

05.30 - 06.15: Open reception at Panoramic Hall, Level 5 at The Square Centre

06.30 - 09.00: Talks + Q&A at Studio 213, Level 2 at The Square Centre

We will have three talks:

• Hadoop – Successes, Mistakes, and Vision, by Owen O’Malley, Co-founder of Hortonworks and senior architect

I’ve been working on Hadoop since January 2006, before it was called Hadoop. Along the way, we’ve made a lot of good decisions, but we’ve also made a lot of mistakes. My talk will cover some of each and the lessons that we’ve learned as a result. Finally, I’ll share my vision of where things are going in the Hadoop ecosystem.

• Rapid Prototyping of Online Machine Learning with Divolte Collector, by Friso van Vollenhoven, CTO at GoDataDriven

Divolte Collector ( http://divolte.io ) is a scalable, open-source clickstream collector for Hadoop and Kafka. Traditionally, organisations rely either on SaaS providers for clickstream tracking or on parsing server log files. Divolte Collector aims to be a better solution: on the front-end it offers Google Analytics-style deployment using a single line of JavaScript, and on the back-end it provides structured event data as Apache Avro records, appended to HDFS files and pushed onto Apache Kafka topics. In this talk we look at building an online machine learning application for web optimisation on top of the Divolte Collector stack. We use the popular Bayesian bandit approach to multi-armed bandit optimisation, combined with an evolutionary method for inventory selection, implemented as a scalable solution with minimal code.
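The Bayesian bandit approach mentioned above is commonly implemented as Thompson sampling with Beta priors. The following minimal Python sketch is illustrative only and is not taken from the talk; the variant names and click-through rates are assumptions for the simulation:

```python
import random

def thompson_sample(stats):
    """Pick the variant with the highest draw from Beta(successes + 1, failures + 1)."""
    best, best_draw = None, -1.0
    for variant, (successes, failures) in stats.items():
        draw = random.betavariate(successes + 1, failures + 1)
        if draw > best_draw:
            best, best_draw = variant, draw
    return best

# Simulate clickstream feedback for two hypothetical banner variants.
true_ctr = {"banner_a": 0.05, "banner_b": 0.10}  # assumed click-through rates
stats = {v: (0, 0) for v in true_ctr}            # (successes, failures) per variant

random.seed(42)
for _ in range(5000):
    v = thompson_sample(stats)
    clicked = random.random() < true_ctr[v]
    s, f = stats[v]
    stats[v] = (s + 1, f) if clicked else (s, f + 1)

# After enough feedback, most traffic is routed to the better variant.
shown = {v: s + f for v, (s, f) in stats.items()}
print(shown)
```

The appeal of this scheme for online optimisation is that it balances exploration and exploitation automatically: uncertain variants are still sampled occasionally, while clearly better ones receive most of the traffic.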

• Apache Flink, by Stephan Ewen, committer on Apache Flink and co-founder at Data Artisans

Apache Flink is a distributed data processing system that aims to unify batch and streaming data analysis at the engine level.

Flink offers powerful programming APIs in Java and Scala, based on distributed data sets and data streams, with a wide set of transformations and flexible window definitions, as well as higher-level APIs for specific use cases.
Flink backs these APIs with a robust and unique execution backend. Both batch and streaming APIs are backed by the same execution engine that has true streaming capabilities, resulting in true real-time stream processing and latency reduction in many batch programs. Flink implements its own memory manager and custom data processing algorithms inside the JVM, which makes the system behave very robustly both in-memory and under memory pressure. Flink has iterative processing built-in, implementing native iteration operators that create dataflows with feedback. Finally, Flink contains its own cost-based optimizer, type extraction, and data serialization stack.
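To illustrate the kind of window definitions mentioned above, here is a minimal, Flink-independent Python sketch of a tumbling (fixed, non-overlapping) event-time window; the event data and window size are assumptions, and a real Flink program would express this through its Java or Scala APIs instead:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_ms):
    """Group (timestamp_ms, key) events into fixed, non-overlapping windows
    and count occurrences per key within each window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_size_ms) * window_size_ms
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Hypothetical click events: (timestamp in ms, page).
events = [(100, "home"), (450, "home"), (999, "about"),
          (1200, "home"), (1800, "about"), (2100, "home")]
print(tumbling_window_counts(events, 1000))
# → {0: {'home': 2, 'about': 1}, 1000: {'home': 1, 'about': 1}, 2000: {'home': 1}}
```

A streaming engine does the same grouping continuously and incrementally, emitting each window's result as soon as the window closes rather than collecting all events first.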

The end result is a platform that is fast, easy to program against, unifies batch and stream processing without compromising on latency or throughput, requires very little tuning to sustain data-intensive workloads, and solves many of the problems of heavy data processing inside the JVM. The Flink project has recently been expanding to include more higher-level modules that build on top of the engine, such as a new graph processing library, a machine learning library, and a SQL-like interface for programming against tables rather than typed collections. Flink is integrated with the open source ecosystem, including Apache Hadoop (input/output formats, MapReduce API compatibility, and YARN integration), Apache Kafka, Apache Tez, Apache SAMOA, and more. Flink is also integrated in the Google Cloud Platform. Flink is a top-level Apache project with more than 85 contributors from industry and academia.

This talk gives an overview of Flink from a user perspective for both batch and stream processing, covers the most important features of Flink’s runtime and their operational benefits, and presents a roadmap for the project.