Past Meetup

Offline and real-time click stream processing + lightning talks

89 people went

Details

Time to meet up and talk about Hadoop-related topics again! As usual, food and drinks will be provided. Cupenya (http://cupenya.com/) has kindly offered to host us at Rockstart Spaces in Amsterdam.

Agenda:

• 18.00: eat, drink, socialize

• 19.00: First talk by: Friso van Vollenhoven, CTO at GoDataDriven and Andrew Snare, Big Data hacker at GoDataDriven

Announcing Divolte Collector: scalable click stream data collection for Hadoop and real-time processing

Divolte Collector is a solution for collecting high-volume click event data and storing it directly on HDFS in Avro files, compatible with tools like Hive, Impala, MapReduce and Spark. During collection, we parse domain-specific identifiers out of the URL structure (e.g. product IDs, page types), add rich user agent and IP2geo information on the fly, and perform stateless sessionizing based on client-side cookies. The resulting Avro files are directly usable for analytics and machine learning tasks: no ETL required, no log file parsing required.
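
To make the two ideas above concrete, here is a minimal Python sketch of pulling identifiers out of a URL and of cookie-based stateless sessionizing. It is not Divolte Collector's code or configuration: the URL layout (/products/<id>, /category/<name>), the 30-minute timeout, the cookie fields and the function names are all assumptions made up for illustration.

```python
import re
import uuid
from urllib.parse import urlparse

# Hypothetical URL layout, assumed for this example only.
PRODUCT_RE = re.compile(r"^/products/(?P<product_id>\d+)/?$")
CATEGORY_RE = re.compile(r"^/category/(?P<category>[\w-]+)/?$")

# Assumed inactivity timeout after which a new session starts.
SESSION_TIMEOUT_SECONDS = 30 * 60


def parse_page(url):
    """Map a page URL to a page type plus any identifiers found in its path."""
    path = urlparse(url).path
    match = PRODUCT_RE.match(path)
    if match:
        return {"page_type": "product", "product_id": match.group("product_id")}
    match = CATEGORY_RE.match(path)
    if match:
        return {"page_type": "category", "category": match.group("category")}
    return {"page_type": "other"}


def sessionize(cookie, now):
    """Return (session_id, updated_cookie) for one incoming event.

    The server keeps no session table. Everything it needs arrives in the
    client-side cookie, which is None on a first visit and otherwise a dict
    with 'session_id' and 'last_seen' (seconds since the epoch).
    """
    expired = cookie is None or now - cookie["last_seen"] > SESSION_TIMEOUT_SECONDS
    session_id = uuid.uuid4().hex if expired else cookie["session_id"]
    return session_id, {"session_id": session_id, "last_seen": now}
```

Because the session state travels with each request, any collector instance can handle any event, which is what makes this style of sessionizing easy to scale horizontally.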

In addition, we push click events onto Kafka queues as they happen, making the event stream available for near-real-time processing with frameworks like Spark Streaming or Storm.
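
As a rough idea of what such near-real-time processing might compute, the sketch below counts page types per fixed time window over an ordered stream of events. It is plain Python standing in for a Spark Streaming or Storm job and does not talk to Kafka; the event shape (timestamp, page_type) and the function name are assumptions for illustration.

```python
from collections import Counter


def count_per_window(events, window_seconds=60):
    """Yield (window_start, Counter of page types) for each time window.

    `events` is an iterable of (timestamp_seconds, page_type) tuples that is
    assumed to arrive in timestamp order, as a stand-in for a live stream.
    """
    current_start = None
    counts = Counter()
    for timestamp, page_type in events:
        # Align the event to the start of the window it falls into.
        start = int(timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The window has closed: emit it and start a fresh one.
            yield current_start, counts
            current_start, counts = start, Counter()
        counts[page_type] += 1
    if current_start is not None:
        yield current_start, counts
```

A real streaming framework would additionally handle late or out-of-order events and distribute the work across machines; this sketch ignores both.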

At GoDataDriven, we work a lot with click event data from our customers' websites. Usually, this data is collected by downloading datasets from services like Google Analytics or Omniture tracking, combined with web server log files and custom events from AJAX calls and whatnot. Before the data is in a usable shape, we'd typically need to parse several sources of log files, each with its own format and caveats. We developed Divolte Collector as an answer to this problem.

Now, we are releasing Divolte Collector as open source software (Apache 2.0 license), because we decided life's too short for log file parsing.

• 19.45: short break

• 20.00: Second talk, by YOU!

Lightning talks and discussion: Hadoop experiences from the field

Because the second speaker we had invited for the meetup could not make it on this date, we will open up the second talk slot for lightning talks or short discussions based on questions from everyone who joins us for the meetup. We will set up a flip chart to draw up a last-minute schedule of talk and discussion ideas from the audience. Please have your input ready! All ideas are welcome!

• 20.45: socialize and drink more

• ??.??: doors close