• 17.45: eat, drink, socialize
• 18.15: First talk by: Friso van Vollenhoven, CTO at GoDataDriven and Andrew Snare, Big Data hacker at GoDataDriven
Announcing Divolte Collector: scalable click stream data collection for Hadoop and real-time processing
Divolte Collector is a solution for collecting high-volume click event data and storing it directly on HDFS in Avro files, compatible with tools like Hive, Impala, MapReduce and Spark. During collection, we parse out domain-specific identifiers from the URL structure (e.g. product IDs, page types), add rich user agent and IP2geo information on the fly, and perform stateless sessionizing based on client-side cookies. The resulting Avro files are directly usable for analytics and machine learning tasks; no ETL required, no log file parsing required.
In addition, we push out click events onto Kafka queues as they happen, making the event stream available for near real-time processing using frameworks like Spark Streaming or Storm.
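To give a feel for the URL parsing and cookie-based sessionizing mentioned above, here is a minimal Python sketch. It is not Divolte Collector code: the URL patterns, the cookie fields (`session_id`, `last_seen`), the 30-minute timeout and the function names are all assumptions made up for illustration.

```python
import re
import time
import uuid
from urllib.parse import urlparse

# Hypothetical URL layout for an example web shop (an assumption,
# not Divolte Collector's actual mapping).
PAGE_PATTERNS = [
    ("product", re.compile(r"^/products/(?P<product_id>\d+)/?$")),
    ("category", re.compile(r"^/category/(?P<category>[\w-]+)/?$")),
    ("home", re.compile(r"^/$")),
]

SESSION_TIMEOUT_SECONDS = 30 * 60  # assumed inactivity timeout


def parse_click(url):
    """Return the page type and any identifiers found in the URL path."""
    path = urlparse(url).path or "/"
    for page_type, pattern in PAGE_PATTERNS:
        match = pattern.match(path)
        if match:
            return {"page_type": page_type, **match.groupdict()}
    return {"page_type": "unknown"}


def sessionize(cookie, now=None):
    """Pick the session from the cookie contents alone (no server-side state).

    `cookie` is a dict with optional 'session_id' and 'last_seen'
    (epoch seconds). Returns (session_id, updated_cookie).
    """
    now = time.time() if now is None else now
    session_id = cookie.get("session_id")
    last_seen = cookie.get("last_seen")
    expired = last_seen is None or now - last_seen > SESSION_TIMEOUT_SECONDS
    if session_id is None or expired:
        # Start a new session when there is none or the old one timed out.
        session_id = uuid.uuid4().hex
    return session_id, {"session_id": session_id, "last_seen": now}
```

Because everything needed to decide the session travels in the cookie, any collector node can handle any request without a shared session store, which is the point of doing it statelessly.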
At GoDataDriven, we work a lot on click event data from our customers' websites. Usually, this data is collected by downloading datasets from services like Google Analytics or Omniture tracking, combined with web server log files and custom events from AJAX calls and whatnot. Before data is in a usable shape, we'd typically need to parse several sources of log files with different formats and caveats. We developed Divolte Collector as an answer to this problem.
Now, we are releasing Divolte Collector as open source software (Apache 2.0 license), because we decided life's too short for log file parsing.
• 19.00: short break (moar pizza/beer)
• 19.15: Lightning session
If you have an idea for an interesting Hadoop/data-related lightning talk, there will be a board where you can sign up for a ~5-minute presentation to share your successes, failures, or knowledge. Please join us on the stage!
• 20.00: Moar beer.