Join us at the 11th Apache Flink (https://flink.apache.org/) Meetup. Drinks and sandwiches are sponsored by data Artisans (http://data-artisans.com/).
Talks
- Flink Community Update
By Robert Metzger
- Tracking the Trackers with Apache Flink
By Sebastian Schelter
This talk presents work in progress on identifying web trackers (e.g. Google Analytics, Facebook buttons) in the CommonCrawl 2012 web corpus. We describe how we scanned more than 3.5 billion HTML pages for online trackers and how we use Flink to analyze the resulting tracking graph. We will present preliminary results on the distribution of Google Analytics and other trackers on the web and show how the dominant tracking companies differ per top-level domain.
- Cascading on Apache Flink
By Fabian Hueske
Cascading is a popular framework for developing, maintaining, and executing large-scale, robust batch data analysis applications. Originally, Cascading flows were compiled into Apache Hadoop MapReduce programs. With the recent 3.0 release, Cascading added an extensible rule-based planner and support for Apache Tez as a runtime back-end. Apache Flink’s execution engine features low-latency pipelined and scalable batched data transfers, as well as high-performance in-memory operators for sorting and joining that gracefully go out-of-core when memory is scarce. With its native support for Hadoop YARN, Flink is another attractive runtime back-end for Cascading.
This talk introduces the Cascading Connector for Apache Flink. The connector translates Cascading flows into Apache Flink programs. Cascading flows executed using the Flink connector benefit from Flink’s runtime features such as its pipelined data shuffles and its efficient and robust in-memory operators. The talk describes the integration of Cascading and Flink, highlights its features, and points out its current limitations.