
Flink Meetup #11 Cascading on Flink & Tracking the Trackers with Flink

Hosted By
Robert M.

Details


Join us at the 11th Apache Flink (https://flink.apache.org/) Meetup. Drinks and sandwiches are sponsored by data Artisans (http://data-artisans.com/).

Talks



  1. Flink Community Update
    By Robert Metzger

  2. Tracking the Trackers with Apache Flink
    By Sebastian Schelter

This talk presents work in progress on identifying web trackers (e.g. Google Analytics, Facebook buttons) in the CommonCrawl 2012 web corpus. We describe how we scanned more than 3.5 billion HTML pages for online trackers and how we use Flink to analyze the resulting tracking graph. We will present preliminary results on the distribution of Google Analytics and other trackers on the web and show how the dominant tracking companies differ per top-level domain.

  3. Cascading on Apache Flink
    By Fabian Hueske

Cascading is a popular framework to develop, maintain, and execute large-scale and robust batch data analysis applications. Originally, Cascading flows were compiled into Apache Hadoop MapReduce programs. With the recent 3.0 release, Cascading added an extensible rule-based planner and support for Apache Tez as a runtime back-end. Apache Flink’s execution engine features low-latency pipelined and scalable batched data transfers, as well as high-performance in-memory operators for sorting and joining that gracefully go out-of-core when memory is scarce. With its native support for Hadoop YARN, Flink is another attractive runtime back-end for Cascading.

This talk introduces the Cascading Connector for Apache Flink. The connector translates Cascading flows into Apache Flink programs. Cascading flows executed using the Flink connector benefit from Flink’s runtime features such as its pipelined data shuffles and its efficient and robust in-memory operators. The talk describes the integration of Cascading and Flink, highlights its features, and points out its current limitations.



---------------

Bring your data
After the talks, while having a drink, there's the opportunity to work together with Flink committers on an interesting data problem you're facing.
Please contact Kostas Tzoumas at kostas@data-artisans.com if you're interested in taking part in this!


About Sebastian



I’m currently a PhD student at the Database Systems and Information Management Group (DIMA) of TU Berlin with Prof. Volker Markl.

My research aims at improving the technology for performing large scale data analysis on parallel processing platforms. Use case-wise, my focus is on enabling Collaborative Filtering with billions of interactions and Graph Mining on graphs with billions of vertices and edges. I am also engaged in Open Source as a member of the Apache Software Foundation, where I’m a committer and PMC member in the Mahout, Giraph and Flink projects.
During my PhD, I interned at IBM Research Almaden and Twitter in California. After my upcoming graduation, I will join Amazon Berlin as a Machine Learning Scientist and Post-Doctoral Researcher.


About Fabian
Fabian Hueske is a PMC member of Apache Flink. He started working on this project as part of his PhD studies at TU Berlin in 2009. Fabian did internships with IBM Research, SAP Research, and Microsoft Research and is a co-founder of data Artisans, a Berlin-based start-up devoted to fostering Apache Flink. He frequently gives talks on Apache Flink at conferences and meetups. Fabian is interested in distributed data processing and query optimization.

---------------
Schedule

19:00 - 19:30: Sandwiches and Drinks

19:30 - 19:45: Flink Community Update (Robert Metzger)
19:45 - 20:45: Tracking the Trackers with Apache Flink (Sebastian Schelter)


20:45 - 21:00: Break

21:00 - 21:30: Cascading on Apache Flink (Fabian Hueske)
21:30 - End: Socializing and Drinks

Apache Flink Meetup Berlin
Betahaus Cafe
Prinzessinnenstrasse 19-20, 10969 Berlin