GOTO Night: Stream Processing with Apache Flink and Mining Github

This is a past event

61 people went

Venue/Host: Trifork
Costs: Free of charge
Speakers: Robert Metzger & Georgios Gousios
Pizza & refreshments included

18:00 Registration & Pizza
18:30 Short Intro
18:35 Georgios Gousios
19:30 Short break
19:45 Robert Metzger
20:45 Ending with beers

• • •
"Mining GitHub for Fun & Profit!" by Georgios Gousios

With over 30 million repositories and 10 million users, GitHub is currently the largest code hosting site in the world. Software engineering researchers have been drawn to GitHub due to this popularity, as well as its integrated social features and the metadata that can be accessed through its API. To make research with GitHub data approachable, we created the GHTorrent project, a scalable, off-line mirror of all data offered through the GitHub API. In our talk, we describe how we set up GHTorrent, how we built a community around it, what types of research it has been used for, and how Microsoft uses it to get insights from their OSS projects.

Georgios Gousios is an assistant professor at the Software Science department, Radboud University Nijmegen. He obtained his PhD in Software Engineering (software repository mining) at the Athens University of Economics and Business (AUEB). Before RU, he was a postdoctoral researcher at TU Delft (working on GitHub analytics) and a Scala hacker at GRNET (working on cloud infrastructures).

His research specialty is software analytics and large-scale empirical software engineering. He has worked in the fields of distributed software development processes, software quality, software testing, developer productivity assessment and research infrastructures. He co-edited the "Beautiful Architectures" book (O'Reilly, 2009). He is the main author of the GHTorrent data collection and curation framework and the Alitheia Core repository mining platform. Beyond hacking in any possible form, his interests include software engineering, software analytics and programming languages.

Twitter: @gousiosg
• • •
"Stream Processing with Apache Flink" by Robert Metzger

Data streaming is gaining popularity, as more and more organizations are realizing that the nature of their data production is continuous and unbounded, and can be better served with a streaming architecture. Streaming architectures promise decreased latency from signal to decision, a radically simplified data infrastructure architecture, and the ability to cope with new data that is generated continuously. Apache Flink is a full-featured true stream processing framework with:

- Easy-to-use Java- and Scala-embedded APIs that simplify stream analytics, yet provide powerful tools to deal with time and uncertainty
- Throughput close to a million events per second per core
- Latencies as low as the millisecond range
- Full support for event time and out-of-order arrivals with flexible windows, watermarks, and triggers
- Exactly-once consistency guarantees, and the ability to realize distributed transactional data movement between systems (e.g., between Kafka and HDFS)
- Ease of configuration and separation between application logic and fault tolerance via a novel asynchronous checkpointing algorithm
- No single point of failure
- Integration with popular open source infrastructure (e.g., Hadoop, HBase, Kafka, Cascading, Elasticsearch, …)
- Batch processing as a special case of stream processing, including dedicated libraries for machine learning and graph processing, managed memory on- and off-heap, and query optimization
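
To make the event-time, window, and watermark terminology above concrete, here is a minimal sketch in plain Python. It does not use Flink or its APIs; the function name `tumbling_window_counts` and its parameters are hypothetical, and the watermark rule (largest timestamp seen minus a fixed allowed out-of-orderness) is a simplified assumption rather than a description of Flink's internals.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per key in event-time tumbling windows.

    events: iterable of (event_time, key) in arrival order, possibly out of order.
    Returns (fired, dropped): fired is a list of (window_start, key, count) in the
    order windows are emitted; dropped lists events that arrived too late.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # window_start -> key -> count
    fired, dropped = [], []
    max_seen = float("-inf")
    watermark = float("-inf")

    for event_time, key in events:
        window_start = event_time - (event_time % window_size)
        # Late event: the watermark has already passed the end of its window.
        if window_start + window_size <= watermark:
            dropped.append((event_time, key))
            continue
        open_windows[window_start][key] += 1

        # The watermark trails the largest timestamp seen by the allowed lateness.
        max_seen = max(max_seen, event_time)
        watermark = max_seen - max_out_of_orderness

        # Emit every open window whose end is at or before the watermark.
        ready = sorted(w for w in open_windows if w + window_size <= watermark)
        for start in ready:
            for k, count in sorted(open_windows.pop(start).items()):
                fired.append((start, k, count))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        for k, count in sorted(open_windows[start].items()):
            fired.append((start, k, count))
    return fired, dropped


# The event at time 3 arrives after the one at time 12 but is still counted,
# because the watermark (12 - 5 = 7) has not yet passed the end of window [0, 10).
# The event at time 2 arrives after the watermark reached 20 and is dropped.
events = [(1, "a"), (4, "a"), (12, "b"), (3, "a"), (25, "a"), (2, "a")]
fired, dropped = tumbling_window_counts(events, window_size=10, max_out_of_orderness=5)
print(fired)    # [(0, 'a', 3), (10, 'b', 1), (20, 'a', 1)]
print(dropped)  # [(2, 'a')]
```

The point of the sketch is that results are grouped by when events happened rather than when they arrived, and that the watermark decides how long a window waits for stragglers before its result is emitted.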

Flink is used at several companies, including ResearchGate, Bouygues Telecom, the Otto Group, and Capital One, and has a large and active developer community of well over 140 contributors. In this talk, we provide an overview of the system internals and its streaming-first philosophy, as well as the programming APIs.

Robert Metzger is a PMC member at the Apache Flink project and a co-founder and software engineer at Data Artisans. He is the author of many Flink components, including the Kafka and YARN connectors. Robert studied Computer Science at TU Berlin and worked at IBM Germany and at the IBM Almaden Research Center in San Jose. He is a frequent speaker at conferences such as Hadoop Summit 2015 in San Jose and ApacheCon Big Data in Budapest, and at meetups in Europe and the US.

Twitter: @rmetzger_