There's so much going on in the world of data that it can be hard to keep up with what's happening in your own speciality area, let alone make connections with others who might have complementary skills or interests. This Meetup is intended to make it easier to stay informed and to make those connections. Its focus is on what people working with data in Wellington-based public, private, non-profit, and academic organisations are doing, what challenges they're experiencing, and what they need help with. It welcomes members who spend their days capturing, storing, manipulating, and analysing data, as well as those who use data generated by others for decision- and policy-making.
5:30 - 6:00 -- Drinks, nibbles and chatting
6:00 - 7:00 -- Neal Glew, talking about the Apache Beam Project (bio and abstract below)
7:00 - 7:30 -- Drinks, nibbles and chatting
Neal Glew is a software engineer in the Flume project at Google, where he mostly works on the shuffle system. He previously worked at Intel on parallel programming models within Intel Labs. He has a PhD in computer science from Cornell University and a BSc (Hons) in computer science from Victoria University of Wellington.
Apache Beam (https://beam.apache.org/) is an open-source project for writing big-data pipelines.
In the first part of this talk, I'll describe Beam from a non-technical perspective: what it is, why you would use it, and how it compares to other technologies in the big data space.
In the second part of the talk, I will give a high-level overview of the technical aspects of Beam. At its heart is a programming model that unifies batch and stream processing, allowing the programmer to separate the what, where, when, and how of processing. What processing is actually performed on the data? Where in event time is that processing done, that is, how are event times windowed? When in processing time are results materialised? How are updates to results (due, for example, to late data) combined? Beam also provides several SDKs that instantiate the model for particular languages. Currently Java and Python are available and Go is under development.
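As a rough illustration of the four questions, here is a minimal plain-Python sketch. It does not use the Beam SDK; the function names, the 60-second window size, and the sample data are all made up for illustration. It sums values (what) in fixed event-time windows (where), emits results after each arriving batch (when), and accumulates so that a late element refines an earlier result (how).

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size, in seconds of event time


def window_start(event_time, size=WINDOW_SECONDS):
    """Where: map an event timestamp to the start of its fixed window."""
    return event_time - (event_time % size)


def process(batches, size=WINDOW_SECONDS):
    """Consume batches of (event_time, value) pairs in arrival order.

    What:  sum the values.
    Where: fixed windows of `size` seconds in event time.
    When:  emit updated results after each arriving batch.
    How:   accumulate, so late data updates the earlier total for its window.
    """
    totals = defaultdict(int)
    emitted = []
    for batch in batches:
        touched = set()
        for event_time, value in batch:
            start = window_start(event_time, size)
            totals[start] += value
            touched.add(start)
        # Emit a refreshed total for every window this batch changed.
        for start in sorted(touched):
            emitted.append((start, totals[start]))
    return emitted


# Two batches in arrival (processing-time) order.
# The element (30, 4) belongs to the first window but arrives late.
batches = [
    [(5, 1), (42, 2), (65, 10)],
    [(70, 5), (30, 4)],
]
results = process(batches)
for start, total in results:
    print(f"window [{start}, {start + WINDOW_SECONDS}): total = {total}")
```

The window starting at 0 is emitted twice: first with the total seen so far, then again once the late element arrives. A real pipeline would express the same choices through an SDK's windowing and triggering settings rather than hand-written loops.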
Beam also provides a portability framework that allows pipelines to be run on a variety of execution technologies. Beam itself provides a reference runner. There are also efforts to develop runners based on Apache Flink and Apache Spark. Google provides a commercial managed runner on Google Cloud. Beam builds on the work of MapReduce, Hadoop, Flume, Spark, and Flink.