Stream processing with Apache Beam and Spark


Details
18:00 - 18:30 - Mingling
18:30 - 19:15 - What we’ve learned (so far) from developing a stream-processing platform @ PayPal scale - Amit Sela @ PayPal
19:15 - 20:00 - Fundamentals of stream processing with Apache Beam - Tyler Akidau @ Google
“What we’ve learned (so far) from developing a stream-processing platform @ PayPal scale”
Abstract:
PayPal is the payment industry’s leader in Risk management. Using our data, machine learning, and human detective work, we are able to accurately detect fraud and separate good users from bad actors, in real time and at very large scale.
A year ago, we embarked on re-inventing Risk's Data platform, to support PayPal’s growth and to maintain our competitive advantage in Risk and fraud detection.
The first component we’re releasing is how we manage data in motion, i.e., stream processing.
What can streaming offer as a computational platform? Where are its strengths?
How do you choose the right technology for you? And why did we choose Spark?
What challenges did we find with stream processing, and how did we overcome some of them? What gaps remain, and how does it all relate to “modeling” the problem of stream processing?
Where is Apache Spark going (2.0)? And how does this all come together nicely?
Bio:
Amit Sela (https://www.linkedin.com/in/amit-sela-7aa05035) is a Senior Software Engineer @ PayPal and a committer for Apache Beam, currently working on Risk’s next generation Big Data platform focusing on stream-processing. Amit is also an open-source enthusiast who spent the past 5 years working with Hadoop, HBase, Sqoop, Spark and Kafka, and recently got the chance to give something back to the community by working on the Spark runner for Apache Beam.
Fundamentals of stream processing with Apache Beam
Abstract:
Apache Beam (http://beam.incubator.apache.org/) (unified Batch and strEAM processing!) is a new Apache incubator project. Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.
Come learn about the fundamentals of out-of-order stream processing, and how Beam’s powerful tools for reasoning about time greatly simplify this complex task. Beam provides a model that allows developers to focus on the four important questions that must be answered by any stream processing pipeline:
What results are being calculated?
Where in event time are they calculated?
When in processing time are they materialized?
How do refinements of results relate?
Furthermore, by cleanly separating these questions from runtime characteristics, Beam programs become portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink and Spark).
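To make the four questions above concrete, here is a minimal sketch in plain Python. It deliberately does not use the Beam SDK: the names `WINDOW_SIZE`, `window_start` and `run`, the 60-second fixed windows, the emit-on-every-element trigger, and the accumulating refinements are all assumptions chosen for illustration, not Beam's API or anything shown in the talk.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed-window width


def window_start(event_time):
    # "Where": assign each event to a fixed window in event time.
    return event_time - (event_time % WINDOW_SIZE)


def run(events):
    """Process (event_time, processing_time, value) tuples.

    Events may be out of order in event time. Returns a list of
    (processing_time, window_start, running_sum) panes.
    """
    sums = defaultdict(int)  # per-window running state
    panes = []
    # Handle elements in arrival order, i.e. by processing time.
    for event_time, processing_time, value in sorted(events, key=lambda e: e[1]):
        w = window_start(event_time)
        sums[w] += value  # "What": a per-window sum
        # "When": emit a pane on every element.
        # "How": each pane accumulates, so it replaces the previous
        # result for the same window.
        panes.append((processing_time, w, sums[w]))
    return panes


events = [
    (5, 10, 3),   # belongs to window [0, 60)
    (65, 70, 4),  # belongs to window [60, 120)
    (50, 80, 2),  # arrives after the event above, but belongs to window [0, 60)
]
panes = run(events)
for pane in panes:
    print(pane)
```

The third event arrives last yet still updates the first window, which is the out-of-order case the abstract refers to. A real pipeline would typically use watermarks rather than emitting on every element; that part is omitted here.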
Bio:
Tyler Akidau (https://www.linkedin.com/in/tyler-akidau-5221672) is a Staff Software Engineer @ Google. The current tech lead for internal streaming data processing systems (e.g. MillWheel), he’s spent six years working on massive-scale streaming data processing systems. He passionately believes in streaming data processing as the more general model of large-scale computation.
