100% Streaming!!! Google Cloud + Apache Beam (Google) + Flink (Data Artisans)


Details
Aligned with Strata San Jose 2017 (Mar 13-16)
• FREE AND OPEN TO THE PUBLIC
• YOU DO NOT NEED A CONFERENCE TICKET TO ATTEND
• THIS EVENT IS AT THE SAN JOSE CONVENTION CENTER
• ROOM LL21B
• AGAIN, THIS IS FREE AND OPEN TO THE PUBLIC
Talk 0: Meetup and Technology Updates (Chris Fregly (http://linkedin.com/in/cfregly), PipelineIO (http://pipeline.io/))
• We hit 6000 members!!
• #3 Spark Meetup in the World (#1 TensorFlow Meetup)
• More coming soon...
Talk 1: Reference Architecture for In-Stream Processing Service, Illustrated by Real-time Twitter Sentiment Analytics application (Victoria Livschitz (https://www.linkedin.com/in/victorialivschitz), Founder/CTO of Grid Dynamics (https://www.griddynamics.com/))
Stream processing architectures are rapidly emerging in various business domains to offer real-time ML and processing applications such as fraud detection, risk management, sentiment analytics and many more.
Topics presented by Victoria will include:
• Introduction to In-Stream Processing
• Introduction to real-time sentiment analysis applications for Twitter streams
• Overview of Reference Architecture (RA) for ISP using Kafka/Spark Streaming/Cassandra/Redis/HDFS
• Overview of Reference Implementation (RI) and devops stack for portable cloud deployment using Docker, Mesos/Marathon, Ansible and Tonomi
• Demonstration of all technologies at work
Speaker Bio
Victoria Livschitz (https://www.linkedin.com/in/victorialivschitz) is a founder and CTO of Grid Dynamics (https://www.griddynamics.com/), an engineering services company that specializes, among other things, in the design of big data, real-time analytics and machine learning applications using open source technologies running in a cloud-portable immutable infrastructure. Victoria will present the reference architecture for an In-Stream Processing Service that is designed with 100% open source components and runs on any cloud.
Talk 2: Deep Dive into Apache Beam (Tyler Akidau (https://www.linkedin.com/in/tyler-akidau-5221672) and Reuven Lax (https://www.linkedin.com/in/reuven-lax-a82818/), Google)
In this talk, Tyler will describe the architecture of Apache Beam and Google Cloud Dataflow - a well-designed, high-performance streaming engine.
Apache Beam is a set of open source SDKs for writing pipelines that you can then run on any platform with a supported Runner (currently: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow).
Cloud Dataflow is Google's closed-source execution engine provided as a managed service on Google Cloud for running Beam pipelines.
The goal of this talk is to understand the unique patterns, design choices, and trade-offs that make Apache Beam and Cloud Dataflow compelling options.
Speaker Bio
Tyler Akidau (https://www.linkedin.com/in/tyler-akidau-5221672) and Reuven Lax (https://www.linkedin.com/in/reuven-lax-a82818/) are Software Engineers at Google. Tyler is focused on the Apache Beam streaming programming model. Reuven is focused on the Google Cloud Dataflow streaming execution engine.
Talk 3: Deep Dive into Flink Streaming (Jamie Grier (https://www.linkedin.com/in/jamiegrier), Data Artisans)
In this talk, Jamie will describe the architecture of Flink Streaming - a well-designed, high-performance streaming engine.
While some comparisons will be made to Spark Streaming, this talk is not intended to convince people to switch to Flink Streaming.
The goal of this talk is to understand the unique patterns, design choices, and trade-offs that make Flink Streaming a compelling option.
Speaker Bio
Jamie Grier (https://www.linkedin.com/in/jamiegrier) (based in San Francisco) is Director of Application Development at Data Artisans (based in Berlin).
Jamie also serves as Developer Advocate, Solution Architect, Sales Engineer, and many other roles required by a startup!
Jamie's wife recently had a baby, so please congratulate him!
Talk 4: Incremental, Online, Continuous, and Parallel Training and Serving of Spark ML and TensorFlow Models with Kafka, Docker, and Kubernetes (Chris Fregly (http://linkedin.com/in/cfregly), PipelineIO (http://pipeline.io/))
The goal of this talk is to build and demo a continuous-delivery pipeline for training and serving Spark ML and TensorFlow models in parallel using Kafka with Docker, Kubernetes, and Netflix Open Source.
Speaker Bio
Chris Fregly (https://www.linkedin.com/in/cfregly) is a Research Scientist at PipelineIO (http://pipeline.io) - a Streaming Analytics and Machine Learning Startup in San Francisco.
Chris is also an Apache Spark Contributor, Netflix Open Source Committer, Founder of the Global Advanced Spark and TensorFlow Meetup, and Creator of the upcoming O'Reilly Video Series on Deploying and Scaling TensorFlow Distributed and TensorFlow Serving in Production.
Previously, Chris was a Distributed Systems Engineer at Databricks and Netflix - as well as a founding member of the IBM Spark Technology Center in San Francisco.
Relevant Concepts and Links
• Direct KafkaRDD API (Performance and Fault Tolerance)
• Back Pressure and Rate Limiting (Production Readiness)
• Micro-batch Job Scheduling (Performance)
• Write-Ahead Log (Fault Tolerance)
• Best Practices for Spark Streaming Overall
• Load Balance and Scale Out Streaming Receivers across the Cluster
• Reliably Re-launch Failed Receivers
• Secor: Popular open source project from Pinterest to write directly from Kafka to S3
http://www.virdata.com/tuning-spark/
https://forums.databricks.com/questions/1276/kafka-direct-api-from-spark-streaming-what-happens.html
https://www.sigmoid.com/spark-streaming-code/
https://github.com/pinterest/secor
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://issues.apache.org/jira/browse/SPARK-10816
http://beam.incubator.apache.org/
