Distributed Tensorflow, Tensorflow XLA JIT Compiler, Spark 2.0 Streaming, Mesos

Advanced Kubeflow Meetup (San Francisco, Global)


600 Townsend St #200 · San Francisco, CA

How to find us

Follow signs and/or ask the building people where PagerDuty is located (2nd Floor)


Talk 0: Meetup and Technology Updates (Chris Fregly)

Announcing the first ever ...

** PipelineIO GPU Deep Learning Summit West 2017 **

• RSVP Here: https://pipeline-ai-gpu-dev-summit-west-tensorflow-2017.eventbrite.com

• Sept 16, 2017 @ Santa Clara Convention Center!

• Only 500 available spots

• Every attendee will get a GPU instance for the day

• Together, we will build the largest, hybrid-cloud Spark, Tensorflow, and GPU Cluster in the World!!

Spark Summit East 2017

• Largely a snooze. I don't pay $ anymore. Only free live stream.

• GPU story is still very weak

• Reminder that Spark is meant for general purpose ETL

• There are better alternatives for AI/ML and GPUs

• Even the TensorFrames developer, Tim Hunter, spoke nothing about TensorFrames - despite the title of his talk

• Databricks confusingly announced a proprietary S3 Caching Service - disguised as open source until an audience member called them out

• This is the start of a series of proprietary extensions to Spark internally called "Project Edge" by Databricks.

• "Project Edge" is designed to give Databricks the "edge" over other Spark distributions

• Represents a very scary trend for Spark

• The small number of Databricks Engineers (mostly open source committers) are now focused on proprietary extensions.

• Not good... Keep an eye on this over the next 6-12 months

Yahoo's TensorFlow On Spark (github (https://github.com/yahoo/TensorFlowOnSpark))

• Tensorflow (ML/AI) + Spark (ETL) ... Finally Done Right!

• Again, TensorFrames is a dead project

• Thanks Andy Feng (https://www.linkedin.com/in/afeng/) @ Yahoo!!

Tensorflow Dev Summit West 2017

• Super exciting - and free!

• Every major area of Tensorflow was covered

Tensorflow Core

• SavedModel is now the standard way of saving a Tensorflow Model (finally!)

• Java API (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/java) - mostly for inference at this time

Tensorflow Ecosystem (https://github.com/tensorflow/ecosystem)

• Interoperability between Spark/Hadoop Files and Tensorflow TFRecords

• HDFS Support (https://www.tensorflow.org/versions/master/how_tos/hadoop/)

Tensorflow Development

• Debugger (tfdbg (https://www.tensorflow.org/versions/r1.0/how_tos/debugger/)), (examples (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/python/debug/examples))

• Tensorflow Timeline Visualization Tool (issue (https://github.com/tensorflow/tensorflow/issues/1824)) (github (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/client/timeline.py)) (uses Chrome Trace Format (https://google.github.io/tracing-framework/overview.html))
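For a rough idea of what a Chrome Trace Format file looks like, here is a minimal sketch built by hand with the Python standard library. The op names, timings, and the make_event helper are hypothetical and only for illustration; real traces are produced by Tensorflow's timeline.py, not by this code.

```python
import json


def make_event(name, start_us, dur_us, pid=0, tid=0):
    """Build one 'complete' event (ph='X') in Chrome Trace Format.

    Timestamps (ts) and durations (dur) are in microseconds.
    """
    return {"name": name, "ph": "X", "ts": start_us, "dur": dur_us,
            "pid": pid, "tid": tid}


# Hypothetical ops and timings, for illustration only.
events = [
    make_event("MatMul", start_us=0, dur_us=120),
    make_event("BiasAdd", start_us=120, dur_us=15),
    make_event("Relu", start_us=135, dur_us=10),
]

# The JSON object form wraps the events in a "traceEvents" array.
trace_json = json.dumps({"traceEvents": events}, indent=2)


def total_duration_us(evts):
    """Wall-clock span covered by the events: last end minus first start."""
    start = min(e["ts"] for e in evts)
    end = max(e["ts"] + e["dur"] for e in evts)
    return end - start
```

Writing trace_json to a file and loading it in chrome://tracing should show each op as a bar on a shared time axis.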

Distributed Tensorflow

• More than just Round Robin task-placement strategy

Tensorflow Serving

• Multi-headed inference capabilities on a single trained model

Tensorflow Performance

• XLA (https://www.tensorflow.org/versions/master/experimental/xla/): Accelerated Linear Algebra, JIT compiler

• tfcompile: AOT compiler creates platform-specific binaries for x86, ARM, GPU

• Operation fusing (similar to Spark fusing/pipelining)

• Whole stage and vectorization optimizations (similar to Spark)
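Operation fusing can be pictured with a small pure-Python analogy. This is only a sketch of the idea, not how XLA is implemented: XLA fuses ops at the compiled-kernel level, while the hypothetical functions below just contrast three passes over the data with one.

```python
def unfused(xs, w, b):
    """Three separate passes, each materializing an intermediate list."""
    scaled = [x * w for x in xs]            # pass 1: multiply
    shifted = [s + b for s in scaled]       # pass 2: add bias
    return [max(0.0, v) for v in shifted]   # pass 3: ReLU


def fused(xs, w, b):
    """One pass: multiply, add bias, and ReLU per element, no intermediates."""
    return [max(0.0, x * w + b) for x in xs]
```

Both functions return the same values; the fused version simply avoids writing and re-reading the intermediate results, which is where the memory-bandwidth savings come from.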

Nvidia GPUs

• TensorRT (https://developer.nvidia.com/tensorrt) (formerly known as GPU Inference Engine) Early Access Program

• INT8 + FP16 Half-precision Optimized Inference for Tesla P100 and Jetson TX1 GPUs

Scikit-Learn Support in PipelineIO

• Supports any python snippet!


• High Performance Spark ML and Tensorflow AI Model Serving

• Request Batching and Circuit Breakers with NetflixOSS (load test)

• Latency and Batching Metrics using Prometheus + Kubernetes + NetflixOSS

• Serving Scikit Learn Models and any Python code (similar to AWS Lambda)

• Creating a Spark Job from Jupyter Notebook

• Distributed Tensorflow + Tensorboard + HDFS

• Chaos Kube: NetflixOSS-style Chaos Monkey + Kubernetes

"Yee Hawww! Kill them Docker Containers!!"


Talk 1: PagerDuty's Real-time, High-scale Predictive Analytics Engine using Spark ML, Spark 2.0 Structured Streaming, and Kafka
(Anna Khasanova (https://www.linkedin.com/in/anna-khasanova-a2205529), Data Scientist and Software Engineer @ PagerDuty (https://www.pagerduty.com/))

We're very excited to have Anna present PagerDuty's predictive analytics data pipeline that powers their new "Events" feature.

Events include things like code deploys and configuration changes - anything that could potentially lead to an outage.

By analyzing the real-time stream of events - along with the real-time stream of metrics that PagerDuty already collects - PagerDuty's Scala, Python, Docker, and Mesos-based pipeline will warn customers of potential outages before they occur.

Anna's team has been using Spark since Spark 1.4 - and has always pushed the limits of Spark, Spark Streaming, and Spark ML in terms of scale and functionality.


Talk 2: Tensorflow XLA (https://www.tensorflow.org/versions/master/experimental/xla/) JIT Compiler + Batch Normalization (https://www.tensorflow.org/versions/master/api_docs/python/nn/#batch_normalization) (Fabrizio Milo (https://www.linkedin.com/in/fmilo/), Tensorflow Contributor and Deep Learning Engineer @ H2O.ai (http://h2o.ai/))

Fabrizio walks us through how the XLA JIT compiler works and how to apply it to our Tensorflow AI model deployments.

To visualize the new compiler improvements we will use Tensorflow's Timeline Visualization Tool (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/client/timeline.py) - and show how to interpret the resulting timelines.

Bonus: What is Batch Normalization (https://www.tensorflow.org/versions/master/api_docs/python/nn/#batch_normalization)? And why should you use it?
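As a quick preview of the bonus topic, here is a minimal sketch of the usual training-time batch normalization formula for a single feature, written in plain Python. It is not Tensorflow's implementation; gamma, beta, and eps are the conventional names for the learned scale, learned shift, and numerical-stability constant, and the default values are assumptions chosen for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars for one feature (training-time form).

    Each value is shifted by the batch mean, divided by the batch standard
    deviation, then scaled by gamma and shifted by beta.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    inv_std = 1.0 / math.sqrt(var + eps)           # eps avoids division by zero
    return [gamma * (x - mean) * inv_std + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With the default gamma and beta, the output has mean close to 0 and variance close to 1, which keeps the inputs to the next layer in a stable range during training. At inference time, frameworks typically substitute running averages for the batch statistics; that part is omitted here.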


Talk 3: Building Real Time Analytic Pipelines with SMACK + ElasticSearch on DC/OS

(Chris Gutierrez (https://www.linkedin.com/in/christophergutierrez/), Head of Analytics, and Sunil Shah (https://www.linkedin.com/in/geekonabicycle/), Engineering Manager, Mesosphere (http://mesosphere.io))

This talk will cover building a SMACK stack for data analytics on DC/OS. We'll create a demo installing and configuring a scalable cluster using DC/OS.

The demo will simulate a real time visualization using ElasticSearch.


Talk 4: Distributed Tensorflow and Tensorboard + High Performance Tensorflow Serving and Request Batching + Kubernetes and Prometheus Metrics Collection for Prediction Services (Chris Fregly (https://www.linkedin.com/in/cfregly) from PipelineIO (http://pipeline.io/))

Distributed Tensorflow + Tensorboard

Tensorflow + HDFS

Hybrid Cloud Deployments

eXtreme High Availability (XHA)

Tensorflow Serving

Request Batching

Prometheus-based Metrics Collection for Prediction Services
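To make the request batching idea concrete, here is a minimal single-threaded sketch in Python. It is not Tensorflow Serving's batching implementation: the RequestBatcher class and the double_all stand-in model are hypothetical, and a real server would also flush on a timeout and handle concurrent callers.

```python
class RequestBatcher:
    """Collect single requests and run the model once per full batch."""

    def __init__(self, predict_batch, max_batch_size):
        self.predict_batch = predict_batch    # function: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.pending = []                     # inputs waiting for a batch
        self.results = []                     # outputs, in arrival order
        self.batches_run = 0                  # number of model invocations

    def submit(self, request):
        """Queue one request; run the model as soon as a batch is full."""
        self.pending.append(request)
        if len(self.pending) >= self.max_batch_size:
            self.flush()

    def flush(self):
        """Run whatever is pending as one batch (e.g. when a timeout fires)."""
        if not self.pending:
            return
        self.results.extend(self.predict_batch(self.pending))
        self.pending = []
        self.batches_run += 1


def double_all(inputs):
    """Hypothetical stand-in model: doubles each input."""
    return [2 * x for x in inputs]


batcher = RequestBatcher(double_all, max_batch_size=4)
for x in range(10):
    batcher.submit(x)
batcher.flush()  # drain the final partial batch
```

Ten requests with a batch size of 4 result in three model invocations instead of ten, which is the trade being made: slightly higher latency per request in exchange for better hardware utilization.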

Speaker Bio

Chris Fregly (https://www.linkedin.com/in/cfregly) is a Research Scientist at PipelineIO (http://pipeline.io/) - a Machine Learning and Artificial Intelligence Startup in San Francisco.

Chris is an Apache Spark Contributor, Netflix Open Source Committer, Founder of the Global Advanced Spark and TensorFlow Meetup, and Author of the Upcoming O'Reilly Video Series, "Deploying and Scaling Tensorflow Distributed in Production."

Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member of the IBM Spark Technology Center in San Francisco.

Related Links

• https://developer.nvidia.com/tensorrt

• https://github.com/tensorflow/tensorflow/tree/master/tensorflow/java

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/examples/LabelImage.java

• https://devblogs.nvidia.com/parallelforall/production-deep-learning-nvidia-gpu-inference-engine/

• https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/client/timeline.py