
A "Must See" Event! Deep Dive with Spark Contributor Chris Fregly

Hosted by Craig W.

Details

Research Scientist Chris Fregly (https://www.linkedin.com/in/cfregly) will be kicking off our inaugural meeting with an interactive deep-dive presentation that you won't want to miss!

Chris is an Apache Spark Contributor, a Netflix Open Source Committer, founder of the wildly popular Advanced Spark and TensorFlow Meetup (https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/) in San Francisco, and author of the upcoming Advanced Spark (http://advancedspark.com/) book. He actively shares his knowledge at meetups and conferences throughout the world (http://www.slideshare.net/cfregly). In fact, he is also presenting at MLconf (http://mlconf.com/events/atlanta-ga/) the following afternoon.

Bottom line: Chris knows Apache Spark. We couldn't ask for a better speaker to kick off the Atlanta Apache Spark User Group's first meetup event!

Details:
Chris will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark ecosystem, covering the following topics:

• Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift

• Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC

• Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD

• Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird

• Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP

• Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll

Resources:
This talk features many interactive demos, as well as code-level deep dives into many of the projects listed above.

All demo code is available on GitHub here: https://github.com/fluxcapacitor/pipeline/wiki

In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub here: https://hub.docker.com/r/fluxcapacitor/pipeline/

Atlanta Apache Spark User Group