Spark 1.5: Real-time, Advanced Analytics, Kafka, Cassandra, ES, Zeppelin, Docker

Hosted By
Vladimir G.

Details

Hello,

I am glad to invite you to another Spark meetup. This time we are hosting a presentation by our special guest, Chris Fregly (https://www.meetup.com/members/14250450/), an organizer of the Advanced Apache Spark Meetup in San Francisco (https://www.meetup.com/Advanced-Apache-Spark-Meetup/)!

Abstract

Combining the most popular and technically deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark ecosystem by exploring the following:

  1. Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift

  2. Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC

  3. Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD

  4. Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird

  5. Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP

  6. Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll

Demos

This talk features many interesting, audience-interactive demos as well as code-level deep dives into many of the projects listed above. All demo code is available on GitHub at the following link: https://github.com/fluxcapacitor/pipeline/wiki

In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/

Speaker Bio

Chris Fregly is a Principal Data Solutions Engineer for the newly formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, the Organizer of the global Advanced Apache Spark Meetup, and the author of the upcoming book Advanced Spark. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix. When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.

Helsinki Apache Spark Meetup #helspark
Kiosked
Keilaranta 1 · Espoo