- Building a Spark-based Insider Threat Detection Solution for a Major US Bank
In this interactive meetup, the StreamAnalytix team will demonstrate how they built an Apache Spark-based insider threat detection solution to replace the legacy system of a major US-based bank. The session will explore the business problem, the challenges of the legacy system, the solution implementation, and the results. It will be followed by a detailed overview of StreamAnalytix, an Apache Spark-based analytics platform that supports end-to-end data ingestion, enrichment, application of complex business rules, and machine learning in the cloud.

Note: This presentation is co-listed with the Big Data ATL Group.

Speaker: Sonam Sharma, Solution Architect, Impetus Technologies

Sonam Sharma is a Solutions Architect for StreamAnalytix. She brings rich big data experience, with expertise in implementing solutions built on Apache Spark and Storm across industries such as telecom and retail.
- Databricks New Features and Future Directions
Databricks engineering will join us to discuss new platform features and offerings highlighted at the April Spark + AI Summit in San Francisco. These include the introduction of Koalas to enable scaling pandas workloads, MLflow enhancements, and Delta for improved management of data in motion and at rest. This is a great opportunity to see how these features can accelerate productivity for Spark developers and Spark workloads in cloud environments. Networking will begin before the meeting, and the presentation will start around 6:30.

Speaker: Keith Anderson, Databricks

Keith is currently a solutions architect with Databricks, based out of Chapel Hill, NC. He has an engineering degree from Rutgers University and has spent the last 20+ years working in various IT roles for both software vendors and end-customer accounts, including Commvault, GlaxoSmithKline, and Dell-EMC. When he isn’t working, Keith is usually hanging out with his wife, two daughters, and his Jeep Wrangler. https://www.linkedin.com/in/keith-anderson-2480b027/

The event is hosted at the Shadow-Soft offices - https://shadow-soft.com.
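To make the Koalas idea concrete: the pandas API stays the same while execution moves to Spark. A minimal sketch, assuming a simple groupby workload (the Koalas lines are commented out because they need a live Spark cluster; `databricks.koalas` was the package name at the time, later folded into `pyspark.pandas`):

```python
# A small pandas aggregation; with Koalas the identical code runs on Spark.
import pandas as pd

pdf = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [10, 20, 30]})
result = pdf.groupby("dept")["salary"].mean()
print(result.to_dict())  # {'a': 15.0, 'b': 30.0}

# The Koalas equivalent (requires a Spark cluster; the API mirrors pandas):
# import databricks.koalas as ks
# kdf = ks.from_pandas(pdf)
# kdf.groupby("dept")["salary"].mean()  # executes as distributed Spark jobs
```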
- Chick-fil-A Spark Use Case for Data & Analytics
The Chick-fil-A team will present how the business and IT collaborated to implement Spark for analytical use cases. Over the past few years, the concept of a “Data Lake” has become more and more widespread as organizations have seen the importance of centralizing key data assets for applications and analytics. Many companies have been on this journey, but at Chick-fil-A a “Data Lake” alone wasn’t enough: they had to build an ecosystem of tools focused on their core users, Data Engineers and Data Analysts. What does this mean? It means that to facilitate this process well, there needed to be some drastic changes in how Chick-fil-A thought about and approached data. As data volumes grew, robust ETL mechanisms were needed to optimize incoming data for “Big Data” transformations. If you are thinking that Apache Spark can fit the bill, you are correct! During this talk we will share:

• Chick-fil-A’s “Why” for starting this journey
• What we hope to achieve
• The highs and the lows of our Spark journey
• Where we are today and where we want to be with Data & Analytics

Presenters:

Korri Jones is a Lead Analyst in Enterprise Analytics at Chick-fil-A Corporate in Atlanta, GA. Prior to his current work at Chick-fil-A, he worked as a Business Analyst and Product Trainer for NavMD, Inc., was an Adjunct Professor at Roane State Community College, and was an Instructor for the Project GRAD summer program at Pellissippi State Community College and the University of Tennessee in Knoxville.

Alberto Rama has worked for Chick-fil-A Corporate in Atlanta, GA for 10 years, currently as a Lead Software Engineer in Data & Analytics. Prior to venturing into the restaurant industry, he worked in government, financial services, and telecommunications for clients in the U.S., Europe, and Latin America.
- Laying the Foundation for Ionic Platform Insights on Spark
The Ionic Analytics team shares insights about the system they built using Spark and Databricks to enable low-cost, flexible reporting and lay a foundation for advanced analytics. They will cover the whole lifecycle of the project, including build tools, testing, CI, and deployment, to provide a broad overview of what they have learned putting together a production-ready Spark application.
- Building IoT Pipelines with Spark, Kafka and MemSQL
Dale Deloy will demonstrate how to ingest data into MemSQL using the massively parallel processing capabilities of Spark and MemSQL. The use case will feature JSON documents from sensors being pushed through a Kafka topic directly into a MemSQL database. MemSQL's JSON processing will then be leveraged for high-speed query access from BI/SQL tools to deliver executive dashboards. This meetup is happening in conjunction with the Atlanta Hadoop Users Group (https://www.meetup.com/Atlanta-Hadoop-Users-Group).
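The pipeline described above can be sketched briefly. The topic, host, and field names below are made up for illustration, and the Spark calls are commented out because they need a live cluster, so only the parsing of a sample sensor document runs standalone:

```python
# Hypothetical sketch: sensor JSON documents arrive on a Kafka topic, are
# parsed by Spark, and land in MemSQL for SQL/BI access.
import json

sample_doc = '{"sensor_id": "s-17", "temp_c": 21.5, "ts": "2018-06-01T12:00:00Z"}'
record = json.loads(sample_doc)
print(record["sensor_id"], record["temp_c"])

# On a cluster, documents of this shape flow through Spark from Kafka:
# stream = (spark.readStream.format("kafka")
#           .option("kafka.bootstrap.servers", "broker:9092")
#           .option("subscribe", "sensor-events")
#           .load())
# MemSQL speaks the MySQL wire protocol, so each batch can then be written
# out over a standard JDBC/MySQL connection for dashboard queries.
```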
- Spark 2.3 Update, Machine Learning Pipelines Intro, and CI/CD Howto
Two presentations from Databricks tonight! First, Joe Kambourakis will focus on new updates from the March release of Spark 2.3, such as Structured Streaming's continuous processing mode, vectorized UDFs, and native Kubernetes support. He'll then give an introduction to Spark machine learning pipelines with Decision Tree Classifier models. Next, Pete Tamisin will show how to use the Databricks REST API and built-in integration with GitHub to build Continuous Integration/Deployment/Delivery (CI/CD) pipelines for your notebooks and jobs. His talk will focus on using Jenkins and Python scripts to deploy promoted code branches to various environments, and on running smoke tests to validate each deployment.

Speakers:

Joe Kambourakis is a data science instructor at Databricks. He has more than 10 years of experience teaching, over five of them in data science and analytics. Previously, Joe was an instructor at Cloudera and a technical sales engineer at IBM. He has taught in over a dozen countries around the world and has been featured on Japanese television and in Saudi newspapers. He is a rabid Arsenal FC supporter and competitive Magic: The Gathering player. Joe holds a BS in Electrical and Computer Engineering from Worcester Polytechnic Institute and an MBA with a focus in Analytics from Bentley University. He lives with his wife and daughter in Needham, MA.

Over the course of his 20+ year career, Pete Tamisin has fulfilled many roles, including consultant, solution architect, database administrator, data modeler, web developer, trainer, team leader, and product manager. Based in Atlanta, GA, Pete has delivered projects of varying sizes across multiple verticals, including utilities, financials, higher education, and manufacturing. Currently, he is a Customer Success Engineer with Databricks, where he provides customers with the support, training, and information they need to be successful on the Databricks platform.
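A quick taste of the vectorized UDFs mentioned above: they operate on whole pandas Series batches instead of one row at a time. A minimal sketch, with the pyspark registration commented out because it needs a SparkSession with PyArrow installed; the function itself runs against plain pandas:

```python
import pandas as pd

def fahrenheit_to_celsius(s: pd.Series) -> pd.Series:
    # Receives an entire batch of rows as a pandas Series at once.
    return (s - 32) * 5.0 / 9.0

print(fahrenheit_to_celsius(pd.Series([32.0, 212.0])).tolist())  # [0.0, 100.0]

# On a cluster this becomes a vectorized UDF (Spark 2.3+ with PyArrow):
# from pyspark.sql.functions import pandas_udf
# f_to_c = pandas_udf(fahrenheit_to_celsius, "double")
# df.select(f_to_c(df["temp_f"])).show()
```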
- How to Share State Across Multiple Spark Jobs using Apache Ignite
This session will demonstrate how to easily share state in memory across multiple Spark jobs, either within the same application or between different Spark applications, using an implementation of the Spark RDD abstraction provided in Apache Ignite (https://ignite.apache.org/). Attendees will learn in detail how IgniteRDD (https://ignite.apache.org/features/igniterdd.html) - an implementation of the native Spark RDD and DataFrame APIs - shares the state of an RDD across other Spark jobs, applications, and workers. Examples will show how IgniteRDD allows execution of SQL queries many times faster than native Spark RDDs or DataFrames, thanks to its advanced in-memory indexing capabilities.

Presenter:

Akmal Chaudhri, PhD, is a Technology Evangelist for GridGain Systems (https://www.gridgain.com/). His role is to help build the global Apache Ignite community and raise awareness through presentations and technical writing. Akmal has over 25 years of experience in IT and has previously held roles as a developer, consultant, product strategist, and technical trainer. He has worked for several blue-chip companies such as Reuters and IBM, as well as the big data startups Hortonworks (Hadoop) and DataStax (Cassandra NoSQL database).
- Intro to Building a Distributed Pipeline for Real Time Analysis of Uber's Data
Full title: Introduction to Building a Distributed Machine Learning Pipeline for Real-Time Analysis of Uber Data Using Apache APIs: Kafka, Spark, and HBase

In this talk we will look at a solution that combines real-time data streams with machine learning to analyze and visualize popular Uber trip locations in New York City. You will see the end-to-end process required to build this application using Apache APIs for Kafka, Spark, and HBase. According to Gartner, by 2020 smart cities will be using about 1.39 billion connected cars, IoT sensors, and devices. The analysis of behavior patterns within cities will allow optimization of traffic, better planning decisions, and smarter advertising. You may be excited about the possibilities of exploiting data streams to gain actionable insights from continuously produced data in real time, but you may find it difficult to conceptualize how to implement such a solution. We will walk you through an architecture that combines data streaming with machine learning to enrich Uber trip data, analyzing and visualizing the most popular pick-up/drop-off locations by date and time so that drivers' locations can be optimized and priced according to demand. The presentation will consist of four sections:

• Introduction to Spark machine learning for developers
• Kafka and Spark Streaming
• Real-time dashboard using a microservice framework
• Using the Spark HBase connector for parallel writes and reads

About the speaker:

Carol McDonald is a solutions architect at MapR focusing on big data, Apache Kafka, Apache HBase, Apache Drill, Apache Spark, and machine learning in healthcare, finance, and telecom. Previously, Carol worked as a Technology Evangelist for Sun and as an architect/developer on a large health information exchange, a large loan application for a leading bank, pharmaceutical applications for Roche, telecom applications for HP, messaging applications for IBM, and SIGINT applications for the NSA. Carol holds an MS in computer science from the University of Tennessee and a BS in geology from Vanderbilt University.

About our host venue:

Honeywell creates some of the world's most sophisticated software-based technologies that play a major role in the Internet of Things (IoT), helping everything from aircraft, cars, homes and buildings, manufacturing plants, supply chains, and workers become more connected to make our world smarter, safer, and more sustainable. If you want to make a difference in these critical industries, you can apply here - https://www.honeywell.com/careers
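The machine learning step at the heart of a pipeline like this - grouping trips by location and labeling each new trip with its nearest cluster center - can be sketched in a few lines. The centers and coordinates below are made-up values, and at scale this step would use Spark ML's KMeans (shown commented out, as it needs a cluster):

```python
# Sketch: assign each pickup point to the nearest cluster center.
import math

centers = [(40.75, -73.99), (40.64, -73.78)]  # e.g. roughly Midtown and JFK

def nearest_center(lat, lon):
    # Euclidean distance is adequate at city scale for illustration.
    dists = [math.hypot(lat - c[0], lon - c[1]) for c in centers]
    return dists.index(min(dists))

print(nearest_center(40.76, -73.98))  # near the first center -> 0
print(nearest_center(40.65, -73.79))  # near the second center -> 1

# With Spark ML, the same step over a DataFrame of trips would be:
# from pyspark.ml.clustering import KMeans
# model = KMeans(k=2, featuresCol="features").fit(trips_df)
# model.transform(trips_df)  # adds a "prediction" cluster column per trip
```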
- One Night, Three Presentations!
We have a team of presenters from IBM who have crafted a fast-moving set of presentations. Here's a quick summary of each:

Supporting Highly Multitenant Spark Notebook Workloads: Best Practices and Useful Patches
Presenters: Brad Kaiser and Craig Ingram

Notebooks: they enable our users, but they can cripple our clusters. Let's fix that. Notebooks have soared in popularity at companies worldwide because they provide an easy, user-friendly way of accessing the cluster-computing power of Spark. But the more users you have hitting a cluster, the harder it is to manage the cluster resources, as big, long-running jobs start to starve out small, short-running jobs. While you could have users spin up EMR-style clusters, this reduces the ability to take advantage of the collaborative nature of notebooks. It also quickly becomes expensive as clusters sit idle for long periods of time waiting on single users. What we want is fair, efficient resource utilization on a large single cluster for a large number of users. In this talk we'll discuss dynamic allocation and the best practices for configuring the current version of Spark as-is to help solve this problem. We'll also present new improvements we've made to address this use case, including: decommissioning executors without losing cached data, proactively shutting down executors to prevent starvation, and improving the start times of new executors.

Spark-Bench: Simulate, Test, Compare, and Yes, Even Benchmark!
Presenter: Emily May Curtin
https://sparktc.github.io/spark-bench/

Spark-Bench is an open-source benchmarking tool, and it's also so much more: a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. Spark-Bench originally began as a benchmarking suite to get timing numbers on very specific algorithms, mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. This talk will discuss the high-level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and I/O-intensive workloads; and, yes, even benchmarking!

Spark-Tracing: A Flexible Instrumentation Package for Visualizing the Internal Operation of Apache Spark
Presenter: Matthew Schauer

When tuning a large Spark job or making changes to Spark itself, finding bottlenecks can be difficult. To alleviate this problem, Spark-Tracing uses Java's bytecode instrumentation facility to record a variety of user-configurable data during the Spark run, such as the content of inter-process communications and the amount of time spent in various functions, and displays these as an easy-to-read interactive sequence diagram. This allows the user to see what Spark is doing at all times, at scales ranging from the entire run down to individual milliseconds. Furthermore, the customizability of Spark-Tracing allows users to seamlessly add instrumentation to their own jobs and third-party libraries.
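For a sense of what "highly configurable" means in spark-bench, its runs are driven by HOCON configuration files. The sketch below is illustrative only - the workload names and keys are assumptions, not copied from the project docs - but it shows the rough shape of a suite definition:

```
spark-bench = {
  spark-submit-config = [{
    workload-suites = [{
      descr = "One simulated notebook user running k-means repeatedly"
      benchmark-output = "console"
      workloads = [{
        name = "kmeans"
        input = "/tmp/kmeans-data.parquet"
        k = 5
      }]
    }]
  }]
}
```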
- Getting Started with Structured Streaming
Structured Streaming was introduced in Spark 2.0 as a streaming counterpart to Spark SQL's very popular DataFrame API. Streaming problems are challenging by nature; despite interest in exploring streaming applications, many Spark users experience slow adoption caused by a steep learning curve. The purpose of this talk is to provide a guided introduction to Structured Streaming - from understanding API internals to providing business context and real examples to help you get started. During this talk you will learn:

• Pain points of Spark Streaming with DStreams
• The Structured Streaming programming model
• API features and best practices
• Advantages of Structured Streaming versus other streaming engines
• Common public datasets for testing Structured Streaming
• How to get started, including a live demo with published code

Speaker bio:

Myles Baker is a Solutions Architect who helps large enterprises develop Apache Spark applications using Databricks. He specializes in streaming and machine learning. His work on image-processing software at NASA introduced him to distributed computing, and since then he has helped clients build data science models and applications at scale, spanning multiple industries. He received a B.S. in Applied Mathematics from Baylor University and an M.S. in Computer Science from the College of William and Mary.
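The Structured Streaming programming model the talk covers treats a stream as an unbounded table whose aggregates are updated incrementally with each micro-batch. A toy analogy in plain Python (the real pyspark version is commented out since it needs a SparkSession and a data source):

```python
# Miniature model of Structured Streaming: each micro-batch updates a
# running aggregate, here a word count over a "stream" of word batches.
from collections import Counter

running_counts = Counter()
micro_batches = [["spark", "kafka"], ["spark", "delta"]]

for batch in micro_batches:
    running_counts.update(batch)  # incremental update, as the engine does

print(dict(running_counts))  # {'spark': 2, 'kafka': 1, 'delta': 1}

# The real thing, against a socket source:
# lines = spark.readStream.format("socket") \
#     .option("host", "localhost").option("port", 9999).load()
# counts = lines.groupBy("value").count()
# counts.writeStream.outputMode("complete").format("console").start()
```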