Past Meetup

BDAM 07/19 - Self-service Data Integration, Kodiak Data, and Kubernetes!

144 people went

Details

Shoutout to Kodiak Data for kindly sponsoring this meetup!

Kodiak Data will also be giving away a Google Home! Enter the raffle on the day of the event for a chance to win.

AGENDA

6:00 - 6:30 - Socialize over food and beverages
6:30 - 8:00 - Talks

TALKS

Talk #1: Self-Service Data Integration Using Apache Spark: A Journey from Interactive Data Prep to Production-ready Pipelines, by Edwin Elia, Cask

Talk #2: Making Big Data Go Faster, by Morgan Littlewood, Kodiak Data

Talk #3: Data Pipelines in Kubernetes, by Sean Suchter, Pepperdata

ABSTRACTS

Talk #1: Self-Service Data Integration Using Apache Spark: A Journey from Interactive Data Prep to Production-ready Pipelines, by Edwin Elia, Cask

Enterprises increasingly need to ingest high volumes of data from a wide variety of structured and unstructured sources. Ingestion often includes steps to cleanse, transform, and prepare the data before landing it in a data lake. To do so, organizations are embracing self-service, allowing data engineers, data scientists, and citizen integrators to prepare and ingest data from many different sources. In this talk, we will cover how Cask approaches data preparation and data ingestion by providing self-service tools to integrate the data without compromising on strict enterprise guidelines around security and governance. We will also demonstrate the journey of a data engineer or data scientist in preparing and transforming data and building a production-grade data pipeline end-to-end, using Apache Spark, with just a few clicks.

Talk #2: Making Big Data Go Faster, by Morgan Littlewood, Kodiak Data

When developing and deploying complex, time-sensitive analytics applications, cluster resources must be configured and balanced. Multiple complex stacks may be needed, each with its own CPU, RAM, and disk capacity requirements. Storage and network performance are also critical factors in provisioning development and production clusters. How do you configure and size cluster nodes, especially when usage and data may be growing rapidly? Today, many data teams over-engineer their production clusters on ‘bare metal’ servers. However, Big Data infrastructure can be shared across many clusters. Kodiak Data software enables a ‘virtual cluster infrastructure’ (VCI) where ‘virtual clusters’ are isolated from each other and abstracted on the physical infrastructure. Cluster virtualization simplifies operations and significantly improves asset utilization. Applications and all popular Big Data stacks run unchanged, and orchestration software such as Kubernetes and Mesos can still be used.

Talk #3: Data Pipelines in Kubernetes, by Sean Suchter, Pepperdata

Kubernetes is a fast-growing open-source platform that provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest-moving projects on GitHub, with 1000+ contributors and 40,000+ commits. Kubernetes has first-class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.

Kubernetes is already used extensively to run stateless applications both on-premises and in the cloud. It is increasingly being used for stateful applications such as databases and message queues as well. An emerging use case is data processing workloads. Some of the open source community's efforts over the past few months have gone toward supporting this by enabling workloads like Spark and HDFS to run well on Kubernetes. In this talk, I cover the various parts of a containerized data processing pipeline in Kubernetes using an example, and talk briefly about trade-offs and performance considerations.

SPEAKER BIOS

• Edwin Elia is a Front End Engineer at Cask, building user interfaces to make interacting with data and applications simpler and faster. Previously, Edwin was a Business Systems Analyst for a pension fund in Michigan, where he worked on designing sales process workflows and a document management system.

• Morgan Littlewood is a founder at Kodiak Data, managing products and operations. Kodiak's software and MemCloud services provide a faster and more economical virtual infrastructure for data-intensive clusters. Previously, Morgan was a VP at Violin Memory (the industry's first flash storage arrays) and held senior product management roles at Cisco (MPLS and high-end routing).

• Sean Suchter is co-founder and CTO at Pepperdata. Before Pepperdata, Sean was the founding GM of Microsoft’s Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search. Prior to Microsoft, Sean managed the Yahoo Search Technology team, the first production user of Hadoop. Sean joined Yahoo through the acquisition of Inktomi, and holds a B.S. in Engineering and Applied Science from Caltech.

ARRIVAL AND PARKING

Cask HQ is a few minutes' walk from the California Avenue Caltrain Station.

Cask HQ also has its own parking lot, but it will not accommodate all guests. Please use parking lots available nearby: