Efficient Data Engineering with Apache Spark, Hive, and Alluxio on S3


Welcome to the first event of the Cloud, Data, & Orchestration Austin Meetup (https://www.meetup.com/Cloud-Data-Orchestration-Austin/)! This meetup will feature two talks and an opportunity to engage with other data engineers, developers, and Alluxio (www.alluxio.io/) users. Thanks to Bazaarvoice for hosting!

6:00pm: Happy Hour and networking
6:30pm: 1st talk - How to build a cloud native stack for analytics with Spark, Hive, and Alluxio on S3
7:00pm: 2nd talk - Getting started with Apache Spark and Alluxio for blazingly fast analytics
7:30pm: Q&A & Mingle

Talk 1: How to build a cloud native stack for analytics with Spark, Hive, and Alluxio on S3

At Bazaarvoice, a software-as-a-service digital marketing company, the data engineering team is tasked with handling data at massive Internet scale to serve over 1,900 of the biggest Internet retailers and brands.

We built our data pipelines entirely in the cloud using Apache Spark and Hive on AWS EC2, accessing data in S3. AWS enables us to scale “out” the infrastructure capacity effortlessly to keep up with Internet-scale data and web traffic, but scaling out also exposes certain limitations, such as a limited ability to further scale “up”. While this cloud native stack is scalable and elastic, we experience performance limitations because data access is limited by network bandwidth, and this is exacerbated for workloads that involve repeated queries.

To address the data access challenges, we leverage Alluxio, an open source data orchestration system for analytics in the cloud. Alluxio serves as a transparent caching layer for hot and warm data, so that Hive and Spark jobs can still access all data in S3. We have seen a 10x performance acceleration of Spark and Hive jobs on S3 with Alluxio.

In this talk, we will cover:
the challenges associated with building an efficient cloud native big data analytics platform on AWS
the open source technologies we work with, including Spark, Hive, and Alluxio
how we set them up
benchmark results of micro and real-world workloads

Tim Kelly
Tim Kelly is a Sr. Engineering Manager focused on data engineering at Bazaarvoice. He has extensive experience leading software teams and building high-quality software applications, services, and SDKs for consumer and enterprise.

Thai Bui
Thai Bui is a Sr. Staff Software Engineer at Bazaarvoice on the data engineering team. Thai is interested in scalable application development, big data, stats, and machine learning.

Talk 2: Getting started with Apache Spark and Alluxio for blazingly fast analytics

Apache Spark and Alluxio are cousin open source projects that originated from UC Berkeley’s AMPLab. Running Spark with Alluxio is a popular stack, particularly for hybrid environments. In this session, I will briefly introduce Apache Spark and Alluxio, share the top ten performance tuning tips for real-world workloads, and demo Alluxio with Spark.

Bin Fan, VP and founding engineer, Alluxio
Bin Fan is a founding member of Alluxio, Inc. and a PMC maintainer of the Alluxio open source project. Prior to Alluxio, he worked at Google building next-generation storage infrastructure and won Google's Technical Infrastructure award. Bin received his Ph.D. in CS from CMU.