We have three great talks from visiting contributors/committers who are in town for ApacheCon.
Chris Fregly (IBM)
Real-time, Advanced Analytics and Recommendations using ML, Graph Processing, NLP, and Approximations (featuring Apache Spark, Stanford CoreNLP, and Twitter Algebird)
Starting with a live, interactive demo generating audience-specific recommendations, we'll dive deep into each of the key components, including NiFi, Kafka, Stanford CoreNLP, Docker, Word2Vec, LDA, Twitter Algebird, Spark Streaming, SQL, ML, and GraphX. As a bonus, we'll discuss the latest Netflix Recommendations Pipeline and related open source projects.
Mike Percy (Cloudera) & Dan Burkert (Cloudera)
Kudu and Spark for Fast Analytics on Streaming Data
Apache Kudu (incubating) is a new storage engine for the Apache Hadoop ecosystem that enables extremely high-speed analytics without imposing data-visibility latencies. Using Apache Spark and Kudu, we show that it is now easy to create applications that query and analyze mutable, constantly changing datasets while getting the impressive query performance that you would normally expect from immutable columnar data formats like Apache Parquet and ORCFile. Kudu delivers this with a fault-tolerant, Spanner-like distributed architecture and a columnar on-disk storage format. This talk provides an introduction to Kudu and demonstrates using Spark and Kudu together to achieve impressive results in a system that is friendly to both app developers and operations engineers.
Xuefu Zhang (Uber)
Hive on Spark, an Uber Use Case
Now that Hive on Spark is mature and production-ready, the Hive community has seen exciting user adoption. Uber recently built its Hadoop-based data lake, where Hive is used extensively to support ETL, BI, and analytics workloads. As both the data size and the user base grow, a faster Hive for the same set of workloads is desired. Hive on Spark has demonstrated great potential to meet this need. This talk shares Uber's experience with Hive and Hive on Spark.
Big thanks to Cloudera for sponsoring the evening.
6:30 Chris Fregly
7:00 Mike Percy & Dan Burkert
7:30 Xuefu Zhang
8:00 Networking and wrap
Chris Fregly is a Principal Data Solutions Engineer for the newly formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Mike Percy is a software engineer at Cloudera and a committer on Apache Kudu (incubating). Prior to joining Cloudera, Mike worked on big data infrastructure for machine learning at Yahoo! Mike holds a BSCS from UC Santa Cruz and an MSCS from Stanford.
Dan Burkert is a software engineer at Cloudera and a committer on Apache Kudu (incubating). Prior to joining Cloudera, Dan worked on data processing pipelines for machine learning, search, and analytics. Dan received his bachelor’s degree from the University of Virginia.
Xuefu Zhang has over 10 years of experience in software development. Earlier this year he joined Uber as a software engineer from Cloudera, where he focused mainly on Apache Hive and Pig. He also worked on the Hadoop team at Yahoo when the majority of Hadoop development was still happening there. In addition, he spent his early career at Informatica, gaining important experience in enterprise software development, especially ETL and data warehousing. Xuefu is an Apache member, a PMC member for Pig, Hive, and Sentry, and the tech lead for both the Hive on Spark and Pig on Spark projects.