Past Meetup

Big Data Application Meetup 01/27

165 people went

Details

Shout out to Ampool (http://www.ampool.io/) for kindly sponsoring this meetup!

AGENDA

6:00 - 6:30 - Socialize over food and beer(s)

6:30 - 8:00 - Talks

TALKS

Talk #1: Simplifying big data analytics with Apache Kudu, by Mike Percy, Cloudera

Talk #2: SQL-on-Everything with Apache Drill, by Julien Le Dem, Dremio

Talk #3: Apache Phoenix: OLTP in Hadoop, by James Taylor, Salesforce.com

ABSTRACTS

Talk #1: Simplifying big data analytics with Apache Kudu,

by Mike Percy, Cloudera

The Hadoop ecosystem has been making great strides in recent years. With systems such as Apache HBase and Apache Cassandra, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. However, the scan performance of these systems is not optimal.

On the other end of the spectrum, columnar file formats such as Apache Parquet and Hive ORCFile are designed for very fast scan rates, offering great performance benefits to many SQL and analytics applications. Unfortunately, there is little to no ability for real-time modification or row-by-row indexed access when using these file formats.

Kudu was designed from the ground up to address this gap. Kudu offers real-time random read / write access to records, while also storing data in a columnar format, providing both exceptional scan performance and competitive random access performance, combining many of the benefits of the above systems and formats. This talk will discuss how Kudu can be used as a single storage system to greatly simplify analytical big data applications.

Talk #2: SQL-on-Everything with Apache Drill,

by Julien Le Dem, Dremio

In recent years, the rise of modern, non-relational datastores such as NoSQL databases, Hadoop and cloud storage has made it easier for developers to build and scale applications. However, these datastores make it harder for business users and analysts to analyze the data. In many cases, data engineers must develop complex ETL pipelines, loading the data into a centralized relational database or SQL-on-Hadoop environment.

Apache Drill is an open source, in-memory, columnar SQL execution engine. It enables users and BI tools to execute large-scale, interactive SQL queries against one or more datastores. Drill supports NoSQL databases (e.g., MongoDB, HBase, Kudu), search (e.g., Elasticsearch, Solr), file systems (e.g., HDFS, NAS), cloud storage (e.g., Amazon S3, Azure Blob Storage) and relational databases (e.g., MySQL, Oracle). Users can run queries on a single system or join data between multiple systems. For example, a user can join log files in Elasticsearch with user profiles in MySQL or even an Excel spreadsheet.

In this talk we provide an overview of Apache Drill, and explain how to use it to query data in one or more datastores, with a particular emphasis on modern, non-relational datastores.

Talk #3: Apache Phoenix: OLTP in Hadoop,

by James Taylor, Salesforce.com

This talk will examine how Apache Phoenix, a top-level Apache project, differentiates itself from other SQL solutions in the Hadoop ecosystem. It will start by exploring some of the fundamental concepts in Phoenix that lead to dramatically better performance and explain how this enables support for features such as secondary indexing, joins, and multi-tenancy. Next, it will give an overview of ACID transactions, a new feature available in our 4.7.0 release, along with an outline of the integration we did with Tephra to enable this new capability. This will include a demo showing how Phoenix can be used seamlessly in CDAP. The talk will conclude with a discussion of some in-flight work to move on top of Apache Calcite to improve query optimization, broaden our SQL support, and provide better interop with other projects such as Drill, Hive, Kylin, and Samza.

SPEAKER BIOS

• Mike Percy is a software engineer at Cloudera who has been working since 2013 on Kudu, an open source distributed column store for the Hadoop ecosystem that has recently joined the Apache Incubator. He is also a committer and PMC member on Apache Flume. Prior to joining Cloudera in 2012, Mike worked at Yahoo! building machine learning infrastructure for big data. Mike holds an MSCS from Stanford University and a BSCS from UC Santa Cruz.

• Julien Le Dem is the co-author of Apache Parquet and the PMC Chair of the project. He is also a committer and PMC Member on Apache Pig. Julien is an architect at Dremio, and was previously the Tech Lead for Twitter’s Data Processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a Principal engineer and tech lead working on Content Platforms at Yahoo! where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

• James Taylor is an architect at Salesforce in the Data Platform and Services Cloud. He leads the Apache Phoenix project, an OLTP database for Hadoop, and is a PMC member of Apache Calcite and the Apache Incubator. Prior to working at Salesforce, James worked at BEA Systems on projects such as federated query processing systems and event driven programming platforms and has worked at various other start-ups in the computer industry over the past 20 years.

ARRIVAL AND PARKING

Cask HQ is a few minutes' walk from the California Avenue Caltrain Station.

Also, Cask HQ has its own parking lot, but it will certainly not accommodate all guests. Please use parking lots available nearby: