
Spark Data Sources: overview of API & HBase data source from Huawei

Hosted by Andy Konwinski

Details

At this meetup we will host two technical talks about Spark Data Sources. One talk will be by Yan Zhou, an Architect on the Huawei Big Data team, about HBase as a Spark SQL Data Source. The other talk will be by Yin Huai, a Software Engineer at Databricks, about the Spark SQL Data Sources API. Find details about both talks below.

Schedule for the evening:
6:30 - 7:00 :: Mingling
7:00 - 8:15 :: Talks
8:15 - 9:00 :: Mingling

The talks will be live streamed, and the video will be published on the Apache Spark channel (https://www.youtube.com/user/TheApacheSpark) on YouTube.

--------------

Talk Title: HBase as data source to Spark SQL

Speaker: Yan Zhou. Architect, Huawei Big Data team

Abstract:
In this talk, we’ll discuss the technical design behind supporting HBase as a “native” data source for Spark SQL, aimed at performance and scalability for both queries and loads: near-precise execution locality for queries and loading, fine-tuned partition pruning, predicate pushdown, plan execution through coprocessors, and an optimized, fully parallelized bulk loader. Point and range queries on dimensional attributes benefit particularly from these techniques. Benchmark results against established SQL-on-HBase technologies will be presented. The speaker will also share future plans and real-world use cases, particularly in the telecom industry.

About Yan Zhou:
As lead architect on the Huawei Big Data team, Yan Zhou is responsible for the Spark open source project at Huawei and manages the architecture design of vertical solutions. During his tenure at Yahoo, Yan led the design and implementation of Yahoo’s distributed petabyte-scale SQL query engine (the Myna project), several significant enhancements to Apache Pig, and Apache Zebra on Yahoo’s big data platform. Prior to Yahoo, Yan was a senior principal engineer at Hyperion/Oracle. Yan has over 17 years of experience with Hadoop, distributed SQL, and BI. He is also an Apache Pig committer.

--------------

Talk Title: Data Source API in Spark SQL

Speaker: Yin Huai. Software Engineer, Databricks

Abstract:
The Spark SQL Data Source API is a convenient feature that enables users to easily connect to their data stored in different formats and systems with Spark SQL. Equipped with the Data Source API and SQL, users can start manipulating data with minimal setup and configuration. In this talk, I will first introduce the Data Source API and the unified save/load interfaces built on top of it, which significantly simplify the process of saving/loading data to and from various sources. Then, I will use an example to demonstrate how you can connect your own data source to Spark SQL through the Data Source API.

About Yin Huai:
Yin Huai is a Software Engineer at Databricks and mainly works on Spark SQL. Before joining Databricks, he was a PhD student at The Ohio State University and was advised by Xiaodong Zhang. His interests include storage systems, database systems, and query optimization. He is also an Apache Hive committer.

Bay Area Spark Meetup
Huawei Technologies (USA)
2330 Central Expressway · Santa Clara, CA