Past Meetup

Special HUG event at LinkedIn

300 people went

Join us at LinkedIn for a special Hadoop Users Group event showcasing two new Apache Incubator projects: Tajo and Samza. Tajo, a low-latency SQL query engine, and Samza, a distributed, reliable stream processing framework, are both built atop Apache Hadoop YARN.

Doors open at 6:00, with socializing until 6:30. Pizza and beverages will be provided.
Look for the signs to the Unite presentation hall in building 2025.

6:00 Doors open
6:30 Apache Tajo: A Big Data Warehouse on Hadoop
7:20 Apache Samza: Reliable Stream Processing with YARN and Kafka

Overview of the talks:

Apache Tajo: A Big Data Warehouse on Hadoop
Tajo is designed for low-latency, scalable ad-hoc queries and ETL on large-scale data sets. It takes advantage of both advanced database techniques and MapReduce without sharing their shortcomings. It uses HDFS as its primary storage and its own distributed query execution engine instead of MapReduce.

This talk introduces the Tajo project and its internal architecture. Tajo supports ANSI SQL and user-defined functions. It has a cost-based join optimizer and an extensible query rewrite engine to find better plans. For distributed processing, it uses a DAG-based execution framework with various shuffle methods, such as range and hash. For scheduling, Tajo is designed to take the disk volumes of each node into account, which significantly improves scan throughput. We will also cover the additional planning opportunities opened up by combining various physical operators and shuffle methods. Next, we will present Tajo's roadmap. Finally, as a case study, Jeong-shik will share some findings on Tajo's performance in an ongoing project with SKT, a telco in Korea.

Apache Samza: Reliable Stream Processing with YARN and Kafka
LinkedIn has recently open sourced Samza, its stream processing framework built on top of Apache Hadoop and YARN. Samza provides the ability to process infinite streams of data. It has a simple API, provides state management for tasks, ensures fault tolerance, and is highly pluggable.

Speaker bios:

Hyunsik Choi, Ph.D., is a committer and PPMC member of Apache Tajo. He is a director of research at Gruter, a big data company located in South Korea, and has contributed to Tajo's query plan optimizer and its vectorized query engine, which takes advantage of modern hardware. Recently, he has been interested in runtime query compilation techniques using LLVM and modern hardware features.

Jeong-shik Jang is a VP at Gruter. He previously worked at Yahoo for five and a half years in Asia Search Engineering. Since joining Gruter three years ago, he has enjoyed a variety of exciting roles as a project manager, a hands-on engineer, and a decision maker for management.

Chris Riccomini is a staff software engineer at LinkedIn, where he has been the principal developer of Samza, contributed to the Hadoop ecosystem, worked on LinkedIn's RPC system, and built the internal analytics/reporting tool.