June Hadoop Meetup: Dremel, Hive & Pig

Hosted By
Sebastian S.
Details

Dear HUG UK members,

I am pleased to announce our June meetup on Dremel, Hive and Pig.

This event will be at TechHub @ Campus.

Details below.

Sebastian

TIME:

Wednesday, June 5th 2013. Doors open 6:30pm.

Presentations 7:00pm – 8:30pm.

LOCATION:

TechHub @ Campus

5 Bonhill St, London, EC2A 4BX

AGENDA:

Session 1: Dremel: Interactive Analysis of Web-Scale Datasets

Speaker: Amanda Waite, Developer Advocate at Google

Abstract: Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this talk, Mandy Waite, Developer Advocate at Google, will describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing.

Short bio: Amanda is a Developer Advocate for the Google Cloud Platform. Before that, she worked in various key software engineering roles at Oracle, Sun Microsystems and Kodak.

Session 2: HCatalog

Speaker: Alan Gates

Abstract: HCatalog opens up Hive's metastore to tools inside the Hadoop system such as Pig and MapReduce and to external systems. This allows other Hadoop tools to view data through a table abstraction. It also opens up Hadoop's data to other data processing systems that are accustomed to viewing data in a tabular format. This talk will introduce HCatalog, discuss how it fits with Hive, Pig, and MapReduce, and discuss future plans for the system.

Short bio: Alan is a co-founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan also designed HCatalog and guided its adoption as an Apache Incubator project. Alan has a BS in Mathematics from Oregon State University and an MA in Theology from Fuller Theological Seminary. He is also the author of Programming Pig, a book from O’Reilly Press.

Session 3: ORC File - Improving Hive Data Storage

Speaker: Owen O'Malley

Abstract: Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition.

Short bio: Owen has been contributing to Apache Hadoop since before it was called Hadoop. He was the first committer added to the project and has provided technical leadership on MapReduce and security. Using Hadoop, in 2008 he set the world record for sorting a terabyte of data in 3.5 minutes, and in 2009 he sorted a petabyte in 16.25 hours. He was also the founding chair of the Apache Hadoop Project Management Committee. For the last year, he has been working on Hive. He has a PhD in Software Engineering from the University of California, Irvine. Owen may be followed on Twitter: @owen_omalley.

AI Users Group UK
Techhub @ Campus
4-5 Bonhill St · London EC2A 4BX