I will walk through one of the hottest but often overlooked Apache Incubator projects: HCatalog. HCatalog is a metadata management tool that stores consistent schema information for Pig, Hive, and MapReduce, and sometimes for third-party systems outside Hadoop that need to integrate with it.
The premise with Hadoop is that you land data without a schema and apply a schema only on read. The schema is never imposed permanently; rather, it is virtual metadata for use in structured queries (like SQL via Hive), and the unstructured data need not be changed. HCatalog allows you to create, edit, and expose (via a REST API) this metadata, that is, the table definitions. This technology will also be the seat of data lineage tracking and of some authorization and authentication.
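To make the schema-on-read idea concrete, here is a minimal sketch in plain Python. It does not use HCatalog or Hadoop at all; the sample data, the column names, and the read_with_schema helper are all made up for illustration. The point is only that the table definition lives apart from the raw data and is applied at read time.

```python
import csv
import io

# Raw data as it was landed: no schema is stored with it.
# (Made-up sample records, for illustration only.)
RAW = "1,alice,34.5\n2,bob,17.0\n"

# Hypothetical table definition kept separately from the data.
# This is the kind of metadata HCatalog manages for Pig, Hive, and MapReduce.
SCHEMA = [("id", int), ("name", str), ("score", float)]


def read_with_schema(raw, schema):
    """Attach column names and types while reading; the raw text is not modified."""
    for row in csv.reader(io.StringIO(raw)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


rows = list(read_with_schema(RAW, SCHEMA))
for row in rows:
    print(row)
```

A different reader could apply a different SCHEMA to the same RAW text without rewriting it, which is what lets several tools share one set of files as long as they agree on the table definition.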
I will give a brief presentation and a demonstration (with code), and provide some materials so anyone can try it out for themselves. The VM that we will work from will be the HDP 1.2 sandbox, as it has HCatalog installed and a HUE graphical interface to help us work with it.
As always, please use whatever distro of Hadoop you like. HCatalog 0.4 works best with Hadoop 188.8.131.52.
HOW TO FIND US
We're at 862 Richmond Street West, about a block west of Walnut and south of Queen and Trinity-Bellwoods Park.
TTC: get off of the King or Queen streetcar at Strachan.
Driving: there's a Green P at Walnut, and some street parking. Note that Richmond runs one-way, west, from Niagara to Strachan.