Metadata in Hadoop - Apache HCatalog with Hive, Pig, and MapReduce

I will walk through one of the hottest but often overlooked Apache Incubator projects: HCatalog. HCatalog is a metadata management tool that stores consistent schema information for Pig, Hive, MapReduce, and sometimes third-party systems outside Hadoop that need to integrate with it.

http://incubator.apache.org/hcatalog/

The premise with Hadoop is that you land data without a schema and apply a schema only on read. The schema is never imposed permanently; rather, it is virtual metadata for use in structured queries (like SQL via Hive), and the unstructured data need not be changed. HCatalog allows you to create, edit, and expose (via a REST API) this metadata, that is, the table definitions. This technology will also be the seat of data lineage tracking and some authorization and authentication.
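
To make the schema-on-read idea concrete, here is a minimal sketch in plain Python. It does not use HCatalog or any Hadoop API; the names RAW_DATA, SCHEMA, and read_with_schema are hypothetical and exist only for illustration. The raw text stands in for a file landed in HDFS, and the schema stands in for a table definition kept in a catalog.

```python
import csv
import io

# Raw data as it might land in storage: tab-separated text with no types attached.
RAW_DATA = "1\talice\t34.5\n2\tbob\t12.0\n"

# Hypothetical table definition: column names and types kept apart from the data.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Yield one dict per row, casting each field only at read time."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    for row in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# The same bytes can be read through a different schema without rewriting them.
ALL_TEXT_SCHEMA = [("id", str), ("name", str), ("amount", str)]

rows = list(read_with_schema(RAW_DATA, SCHEMA))
text_rows = list(read_with_schema(RAW_DATA, ALL_TEXT_SCHEMA))
```

The point of the sketch is that the schema lives beside the data rather than inside it, so several readers can share one definition, or apply different ones, while the stored bytes stay untouched.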

I will give a brief presentation and a demonstration (with code), and provide some materials for anyone to try it out for themselves. The VM that we will work from will be the HDP 1.2 sandbox, as it has HCatalog installed and a HUE graphical interface to help us work with it:

http://hortonworks.com/products/hortonworks-sandbox/

As always, please use whatever distro of Hadoop you like. HCatalog 0.4 works best with Hadoop 1.1.2.21.

 

https://bentomiso.com/about

HOW TO FIND US

We're at 862 Richmond Street West, about a block west of Walnut and south of Queen and Trinity-Bellwoods Park.

TTC: get off of the King or Queen streetcar at Strachan.

Driving: there's a Green P at Walnut, and some street parking. Note that Richmond runs one-way, west, from Niagara to Strachan.

 


  • David T.

    Thanks Adam,

    The Sandbox provides a great set of tools for getting exposure.

    I'm wondering if there are any blogs where Hadoop experts describe how they approach and work with a new data set to tease out insights such as correlations, trends, etc.?

    I realize that this is very open-ended, but I'm comparing this to my own experience working with SQL over tables of manufacturing production data to compare things like order consumption versus inventory levels and capacities. Over time, when presented with any oddball concern, I simply had a good sense of what data I could tie together that might pop out a reasonable answer.

    Hive is reasonably close to SQL (and getting better), but it still leaves me with an initial feeling of walking with lead feet. A large part of that is probably because any query is slow (you'd talked last night about interactive SQL in future releases), so it isn't possible (?) to quickly peek at small samples of data.

    February 21, 2013

    • Adam M.

      Sometimes I sample the data and use Google Refine or R. As I get a better sense of what the data source is, I tend to just run common statistical sampling and clustering in Hadoop. That works only if you are used to the toolsets and have a cluster to work with.

      1 · February 21, 2013

    • David T.

      Thanks. I'll have to learn about the tools, but that addresses my concern.

      February 21, 2013

  • Guillaume D.

    thanks again Adam for last night's presentation, it was great!

    February 21, 2013

  • Adam M.

    Link for the slides used in this meetup:
    http://www.slideshare.net/adammuise/2013-feb-20thughcatalog-16671330

    HDP 1.2 Tutorials can be found on the Sandbox:
    http://hortonworks.com/products/hortonworks-sandbox/

    February 21, 2013

    • Hardik

      Awesome, thanks!!!

      February 21, 2013

  • Tri N.

    Excellent presentation. Adam was friendly and patient.

    February 21, 2013

  • Hardik

    Adam, can you please post yesterday's presentation? Thanks

    February 21, 2013

  • Hardik

    As expected, a great session from Adam. Looking forward to the next meetup in March.

    February 21, 2013

  • Boris R.

    Excellent session - thanks to Adam.

    February 21, 2013

  • Adam M.

    This is shaping up to be a good number for hands-on work and discussion. Heading over very soon.

    February 20, 2013

  • Rajiv A.

    Sorry, I am not feeling well

    February 20, 2013

  • Pankaj T.

    Sorry. I would have loved to join you, but another meetup clashes with it.

    February 20, 2013

  • Ashish B.

    Apologies

    February 20, 2013

  • A former member

    Sorry, I'm out of the country at this time.

    February 20, 2013

  • Venkat M.

    Sorry for missing the meetup.

    February 20, 2013

  • Barry S.

    Car is broken :(

    February 20, 2013

  • Ron M.

    Ugh.. Sorry can't make it!

    February 20, 2013

  • A former member

    Hope I've freed up space for others! Have an event clash that day.

    February 14, 2013

  • Adam M.

    This one should be straightforward. I'm going to give a presentation and then we will break off for some hands-on work based on a demo I will provide. We might need some help getting everyone's VM images sorted out, but hopefully that's it. Let's touch base at the social.

    February 4, 2013

  • David T.

    Adam,

    In regards to volunteering, I'm planning to come to the Feb 4 Social; perhaps we could talk then in terms of what type of help is needed?

    January 28, 2013

34 went

Our Sponsors

  • IBM

    Meeting facilities, expert speakers, free product, books and education.

  • Big Data University

    Free on-line courses in Hadoop and big data related technologies.

  • Cloudera

    10% off training for Toronto Hadoop User Group members.

  • Hortonworks

    Food, speakers, beverages

  • T4G

    Hosting Meeting locations and providing relevant speakers
