
An Introduction to Impala – Low-Latency Queries for Apache Hadoop

The Cloudera Impala project is, for the first time, making scalable parallel database technology, the underpinning of Google's Dremel as well as of commercial analytic DBMSs, available to the Hadoop community.

With Impala, the Hadoop community now has an open-source codebase that allows users to issue low-latency queries against data stored in HDFS and Apache HBase using familiar SQL operators.
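To illustrate the "familiar SQL operators" point, an Impala query is ordinary SQL issued against tables whose data lives in HDFS or HBase. A hedged sketch follows; the table and column names (`access_logs`, `url`, `log_date`) are hypothetical, not from the talk:

```sql
-- Hypothetical: top ten URLs from web logs stored in HDFS.
SELECT url, COUNT(*) AS hits
FROM access_logs
WHERE log_date >= '2013-05-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Because Impala executes queries with its own daemons rather than compiling them to MapReduce jobs, a query like this returns interactively rather than as a batch job.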

Matt Harris is a Systems Engineer at Cloudera where he supports organizations in their understanding and adoption of Hadoop. Prior to Cloudera, Matt was a Systems Engineer at Composite Software and a SCADA Engineer at Peoples Energy. Matt has an MS in Computer Science from DePaul University and a BS in Mechanical Engineering from Purdue University.

(I'm super excited about this talk! - matthew)


  • Thomas C. M.

    Huge thanks to Pitt Fagan for helping out with hosting the meeting. Matthew Rathbone was out of town and I had another commitment that night, so thanks for helping out, Pitt!

    May 29, 2013

  • Brad L.

    Interesting project.

    May 28, 2013

  • Matt H.

    - What is meant by "fully shredded" nested data?
    The Parquet GitHub site provides great detail on this topic, as well as an overview of the goals of Parquet.

    May 28, 2013

  • Matt H.

    - Differences between Parquet format and RCFile:
    RC Files use column compression, which results in better compression rates than block compression. However, in normal use an RC File stores values as strings and then compresses them. RC Files carry more metadata than sequence files because an RC File records the number of columns in the file, and this number is fixed for the whole file. Note: no type information is stored in the metadata, just the number of columns; type information is stored as part of the column itself.

    Parquet supports the same self-documenting metadata offered by Avro: every file is self-describing and data is stored in its native format; however, column compression is used instead of block compression. Unlike RC Files, Parquet stores its types as native types, allowing for better compression. Of all the file formats, Parquet will be the best for both speed and compression.

    1 · May 28, 2013
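The compression comparison above can be demonstrated with a small, self-contained sketch. This is illustrative only (it uses zlib on plain strings, not the actual RC File or Parquet formats, and the record layout is made up): grouping a column's values together gives the compressor longer repeated runs to exploit than interleaving them row by row.

```python
import zlib

# Hypothetical records: (user_id, country, status), where country and
# status are low-cardinality columns -- typical of analytic data.
rows = [(i, "US" if i % 3 else "DE", "active") for i in range(10_000)]

# Row-oriented serialization: values from different columns interleaved.
row_bytes = "\n".join(f"{i}\t{c}\t{s}" for i, c, s in rows).encode()

# Column-oriented serialization: each column's values stored contiguously.
cols = list(zip(*rows))
col_bytes = b"\n".join("\n".join(map(str, col)).encode() for col in cols)

row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))
print(f"row-oriented compressed:    {row_size} bytes")
print(f"column-oriented compressed: {col_size} bytes")
```

On data like this, the column-oriented layout compresses noticeably smaller, since the repeated `country` and `status` values collapse to almost nothing once they are adjacent.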

  • Matt H.

    Thanks for allowing me the opportunity to present this evening. It was great meeting all of you, and I appreciate all of your questions. There were a couple of outstanding questions from the discussion I wanted to respond to:

    - Link to the Cloudera blog post including information on concurrent queries:

    More to follow...

    1 · May 28, 2013

  • Luke F.

    Impala is a very new technology and so it was great to have an expert take our questions.

    May 28, 2013

  • Lou I.

    Sorry ... Last minute conflict

    May 28, 2013

  • Kyle N.

    Can't make it tonight - sorry for the late notice.

    May 28, 2013

  • Luke F.

    Sounds great

    May 7, 2013

  • Matt H.

    Can't wait to see this in action

    May 6, 2013

40 went

Our Sponsors

  • Cloudera

    Cloudera is the general sponsor of Big Data Madison.

  • Hortonworks

    Hortonworks is sponsoring a round of after-meetup drinks.

