The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of a Hadoop cluster. With Impala, the Hadoop community now has an open-sourced codebase that helps users query data stored in HDFS and Apache HBase in real time, using familiar SQL syntax. In contrast with other SQL-on-Hadoop initiatives, Impala's operations are fast enough to do interactively on native Hadoop data rather than in long-running batch jobs. Now you have the freedom to discover relationships and explore what-if scenarios on Big Data datasets. By taking advantage of Hadoop's infrastructure, Impala lets you avoid traditional data warehouse obstacles like rigid schema design and the cost of expensive ETL jobs.
This talk starts out with an overview of Impala from the user's perspective, followed by a presentation of Impala's architecture and implementation. It concludes with a summary of Impala's benefits when compared with Apache Hive, commercial MapReduce alternatives, and traditional data warehouse infrastructure.
About the speaker: Mark Grover is a Software Developer at Cloudera and is involved in Impala's packaging, Apache Hive and Apache Bigtop. He is also a section author of O'Reilly's Programming Hive book. He has a degree in Computer Engineering from University of Toronto. He has written a few guest blog posts and presented at many conferences about technologies in the hadoop ecosystem.