
What we’re about

Hive is a scalable data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism to impose structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, the language allows programmers who are familiar with the MapReduce framework to plug in their own custom mappers and reducers to perform more sophisticated analysis than the built-in capabilities of the language support.

The biggest Hive deployment to date is the silver cluster at Facebook Inc., which consists of 1100 nodes, each with 8 CPU cores and twelve 1 TB disks. That adds up to 8800 CPU cores and roughly 13 PB of raw storage.

Hive does not mandate that data be read or written in a "Hive format"---there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats. Please see File Format and SerDe in the Developer Guide for details.

http://wiki.apache.org/hadoop/Hive
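To give a flavor of what this looks like in practice, here is a minimal HiveQL sketch. The table name, column names, paths, and script name are hypothetical, chosen only for illustration; the syntax (CREATE EXTERNAL TABLE, delimited row formats, and TRANSFORM for plugging in custom MapReduce scripts) is standard HiveQL.

```sql
-- Hypothetical example: define a table over control-delimited files
-- already sitting in HDFS, without converting them to any special format.
CREATE EXTERNAL TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  page_url   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION '/data/page_views';

-- A familiar SQL-style query; Hive compiles it into MapReduce jobs.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;

-- Custom logic can be plugged in via TRANSFORM, which streams rows
-- through a user-supplied script (here, a hypothetical mapper).
SELECT TRANSFORM (user_id, page_url)
       USING 'python my_mapper.py'
       AS (user_id, category)
FROM page_views;
```

The EXTERNAL keyword is what lets Hive query data in place: dropping the table removes only the metadata, leaving the underlying files untouched.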