San Francisco Hadoop Users Message Board › Mar 2011 - schema and metadata management
A former member
* Evolvable schemas add complexity to data processing, but are necessary
* Good idea for file-based data: use a naming convention to decide where the schema goes (e.g., "foo.schema" corresponds to "foo.txt"). A file named just ".schema" can describe the schema that applies to all files in the directory.
* HBase: if cells can evolve independently, need to store a schema (or a pointer to one) for each cell.
** Each data cell has a companion schema-pointer cell that holds some id (MD5, counter, etc.) referencing the schema.
** Another table / column family holds the actual JSON schema text (for Avro); its rows are keyed by that id.
** HAvroBase provides functionality built in roughly this fashion, for storing Avro-encoded data in HBase
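
To make the two ideas above concrete, here is a rough Python sketch. Everything in it is illustrative and not from the talk: `find_schema_file`, `CellStore`, and the sample `User` schema are made-up names, the per-file schema is assumed to take precedence over the directory-wide ".schema", and plain dicts stand in for HBase tables. It does not use an HBase client, does no real Avro encoding, and is not HAvroBase's API.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Optional, Tuple


def find_schema_file(data_file: Path) -> Optional[Path]:
    """Locate a schema by naming convention (hypothetical helper)."""
    # Per-file schema: foo.txt -> foo.schema
    per_file = data_file.with_suffix(".schema")
    if per_file.is_file():
        return per_file
    # Directory-wide schema: a file named just ".schema"
    # (assumption: checked only when no per-file schema exists)
    per_dir = data_file.parent / ".schema"
    if per_dir.is_file():
        return per_dir
    return None


class CellStore:
    """In-memory stand-in for the data / schema-pointer / schema layout."""

    def __init__(self) -> None:
        # schema id -> JSON schema text (the separate table / column family)
        self.schemas: Dict[str, str] = {}
        # (row, column) -> encoded value (the data cell)
        self.cells: Dict[Tuple[str, str], bytes] = {}
        # (row, column) -> schema id (the companion schema-pointer cell)
        self.schema_ptrs: Dict[Tuple[str, str], str] = {}

    def put(self, row: str, column: str, value: bytes, schema: dict) -> str:
        # Canonicalize the JSON so equal schemas always hash to the same id.
        text = json.dumps(schema, sort_keys=True)
        schema_id = hashlib.md5(text.encode("utf-8")).hexdigest()
        self.schemas.setdefault(schema_id, text)  # schema text stored once
        self.cells[(row, column)] = value
        self.schema_ptrs[(row, column)] = schema_id
        return schema_id

    def get(self, row: str, column: str) -> Tuple[bytes, dict]:
        # Follow the pointer cell to the schema the value was written with.
        key = (row, column)
        schema_text = self.schemas[self.schema_ptrs[key]]
        return self.cells[key], json.loads(schema_text)


if __name__ == "__main__":
    v1 = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"}]}
    v2 = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"},
                     {"name": "age", "type": "int", "default": 0}]}
    store = CellStore()
    # The byte strings are placeholders for Avro-encoded values.
    store.put("row1", "profile", b"<encoded with v1>", v1)
    store.put("row2", "profile", b"<encoded with v2>", v2)
    _, schema = store.get("row1", "profile")
    print(len(schema["fields"]), len(store.schemas))
```

Using a content hash such as MD5 as the id means identical schemas collapse to one stored row without any coordination; a counter would give shorter ids but needs a central allocator.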