It’s Two O’clock in the Morning: Do You Know Where Your Petabytes Are?
Robert Chansler, Engineering Manager, Grid Computing, Yahoo!
The Hadoop Distributed File System at Yahoo! stores 25 petabytes across 25 thousand nodes. Being a good custodian of this much data requires continuous surveillance and management to ensure its integrity and durability. Importantly, the most conventional strategy for data protection—just make a copy somewhere else—is not practical for data sets this large. HDFS must continuously manage the number of replicas for each block, test the integrity of blocks, balance the usage of resources as the hardware infrastructure changes, report status to administrators, and be on guard for the unexpected.
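The per-block replica maintenance described above can be sketched roughly as follows. This is a minimal illustration of the decision logic, not actual HDFS NameNode code; all names, the default replication factor of 3, and the action labels are assumptions for the example.

```python
# Hypothetical sketch of block replica management, loosely modeled on
# the tasks described in the abstract. Not real HDFS internals.

TARGET_REPLICAS = 3  # HDFS commonly defaults to three replicas per block

def check_block(block_id, live_replicas, corrupt_replicas):
    """Decide what maintenance action a block needs.

    live_replicas: nodes currently reporting a copy of the block
    corrupt_replicas: nodes whose copy failed an integrity check
    """
    healthy = [node for node in live_replicas if node not in corrupt_replicas]
    if not healthy:
        return ("missing", block_id)        # no good copy left: alert operators
    if len(healthy) < TARGET_REPLICAS:
        return ("replicate", block_id)      # schedule re-replication elsewhere
    if len(healthy) > TARGET_REPLICAS:
        return ("delete_excess", block_id)  # reclaim space on some node
    return ("ok", block_id)                 # nothing to do

# Example: a block with one corrupt copy out of three needs re-replication.
print(check_block(42, ["nodeA", "nodeB", "nodeC"], {"nodeC"}))
```

A real system must also rate-limit re-replication and prefer placing new copies on lightly loaded racks, which is part of the resource-balancing work the abstract mentions.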
SPEAKER BIO: Rob Chansler is a veteran of the Cm* and C.mmp projects at CMU. After finishing graduate studies, he built compilers in Pittsburgh at Tartan Labs. At Adobe Systems, Rob joined the core PostScript group to develop products for high-end and specialty systems. Rob joined the dot-com world just in time to experience the crash before moving to McDATA to do management software for storage area networks. Now at Yahoo!, Rob manages development for the Hadoop Distributed File System.