
Incremental Processing in Hadoop

Details

Hadoop's execution model runs a job under the assumption that all of its input must be processed for the job to reach its intended goal. If the input is large (scaling to petabytes), this means acquiring a large number of map slots on the cluster. Since Hadoop clusters are typically shared among users, a job that requires many map slots can incur long delays, negatively impact other concurrent jobs, and reduce the throughput of the system. The assumption that all input is needed to produce the required result does not hold for all kinds of jobs, particularly exploratory analysis and approximate queries, where partial analysis is sufficient.

I interned at Facebook in summer 2010 and worked on supporting incremental processing in Hadoop. With incremental processing, a job can add input as and when required. It may begin as a small job, choosing to process only a limited subset of the data. As data flows through the runtime, useful statistics are evaluated that help decide whether additional input needs to be processed for the job to reach its intended goal. Such a mechanism is useful for jobs that can potentially produce the required result from partial analysis of the input. The mechanism is governed by (user-defined) policies that dictate the job's expansion in accordance with the existing load on the cluster.

I have been working on incremental processing and would like to share my work with you all. I have experimental results under both single- and multi-user workload scenarios, with varying input sizes and degrees of inherent skew in the data. I hope to get valuable feedback from the community.

Regards,
Raman
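To make the idea concrete, here is a minimal sketch of what policy-driven expansion could look like from a driver's point of view, using only stock MapReduce APIs: process one batch of input, read a job counter as the runtime statistic, and launch another batch only if a threshold policy says the partial result is not yet sufficient. The talk describes support inside the Hadoop runtime itself, where a single job grows its own input; this driver-side loop, along with the IncrementalDriver and MatchMapper classes, the "matches" counter, the batch arguments, and the fixed threshold, are illustrative assumptions rather than the actual implementation.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IncrementalDriver {

    // Counter the mapper increments for every record that contributes to the answer.
    static final String GROUP = "incremental";
    static final String MATCHES = "matches";

    // Map-only job: emit lines matching a stand-in predicate and count them.
    public static class MatchMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            if (value.toString().contains("ERROR")) {          // illustrative predicate
                ctx.getCounter(GROUP, MATCHES).increment(1);
                ctx.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // args[0..n-2]: input batches, smallest first; args[n-1]: output prefix.
        List<String> batches = Arrays.asList(args).subList(0, args.length - 1);
        String outputPrefix = args[args.length - 1];

        long threshold = 1_000_000L;   // policy: stop once enough matches are found
        long matchesSoFar = 0;

        for (int i = 0; i < batches.size() && matchesSoFar < threshold; i++) {
            Job job = Job.getInstance(conf, "incremental-batch-" + i);
            job.setJarByClass(IncrementalDriver.class);
            job.setMapperClass(MatchMapper.class);
            job.setNumReduceTasks(0);                  // map-only for brevity
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            // Process only the current batch; earlier batches were already covered.
            FileInputFormat.addInputPath(job, new Path(batches.get(i)));
            FileOutputFormat.setOutputPath(job, new Path(outputPrefix + "-" + i));

            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("batch " + i + " failed");
            }

            // Runtime statistics decide whether more input must be processed.
            matchesSoFar += job.getCounters().findCounter(GROUP, MATCHES).getValue();
            System.out.println("after batch " + i + ": " + matchesSoFar + " matches");
        }
    }
}

In the approach described above, a cluster-load-aware policy would sit where the fixed threshold check is, so the decision to expand the job would weigh both the gathered statistics and the current load on the shared cluster.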

Bay Area Hadoop Meetup