Hadoop's execution model runs a job under the assumption that *all* of its input must be processed for the job to reach its intended goal. If the input is large (scaling to petabytes), this translates into a need to acquire a large number of map slots on the cluster. As Hadoop clusters are typically shared among users, a job that requires many map slots can incur long delays, negatively impact other concurrent jobs, and reduce the overall throughput of the system. The assumption that all input is required to produce the result does not hold for all kinds of jobs — in particular, exploratory analysis and approximate queries, where partial analysis suffices.

I interned at Facebook in summer 2010 and worked on supporting incremental processing in Hadoop. With incremental processing, a job can add input as and when required. It may begin as a small job, choosing to process only a limited subset of the data. As data flows through the runtime, useful statistics are computed that help decide whether additional input must be processed for the job to reach its intended goal. Such a mechanism is useful for jobs that can produce the required result from partial analysis of the input. The mechanism is governed by (user-defined) policies that dictate the job's expansion in accordance with the existing load on the cluster.

I have been working on incremental processing and would like to share my work with you all. I have experimental results under both single-user and multi-user workload scenarios, with varying input sizes and degrees of inherent skew in the data. I hope to get valuable feedback from the community.

Regards,
Raman
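To make the idea concrete, here is a minimal sketch (in Python rather than Hadoop's Java API, and with illustrative names — `incremental_mean`, `rel_error`, `max_splits` are assumptions, not part of any Hadoop interface) of the core loop: process input splits one at a time, maintain running statistics, and stop either when the statistics say the answer is good enough or when a load-aware policy caps further expansion.

```python
import math
import random

def incremental_mean(splits, rel_error=0.02, z=1.96, max_splits=None):
    """Estimate the mean of a large dataset by consuming splits incrementally.

    Maintains a running mean and variance (Welford's algorithm). After each
    split, checks whether the z-confidence half-width has fallen below
    rel_error * |mean|; if so, the job stops adding input. max_splits models
    a (user-defined) expansion policy reflecting cluster load.
    Returns (estimate, number_of_splits_consumed).
    """
    n, mean, m2 = 0, 0.0, 0.0
    used = 0
    for split in splits:
        for x in split:              # "map" over one more input split
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        used += 1
        if n > 1:
            stderr = math.sqrt(m2 / (n - 1) / n)
            if z * stderr <= rel_error * abs(mean):
                break                # partial analysis already suffices
        if max_splits is not None and used >= max_splits:
            break                    # policy: cluster load caps expansion
    return mean, used

# Example: 100 splits of 1,000 values each; the estimate typically
# converges after consuming only a small fraction of the splits.
random.seed(0)
data = [[random.gauss(50.0, 10.0) for _ in range(1000)] for _ in range(100)]
est, splits_used = incremental_mean(data)
```

The same shape generalizes to other approximate queries: the per-split work is the map phase, and the stopping test is the runtime statistic that decides whether the job needs to grow.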