The challenge of on-line serving of massive batch-computed data sets. Technology evaluation


Details
The architecture of the SimilarWeb (http://www.similarweb.com) service includes calculating various statistics and information about the traffic of every site in the world. Information about each site has to be served very quickly, so some kind of key-value NoSQL store is needed to meet the high availability and scalability requirements. MapReduce jobs are used to compute this data periodically. The challenge is inserting billions of records into the NoSQL system. We will walk through the company's data flow in order to understand the context of the problem.
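For context, this is roughly what the straightforward approach looks like: every record produced by the batch job is written through the regular HBase client API. A minimal sketch in Java; the table name "site_stats", column family "t", and metric name are illustrative assumptions, not SimilarWeb's actual schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SiteStatsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("site_stats"))) {
            // One Put per site; at billions of sites this becomes billions of
            // random writes going through the region servers' memstores and WALs.
            Put put = new Put(Bytes.toBytes("example.com"));
            put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("monthly_visits"),
                          Bytes.toBytes(123456L));
            table.put(put);
        }
    }
}

At a few thousand rows this is perfectly fine; the meetup is about what happens when the same pattern has to absorb billions of rows on every batch run.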
In this meetup we are going to cover the following aspects of our experience at SimilarWeb:
• What are the problems with massive data inserts into HBase, and what can be done "out of the box"? (A bulk-load sketch follows this list.)
• Why we think it is inherently wrong to use massive random inserts for batch-computed data.
• Using HBase snapshots to partially solve the problem.
• The idea of offline index building and online serving, and the evaluation process for selecting a technology to support it.
• Evaluation of Project Voldemort: a bit of its architecture, its support for offline data building, and performance measurements. We are in the middle of the evaluation process and will disclose all of our findings.
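One "out of the box" answer to the first bullet (the sketch promised above) is HBase's bulk-load path: the batch job writes HFiles directly with HFileOutputFormat2 and then hands them to the region servers, bypassing the regular write path. A hedged sketch under the same assumptions as before (a "site_stats" table with column family "t", tab-separated job output, and a hypothetical SiteStatsMapper); it is not our actual pipeline.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SiteStatsBulkLoad {

    // Hypothetical mapper: parses lines like "example.com<TAB>123456" into Puts.
    public static class SiteStatsMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            byte[] row = Bytes.toBytes(fields[0]);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("monthly_visits"),
                    Bytes.toBytes(Long.parseLong(fields[1])));
            context.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Path input = new Path(args[0]);   // output of the stats-computing batch job
        Path hfiles = new Path(args[1]);  // staging directory for generated HFiles
        TableName tableName = TableName.valueOf("site_stats");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(tableName);
             RegionLocator locator = connection.getRegionLocator(tableName);
             Admin admin = connection.getAdmin()) {

            Job job = Job.getInstance(conf, "site_stats bulk load");
            job.setJarByClass(SiteStatsBulkLoad.class);
            job.setMapperClass(SiteStatsMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, hfiles);
            // Sets up total-order partitioning and one reducer per region of the
            // target table, so the job emits properly sorted HFiles.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            // Hands the finished HFiles to the region servers; nothing goes
            // through the memstore/WAL write path.
            new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, admin, table, locator);
        }
    }
}

This avoids random writes, but the data still lands in the same cluster that serves online traffic, which is part of why we are also looking at fully separating offline building from online serving.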
The meetup will be technical and is aimed at people who are interested in NoSQL technologies under heavy insertion load and the ways to handle it.
The content of this article overlaps heavily with our meetup, so we suggest reading it beforehand: Serving Large-scale Batch Computed Data with Project Voldemort (http://engineering.linkedin.com/voldemort/serving-large-scale-batch-computed-data-project-voldemort)
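The key idea in that article is that Voldemort read-only stores are built as index and data files by an offline Hadoop job and then swapped into the cluster atomically, so the online side only ever serves reads. A minimal reading-side sketch with the standard Voldemort client; the bootstrap URL and the store name "site-stats-readonly" are hypothetical.

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class SiteStatsReader {
    public static void main(String[] args) {
        // Bootstrap against a Voldemort node; the store itself was built offline.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://voldemort-host:6666"));
        StoreClient<String, String> client = factory.getStoreClient("site-stats-readonly");

        // The online path is reduced to simple gets against the swapped-in store.
        Versioned<String> stats = client.get("example.com");
        System.out.println(stats == null ? "no data" : stats.getValue());
    }
}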
If you have related questions, please post them here and I will try to cover them in the presentation.
