BlinkDB: Querying Petabytes of Data in Seconds

Data is being collected and stored at an exponentially growing rate, creating unprecedented demand for processing and analyzing massive datasets. Analysts and data scientists want fast results to support exploratory data analysis, and a growing number of applications require data to be processed in near real time.

In this talk, we present BlinkDB, which takes a radically different approach: queries are always processed in near real time, regardless of the size of the underlying dataset. This is made possible by operating on statistical samples of the underlying datasets rather than on all the data. More precisely, BlinkDB lets the user trade off the accuracy of the results against the time it takes to compute a query. The challenge is to ensure that query results remain meaningful even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that indicate the quality of the sampled results. To process the samples in parallel, we build on the Shark distributed query engine, which can compute tens of thousands of queries per second.
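To make the bootstrapping idea concrete, here is a minimal single-machine sketch in Python. It is not BlinkDB's implementation: the function name `bootstrap_mean_ci`, the synthetic data, and the 1% sampling rate are all assumptions chosen for illustration. It answers an average query from a uniform sample and uses a percentile bootstrap to attach a confidence interval to the estimate.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `sample`.

    Hypothetical helper for illustration only; not part of BlinkDB.
    """
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Each resample draws n values with replacement from the sample.
        # Resamples are independent, so this loop could run in parallel.
        resample = rng.choices(sample, k=n)
        means.append(statistics.fmean(resample))
    means.sort()
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(sample), means[lo_idx], means[hi_idx]


# Synthetic stand-in for a large table column (assumed data, not real).
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

# Answer the "average" query from a 1% uniform sample instead of a full scan.
sample = data_rng.sample(population, 2_000)
estimate, low, high = bootstrap_mean_ci(sample)
print(f"estimate={estimate:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The width of the interval is what lets a user decide whether the sampled answer is good enough or whether a larger sample, and therefore more time, is needed.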

This talk will feature an overview of the BlinkDB architecture and its design philosophy, followed by a brief tutorial on how the audience can use this new technology to gain insights in real time.


Please use this map for directions: https://www.google.com/maps/ms?msid=204121981109969386836.0004e86a9b1f2126abd3b&msa=0


The talk will be delivered by Sameer Agarwal.

Sameer Agarwal is a fifth-year Ph.D. student in the AMPLab working on large-scale approximate query processing frameworks with Ion Stoica. His research interests are at the intersection of distributed systems, databases and machine learning. In the recent past, he has worked on designing dynamic parallel query optimization frameworks, building snapshot managers and proactive file replication schemes for distributed file systems, and exploring the usefulness of stateless packet classification protocols in datacenters. He received his B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Guwahati and was awarded the President of India Gold Medal in 2009. He was supported by the Qualcomm Innovation Fellowship during[masked] and the Facebook Graduate Fellowship during 2013-14.



  • Burt P.

    Yes! Just registered for the Spark Summit!! Can't wait for December!
    Thanks for the discount code Reynold!

    October 31, 2013

  • Rahul C.

    The title should be changed to "Querying a 1% Sample of Petabytes of Data in Seconds, with Error Bars". Given the title, my expectations for this talk were really high.

    October 30, 2013

    • Sameer A.

      Rahul, point well taken. We went with the title because a central point of the talk was that there are only two ways to get sub-second response times on petabytes of data: either by using orders of magnitude more resources (assuming memory bandwidths of the order of 20GB/s, just scanning a petabyte of data in memory would require a cluster of 50,000+ machines) or by keeping pre-computed summaries. Sampling is just one such way of keeping a pre-computed summary that is "lossy" (i.e., you lose accuracy) but is extremely "general" (i.e., a sample can be used to answer an extremely wide set of queries).

      October 31, 2013

    • Rahul C.

      Thanks for the comment, Sameer. I appreciate you taking the feedback positively. I would love to see more features in BlinkDB, such as count(distinct). Using Count-Min Sketches for somewhat more complex statistical calculations would also be a great addition.

      October 31, 2013
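The cluster-size estimate in Sameer's reply above can be checked with a short back-of-the-envelope calculation. This sketch assumes a decimal petabyte (10^15 bytes), that each machine scans its share of the data at the quoted 20GB/s, and that the scan parallelizes perfectly; the function name `machines_needed` is made up for illustration.

```python
import math

PETABYTE_BYTES = 10**15                      # assumption: decimal petabyte
MEMORY_BANDWIDTH_BYTES_PER_S = 20 * 10**9    # 20GB/s per machine, as quoted


def machines_needed(data_bytes, bandwidth_bytes_per_s, target_seconds):
    """Machines required to scan `data_bytes` within `target_seconds`,
    assuming the data is split evenly and scanned fully in parallel."""
    bytes_per_machine = bandwidth_bytes_per_s * target_seconds
    return math.ceil(data_bytes / bytes_per_machine)


one_second = machines_needed(PETABYTE_BYTES, MEMORY_BANDWIDTH_BYTES_PER_S, 1.0)
half_second = machines_needed(PETABYTE_BYTES, MEMORY_BANDWIDTH_BYTES_PER_S, 0.5)
print(one_second, half_second)
```

Under these assumptions a one-second scan already needs 50,000 machines, so any sub-second target needs more, which is consistent with the "50,000+" figure.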

  • Khanderao

    Thanks, Sameer, for the presentation. We may need a follow-on session to go deeper from both the sampling and the architecture points of view.

    October 31, 2013

  • Chandrajith U.

    Thanks for the excellent presentation, and taking all the questions from the audience patiently, and on the spot. Very good interactive session.

    October 31, 2013

  • Arun P.

    Expected a deeper discussion.

    October 31, 2013

  • Arun P.

    I was hoping the talk would go much deeper into the topic, since the papers have been out for a while. However, this was merely an overview of and motivation for the subject.

    October 31, 2013

  • Burt P.

    Awesome talk!

    October 31, 2013

  • Sameer A.

    Here are the slides for yesterday's talk!: http://www.cs.berkeley.edu/~sameerag/BlinkDB-SparkMeetup.pptx

    October 31, 2013

  • Christopher N.

    Excellent work, great talk. For better statistics insight, it would be good to analogize to electoral polling, so the audience would intuitively "get" various concepts about sampling errors, how they are arrived at, why they might be independent of population size, etc.

    2 · October 30, 2013

    • Sameer A.

      Thanks Chris. This is a great suggestion!

      1 · October 31, 2013

  • Anthony Y.

    Forgot to take down the discount code for the summit. Can someone post it?

    October 30, 2013

  • Bikas S.

    Will there be a webcast available for remote folks? If not, will a recording be available later on?

    October 30, 2013
