BlinkDB: Querying Petabytes of Data in Seconds

Name: BlinkDB: Querying Petabytes of Data in Seconds
Start: 2013-10-30T18:30:00-07:00
End: 2013-10-30T21:30:00-07:00
Location: Intel Corporation

Hosted by Reynold X. and Andy K.

Bay Area Spark Meetup

Details

The volume of data being collected and stored is growing exponentially, creating unprecedented demand for processing and analyzing massive datasets. At the same time, analysts and data scientists want fast results to support exploratory data analysis, and more and more applications require data processing in near real time.

In this talk, we present BlinkDB, which takes a radically different approach: queries are always processed in near real time, regardless of the size of the underlying dataset. It does this by operating on statistical samples of the underlying datasets rather than looking at all the data. More precisely, BlinkDB lets the user trade off the accuracy of the results against the time it takes to compute queries. The challenge is to ensure that query results are still meaningful even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that indicate the quality of the sampled results. To compute the sampled data in parallel, we build on the Shark distributed query engine, which can compute tens of thousands of queries per second.
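To make the bootstrapping idea concrete, here is a minimal sketch in plain Python of a percentile bootstrap on a uniform sample. It is an illustration only, not BlinkDB's implementation or API: the function name `bootstrap_ci`, the synthetic dataset, the 1% sampling rate, and the resample count are all assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_ci(sample, statistic=statistics.fmean, n_resamples=1000,
                 confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for `statistic` on `sample`.

    Hypothetical helper for illustration; not part of BlinkDB.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Each resample draws n values from the sample with replacement.
    # Resamples are independent, so this loop could be run in parallel.
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = estimates[int(alpha * n_resamples)]
    high = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistic(sample), low, high


# Synthetic stand-in for a large table column (assumed data).
rng = random.Random(42)
population = [rng.gauss(100.0, 15.0) for _ in range(100_000)]

# Only a 1% uniform sample is "queried" instead of the full dataset.
sample = rng.sample(population, 1_000)

estimate, low, high = bootstrap_ci(sample)
print(f"approximate mean = {estimate:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

A larger sample narrows the interval but takes longer to scan, which is the accuracy-versus-time trade-off described above.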

This talk will feature an overview of the BlinkDB architecture and its design philosophy, followed by a brief tutorial on how the audience can leverage this new technology to gain insights in real time.

Please use this map for directions: https://www.google.com/maps/ms?msid=204121981109969386836.0004e86a9b1f2126abd3b&msa=0

The talk will be delivered by Sameer Agarwal.

Sameer Agarwal is a fifth-year Ph.D. student in the AMPLab working on large-scale approximate query processing frameworks with Ion Stoica. His research interests are at the intersection of distributed systems, databases and machine learning. In the recent past, he has worked on designing dynamic parallel query optimization frameworks, building snapshot managers and proactive file replication schemes for distributed file systems, and exploring the usefulness of stateless packet classification protocols in datacenters. He received his B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Guwahati and was awarded the President of India Gold Medal in 2009. He was supported by the Qualcomm Innovation Fellowship during 2012-13 and the Facebook Graduate Fellowship during 2013-14.