Divide and Recombine (D&R) for Large Complex Data

  • October 22, 2013 · 6:30 PM

We have two awesome people coming this month.  Our speaker is Bill Cleveland, progenitor of the term data science, inventor of the loess curve and popularizer of trellis charts.  He will be preceded by a brief introduction to the International Year of Statistics by Ronald Wasserstein, Executive Director of the American Statistical Association.

About the talk:

D&R is a new approach to the analysis of large complex data. The goals are the following: (1) Provide the data analyst with methods and a computational environment that enable study of large data with almost the same comprehensiveness and detail that we can achieve with small data. (2) The analyst uses an interactive language for data analysis that is both highly flexible and enables highly time-efficient programming with the data. (3) Underneath the language, a distributed database and parallel compute engine make computation feasible and practical, and are easily addressable from within the language. (4) The environment provides access to the thousands of analytic methods of statistics, machine learning, and visualization. (5) Get a reasonable first system going right away.

In D&R, the analyst divides the data into subsets. Computationally, each subset is a small dataset. The analyst applies analytic methods to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. Computations can be run in parallel with almost no communication among them, making them nearly embarrassingly parallel, the simplest possible parallel processing. One of our D&R research thrusts uses statistics to develop "best" division and recombination methods for analytic methods. This is critical because the division and recombination methods have an immense impact on the statistical accuracy of the D&R result for an analytic method. Another thrust is a D&R computational environment that has two widely used components: the R interactive environment for data analysis, and the Hadoop distributed database and parallel compute engine. Our RHIPE merger of them passes the embarrassingly parallel D&R computations off to Hadoop. The analyst programs this wholly from within R, insulated from the complexity of distributed database management and parallel computation.
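The divide/apply/recombine pattern above can be sketched in a few lines. This is a hypothetical illustration in Python, not the RHIPE API (which works from within R on top of Hadoop): we estimate the mean of a dataset by dividing it into subsets, applying the analytic method (the sample mean) to each subset independently, and recombining the subset outputs with a weighted average.

```python
def divide(data, n_subsets):
    """Divide the data into n_subsets roughly equal contiguous subsets."""
    size = (len(data) + n_subsets - 1) // n_subsets
    return [data[i:i + size] for i in range(0, len(data), size)]

def apply_method(subset):
    """The analytic method, run independently on each small subset.
    Returns (subset size, subset mean) so the recombination can weight it."""
    return (len(subset), sum(subset) / len(subset))

def recombine(outputs):
    """Recombine subset outputs via a weighted average of subset means."""
    total = sum(n for n, _ in outputs)
    return sum(n * m for n, m in outputs) / total

data = list(range(1, 101))                    # stand-in for a large dataset
subsets = divide(data, 10)                    # divide
outputs = [apply_method(s) for s in subsets]  # apply: each call is independent,
                                              # so these could run in parallel
result = recombine(outputs)                   # recombine
print(result)                                 # 50.5, identical to the all-data mean
```

For the sample mean this recombination is exact. For most analytic methods (e.g. regression coefficients) it is not, which is why the statistical research described above matters: the choice of division and recombination methods determines how close the D&R result comes to the all-data answer.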

About Bill Cleveland:

William S. Cleveland is the Shanti S. Gupta Distinguished Professor of Statistics and Courtesy Professor of Computer Science at Purdue University.

His areas of methodological research are in statistics, machine learning, and data visualization. He has analyzed data sets ranging from small to large and complex in his research in cyber security, computer networking, visual perception, environmental science, healthcare engineering, public opinion polling, and disease surveillance.

In the course of this work, Cleveland has developed many new methods and models for data that are widely used throughout the worldwide technical community. He has led teams developing software systems implementing his methods that have become core programs in many commercial and open-source systems.

In 1996 Cleveland was chosen national Statistician of the Year by the Chicago Chapter of the American Statistical Association. In 2002 he was selected as a Highly Cited Researcher by the American Society for Information Science & Technology in the newly formed mathematics category. He is a Fellow of the American Statistical Association, the Institute of Mathematical Statistics, the American Association for the Advancement of Science, and the International Statistical Institute.

Today, Cleveland and colleagues are developing the Divide & Recombine approach to large complex data. Each analytic method is applied independently to each subset in a division of the data into subsets. Then outputs are recombined. This enables a data analyst to carry out detailed, comprehensive analysis of big data, to use only the R interactive software environment, and to easily run analytic methods in parallel. This is achieved through (1) statistics research to find D&R division and recombination methods that give high statistical accuracy; (2) development of a D&R computational environment that merges R with the Hadoop distributed database and distributed parallel compute engine.

Pizza begins at 6:30, the talks at 7, and then we will go to a nearby bar.


  • Zac P.

    The speaker was very knowledgeable and the software he discussed is very innovative.

    November 12, 2013

  • Dan B.

    Would have loved a whole session with each of the speakers, but with the double header we got good lift on value per minute.

    October 23, 2013

  • Andrew L.

    This talk was an interesting window into very serious, highly technical work on data computation and analysis. However, it would have been better if Bill had talked about the kind of analysis that would benefit from his methodology and used an example to contrast his process with the way that data of large size and complexity is currently processed. Jared was a great host, as usual.

    October 23, 2013

  • Douglas B S.

    I wanted to see more examples with cost benefit results. Maybe a RHIPE hello world with sample code.

    October 23, 2013

  • Christina G.

    would love to hear this one but have a previous obligation; will look forward to material posted afterward?

    October 20, 2013

    • Jared L.

      We post slides and video whenever possible.

      1 · October 20, 2013
