
Divide and Recombine (D&R) for Large Complex Data

Hosted By
Jared L.

Details

We have two awesome people coming this month. Our speaker is Bill Cleveland, progenitor of the term "data science," inventor of the loess curve, and popularizer of trellis charts. He will be preceded by a brief introduction to the International Year of Statistics by Ronald Wasserstein, Executive Director of the American Statistical Association.

About the talk:

D&R is a new approach to the analysis of large complex data. The goals are the following: (1) Provide the data analyst with methods and a computational environment that enable study of large data with almost the same comprehensiveness and detail with which we can study small data. (2) The analyst uses an interactive language for data analysis that is both highly flexible and enables highly time-efficient programming with the data. (3) Underneath the language, a distributed database and parallel compute engine make computation feasible and practical, and are easily addressable from within the language. (4) The environment provides access to the thousands of analytic methods of statistics, machine learning, and visualization. (5) Get a reasonable first system going right away.

In D&R, the analyst divides the data into subsets. Computationally, each subset is a small dataset. The analyst applies analytic methods to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. Computations can be run in parallel with almost no communication among them, making them nearly embarrassingly parallel, the simplest possible parallel processing. One of our D&R research thrusts uses statistics to develop "best" division and recombination methods for analytic methods. This is critical because the division and recombination methods have an immense impact on the statistical accuracy of the D&R result for an analytic method. Another thrust is a D&R computational environment that has two widely used components, the R interactive environment for data analysis, and the Hadoop distributed database and parallel compute engine. Our RHIPE merger of them passes the embarrassingly parallel D&R computations off to Hadoop. The analyst programs this wholly from within R, insulated from the complexity of distributed database management and parallel computation.
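To make the divide/apply/recombine pattern concrete, here is a minimal sketch in Python (not RHIPE or R, and not Cleveland's actual implementation): the data are split into subsets, a least-squares line is fit to each subset independently, and the per-subset coefficients are recombined by averaging. The function names and the averaging recombination are illustrative assumptions, chosen only to show the shape of the computation.

```python
import numpy as np

def divide(x, y, n_subsets):
    """Divide step: split the data into roughly equal subsets."""
    return list(zip(np.array_split(x, n_subsets),
                    np.array_split(y, n_subsets)))

def fit_subset(x_sub, y_sub):
    """Apply the analytic method (least-squares line) to one subset."""
    return np.polyfit(x_sub, y_sub, deg=1)  # returns [slope, intercept]

def recombine(coefs):
    """Recombine step: average the per-subset coefficient estimates."""
    return np.mean(coefs, axis=0)

# Toy data with known linear structure: y = 3x + 1
x = np.arange(100, dtype=float)
y = 3.0 * x + 1.0

subsets = divide(x, y, n_subsets=5)
# Each per-subset fit is independent of the others, which is what
# makes the apply step embarrassingly parallel.
coefs = [fit_subset(xs, ys) for xs, ys in subsets]
slope, intercept = recombine(coefs)
```

Because the subset fits share no state, the list comprehension could be handed to any parallel map (in RHIPE, this is the work that gets passed off to Hadoop); the recombination step is the only place the outputs meet.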

About Bill Cleveland:

William S. Cleveland is the Shanti S. Gupta Distinguished Professor of Statistics and Courtesy Professor of Computer Science at Purdue University.

His areas of methodological research are in statistics, machine learning, and data visualization. He has analyzed data sets ranging from small to large and complex in his research in cyber security, computer networking, visual perception, environmental science, healthcare engineering, public opinion polling, and disease surveillance.

In the course of this work, Cleveland has developed many new methods and models for data that are widely used throughout the worldwide technical community. He has led teams developing software systems implementing his methods that have become core programs in many commercial and open-source systems.

In 1996 Cleveland was chosen national Statistician of the Year by the Chicago Chapter of the American Statistical Association. In 2002 he was selected as a Highly Cited Researcher by the American Society for Information Science & Technology in the newly formed mathematics category. He is a Fellow of the American Statistical Association, the Institute of Mathematical Statistics, the American Association for the Advancement of Science, and the International Statistical Institute.

Today, Cleveland and colleagues are developing the Divide & Recombine approach to large complex data. Each analytic method is applied independently to each subset in a division of the data into subsets. Then outputs are recombined. This enables a data analyst to carry out detailed, comprehensive analysis of big data, to use only the R interactive software environment, and to easily run analytic methods in parallel. This is achieved through (1) statistics research to find D&R division and recombination methods that give high statistical accuracy; (2) development of a D&R computational environment that merges R with the Hadoop distributed database and distributed parallel compute engine.

Pizza begins at 6:30, the talks start at 7, and afterward we will go to a nearby bar.

New York Open Statistical Programming Meetup
Knewton
100 5th Avenue · New York, NY