Revisiting the MapReduce Paradigm: an R-Specific View


Details
Speaking:
Dr. Norman Matloff is a professor of computer science at the University of California at Davis, and is a well-known expert in statistical computing and the R language. His 2011 book, The Art of R Programming, is considered one of the leading works of its kind. He is on the editorial board of the Journal of Statistical Software, and has recently presented invited papers at Interface 2012: The Future of Statistical Computing, useR! 2012, and JSM 2013.
Topic:
Revisiting the MapReduce Paradigm: an R-Specific View
After years of hype, the MapReduce paradigm, at least as embodied in Hadoop, has recently come under serious scrutiny. Though the idea of computation based on a distributed file system makes good sense, it is difficult to shoehorn many applications into the MapReduce framework. This is especially true for multipass applications.
These problems are partially ameliorated by the recently-introduced Spark language, which brings in higher-level parallel language constructs, and addresses the multipass problem via caching, yet other issues remain.
This talk will question the value of both Hadoop and Spark, and MapReduce in general, to us R users. Though Spark is an improvement on Hadoop and has an R interface, SparkR, I will argue that most R users are better off sticking to ordinary R methods. This means retaining the distributed file idea, but using R's ordinary 'parallel' package (the portion corresponding to the old Snow package) for the computation and data processing. In this manner, the R programmer avoids having to deal with pesky Java configuration headaches and having to learn new, abstract language constructs. Most importantly, this approach, which I call Snowdoop, should yield much faster performance in many applications.
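As a rough illustration of the idea described above — process chunks of a pre-split file with R's built-in 'parallel' package, no Hadoop or Spark layer — here is a minimal sketch. The chunk file names and the word-count task are illustrative assumptions, not taken from the talk itself.

```r
# A minimal "Snowdoop"-style sketch using only R's built-in 'parallel'
# package (the old Snow interface). File names and task are illustrative.
library(parallel)

# Stand-in for a pre-split distributed file: one chunk per worker node.
writeLines(c("apple banana", "apple"), "data.01")
writeLines("banana banana", "data.02")
chunks <- normalizePath(c("data.01", "data.02"))

cl <- makeCluster(length(chunks))  # one worker per chunk

# "Map" step: each worker reads only its own chunk and tallies word counts.
partials <- clusterApply(cl, chunks, function(f) {
  words <- scan(f, what = "", quiet = TRUE)
  table(words)
})

stopCluster(cl)

# "Reduce" step: merge the partial tallies on the manager node.
total <- Reduce(function(a, b) {
  nms <- union(names(a), names(b))
  out <- setNames(numeric(length(nms)), nms)
  out[names(a)] <- out[names(a)] + a
  out[names(b)] <- out[names(b)] + b
  out
}, lapply(partials, function(tb) setNames(as.numeric(tb), names(tb))))

total
```

The point of the sketch is that the "map" and "reduce" roles fall out of ordinary `clusterApply()` and `Reduce()` calls, with no Java configuration and no new language constructs to learn.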
The presentation will be nonpolemical, and lively feedback from MapReduce adherents is welcome.
