Recent Developments in SparkR for Advanced Analytics

Details
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then, we demonstrate MLlib machine learning algorithms that have been ported to SparkR, e.g., generalized linear models and classification and clustering algorithms, and compare them with existing solutions in R. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
Speaker: Xiangrui Meng is a technical lead of machine learning and data science at Databricks. His main interests center around building simple and scalable solutions for advanced analytics. He is also a PMC member and committer of Apache Spark, primarily contributing to MLlib, PySpark, and SparkR. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford is on randomized algorithms for large-scale linear regression problems.
Company: Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. We were founded by the team who created Apache® Spark™, the powerful open source data processing engine, and provide the just-in-time data platform to simplify data integration, real-time experimentation, and robust deployment of production applications.
