This group is for users of Apache Spark in Vancouver. The goal of this Meetup is to build a close-knit community of Spark enthusiasts (from novice to experienced) that believes in knowledge sharing and collaborative learning. We want each member to be able to learn Spark, practice it and enlighten other members as well.
What Spark is:
Spark is a fast, fault-tolerant and expressive cluster-computing platform for interactive and real-time analytics applications. Although Spark is a fairly young technology, its adoption is growing rapidly, and it has received tremendous backing from several companies, including Amazon, Google, SAP, Databricks and Cloudera. More importantly, its powerful features and usability have already helped solve complex big data problems in various industries.
We’ll learn about Spark and its related projects, including Shark (Hive-on-Spark), Spark Streaming, GraphX, BlinkDB and MLlib. Meetups can cover the various Spark features, design patterns and best practices for deployment. In addition, we want to see Spark in action through demos or case studies on machine learning and data analytics problems. Other topics of interest are the integration of Spark with other tools and its role in the data science process. Everything that Sparks or Sparkles is welcome! Please join us if you are already using Spark or if you want to skate to where the puck is going to be.
R is a leading tool for data scientists and statisticians. With thousands of open-source packages available, R users can easily do all kinds of data processing tasks (exploratory analysis, visualization, forecasting, machine learning, etc.), all within the same platform. For example, the dplyr and ggplot2 packages have greatly simplified data manipulation and visualization. What's difficult is doing these things on very large datasets, because R usually runs these tasks on a single thread and can process only data that fits in a single machine’s memory.
To enable fast data analysis on terabytes or even petabytes of data, SparkR can be used to run Spark jobs in parallel, interactively, from the R console. This talk will introduce SparkR, Spark DataFrames and their interaction with Spark SQL. We will discuss some of SparkR's features and highlight the power of combining R and Spark through a demo. SparkR was recently merged into Spark as part of the new 1.4 release.
Date and location will be announced later.