Study Group: EdX Intro to Big Data with Spark


Details
Apache Spark is a fast-growing open-source big data framework for batch processing, distributed machine learning and streaming data analysis. It can deliver a 10x-100x speedup over Hadoop MapReduce for many tasks.
-------------------
There are a couple of 4-5 week EdX Spark-related MOOCs coming up:
• Intro to Big Data with Apache Spark (4 weeks starting June 1st) (https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x#!)
• Scalable Machine Learning with Spark (5 weeks starting June 29th) (https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x#!)
-------------------
This Meetup event is for the first of the two Spark MOOCs. Our first session will be on Saturday, June 6th, and sessions will then run weekly for the duration of the MOOC.
So if you're interested in following along with a few like-minded souls, then come along!
We'll announce a venue shortly, and will aim for somewhere near the CBD that is free (or at least cheap).
See below for a bit more detail, or check out the URL above for the full run-down.
-------------------
EdX Intro to Big Data with Apache Spark
Organizations use their data for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Log Mining, Textual Entity Recognition, and Collaborative Filtering exercises that teach students how to manipulate data sets using parallel processing with PySpark.
This course covers advanced undergraduate-level material. It requires a programming background and experience with Python (or the ability to learn it quickly). All exercises will use PySpark (part of Apache Spark), but previous experience with Spark or distributed computing is NOT required. Students should take this Python mini-quiz (http://www.mypythonquiz.com/) before the course and take this Python mini-course (http://ai.berkeley.edu/tutorial.html#PythonBasics) if they need to learn Python or refresh their Python knowledge.
What you'll learn
• Use Apache Spark to perform data analysis
• Use parallel programming to explore data sets
• Apply Log Mining, Textual Entity Recognition and Collaborative Filtering to real world data questions
• Prepare for the Spark Certified Developer exam (not required)
