
Spark DataFrames + Spark on Google's GCP (Spark Summit East meetup)

Hosted By
Andy K. and Matthew H.

Details

At this meetup, co-organized with Spark Summit East (http://spark-summit.org/east), we will hear first about Spark on Google Cloud Platform, and second about the new Spark DataFrame abstraction.

Schedule for the evening:
6:30 - 7:00 :: Mingling
7:00 - 8:15 :: Talks
8:15 - 9:00 :: Mingling

The talks will be live streamed, and the video will be published on the Apache Spark channel (https://www.youtube.com/user/TheApacheSpark) on YouTube.

--------------

First, a demo of Spark on Google Cloud Platform.

  1. Seamlessly deploy Apache Spark 1.2 on Google Cloud Platform with the bdutil command-line tool, and start developing on your cluster within minutes.

  2. Take advantage of both Apache Spark and Google Cloud Dataflow: the open-source Spark Dataflow Runner lets you run the same code on-premises on Spark clusters and in the cloud with the managed Dataflow service.

Second, Michael Armbrust, Databricks software engineer and lead of the Spark SQL project, will present the new DataFrame abstraction in Spark for large-scale data science. (Editor's note: speaker change; this talk was previously to be given by Reynold Xin.)

Data frames in R and Python have become de facto standards for data science. When it comes to big data, however, neither R nor Python data frames integrate well with big-data tooling or scale to large datasets.

Inspired by R and Pandas, Spark's DataFrame provides concise, powerful programmatic interfaces designed for structured data manipulation. In particular, when compared with traditional data frame implementations, it enables:

  • Scaling from kilobytes to petabytes of data

  • Reading structured datasets (JSON, Parquet, CSV, relational tables, ...)

  • Machine learning integration

  • Cross-language support for Java, Scala, and Python

Internally, the DataFrame API builds on Spark SQL's query optimization and query processing capabilities for efficient execution. Data scientists and engineers can use this API to express common data-analytics operations more elegantly. It makes Spark more accessible to a broader range of users and improves optimization for existing ones.
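
To give a flavor of the API being presented, here is a minimal PySpark sketch; the file path and column names are hypothetical, and the calls reflect the Spark 1.3-era SQLContext API:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dataframe-demo")
    sqlContext = SQLContext(sc)

    # Load a structured dataset (JSON here); the schema is inferred automatically.
    # The path is a placeholder for illustration.
    df = sqlContext.jsonFile("hdfs:///data/people.json")

    # Concise, declarative manipulation; the query is planned and optimized
    # by Spark SQL before execution.
    df.filter(df.age > 21).groupBy("city").agg({"age": "avg"}).show()

    sc.stop()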

Spark-NYC
Google NYC
111 8th Ave entrance (door located closer to 16th street) · New York, NY