• Productionizing your Streaming Jobs


    Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
      • Motivation and most common use cases for Spark Streaming
      • Common design patterns that emerge from these use cases, and tips to avoid common pitfalls while implementing these design patterns
      • Performance optimization techniques
    Register at: https://www.brighttalk.com/webcast/12891/202715

  • Enabling Exploratory Analysis of Large Data with R and Spark

    R has evolved to become an ideal environment for exploratory data analysis. The language is highly flexible - there is an R package for almost any algorithm, and the environment comes with integrated help and visualization. SparkR adds distributed computing and the ability to handle very large data to this list. SparkR is an R package distributed with Apache Spark. It exposes Spark DataFrames, which were inspired by R data.frames, to R. With Spark DataFrames and Spark’s in-memory computing engine, R users can interactively analyze and explore terabyte-size data sets.
    In this webinar, Hossein will introduce SparkR and show how it integrates the two worlds of Spark and R. He will demonstrate one of the most important use cases of SparkR: the exploratory analysis of very large data. Specifically, he will show how Spark’s features and capabilities, such as caching distributed data and integrated SQL execution, complement R’s great tools, such as visualization and diverse packages, in a real-world data analysis project with big data.
    Register at https://www.brighttalk.com/webcast/12891/202705

  • Apache Spark 2.0 presented by Databricks' Spark Chief Architect Reynold Xin

    In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release. The major themes for Spark 2.0 are:
      • Unified APIs: Emphasis on building up higher-level APIs, including the merging of the DataFrame and Dataset APIs
      • Structured Streaming: Simplifies streaming by building continuous applications on top of DataFrames, which allows us to unify streaming, interactive, and batch queries
      • Tungsten Phase 2: Speed up Apache Spark by 10X
    Register at https://www.brighttalk.com/webcast/12891/202021

  • GraphFrames: DataFrame-based graphs for Apache Spark

    GraphFrames bring the power of Apache Spark DataFrames to interactive analytics on graphs. Expressive motif queries simplify pattern search in graphs, and DataFrame integration allows seamlessly mixing graph queries with Spark SQL and ML. By leveraging Catalyst and Tungsten, GraphFrames provide scalability and performance. Uniform language APIs expose the full functionality of GraphX to Java and Python users for the first time.
    In this talk, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. This talk will be generally accessible, covering major improvements from GraphX and providing resources for getting started. A running example of analyzing flight delays will be used to explain the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms. For experts, this talk will also include a few technical details on design decisions, the current implementation, and ongoing work on speed and performance optimizations.
    Please register at: https://www.brighttalk.com/webcast/12891/199003

  • Spark MLlib: From Quick Start to Scikit-Learn

    In this webcast, Joseph Bradley from Databricks will be speaking about Spark’s distributed machine learning library, MLlib. We will start off with a quick primer on machine learning and Spark MLlib, followed by a quick overview of some Spark machine learning use cases. We will continue with multiple Spark MLlib quick start demos. Afterwards, the talk will transition toward the integration of common data science tools like Python pandas, scikit-learn, and R with MLlib.
    About the Presenter: Joseph Bradley is a Software Engineer and Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.
    Please register at http://go.databricks.com/spark-mllib-from-quick-start-to-scikit-learn

  • Jump Start into Apache Spark and Databricks

    Denny Lee, Technology Evangelist with Databricks, will provide a jump start into Apache Spark and Databricks. Spark is a fast, easy-to-use, unified engine that allows you to solve many Data Science and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download. This introductory-level jump start will focus on the following scenarios:
      • Quick Start on Spark: Provides an introductory quick start to Spark using Python and Resilient Distributed Datasets (RDDs). We will review how RDDs have actions and transformations and their impact on your Spark workflow.
      • A Primer on RDDs to DataFrames to Datasets: This will provide a high-level overview of our journey from RDDs (2011) to DataFrames (2013) to the newly introduced (as of Spark 1.6) Datasets (2015).
      • Just in Time Data Warehousing with Spark SQL: We will demonstrate a Just-in-Time Data Warehousing (JIT-DW) example using Spark SQL on an AdTech scenario. We will start with weblogs, create an external table with RegEx, make an external web service call via a Mapper, join DataFrames and register a temp table, add columns to DataFrames with UDFs, use Python UDFs with Spark SQL, and visualize the output - all in the same notebook.
    To join this event, please register at http://go.databricks.com/jump-start-into-apache-spark-and-databricks

  • Apache Spark 1.6 presented by Databricks co-founder Patrick Wendell

    Scheduled for: 12/01/2015 9:00am PT, 12:00pm ET, 5:00pm UTC
    Register Now (http://go.databricks.com/apache-spark-1.6-with-patrick-wendell)
    In this webcast, Patrick Wendell from Databricks will be speaking about Spark's new 1.6 release. Spark 1.6 will include, among other things, a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten to provide more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999], adaptive query execution [SPARK-9850], and unified memory management that consolidates cache and execution memory [SPARK-10000].

  • Transitioning from Traditional DW to Spark in OR Predictive Modeling

    Scheduled for October 21st,[masked]:00am-11:00am PST, 1:00pm-2:00pm EST, 5:00pm-6:00pm UTC
    The prevailing issue with Operating Room (OR) scheduling in a hospital setting is that it is difficult to schedule and predict available OR block times. This leaves operating rooms empty and unused, which in turn means longer waiting times for patients' procedures. In this three-part session, Ayad Shammout and Denny will show:
      1) How we tried to solve this problem using traditional DW techniques
      2) How we took advantage of the DW capabilities in Spark and easily transitioned to Spark MLlib, so we could more easily predict available OR block times, resulting in better OR utilization and shorter wait times for patients
      3) Some of the key learnings we had when migrating from DW to Spark
    Please sign up: http://go.databricks.com/transitioning-from-traditional-dw-to-spark-in-or-predictive-modeling

  • Apache Spark 1.5 presented by Databricks co-founder Patrick Wendell

    Session Info: In this webcast, Patrick Wendell from Databricks will be speaking about Spark's new 1.5 release. Spark 1.5 ships Spark's Project Tungsten initiative, a cross-cutting performance update that uses binary memory management and code generation to dramatically improve latency of most Spark jobs. This release also includes several updates to Spark's DataFrame API and SQL optimizer, along with new Machine Learning algorithms and feature transformers, and several new features in Spark's native streaming engine.
    Register now at: https://www.brighttalk.com/webcast/12891/168177
    Bio: Patrick Wendell is a co-founder and engineer at Databricks as well as a founding Committer and PMC member of Apache Spark. In the Spark project, Patrick has acted as release manager for several Spark releases, including Spark’s previous 1.4 release. Patrick also maintains several subsystems of Spark’s core engine. Before helping start Databricks, Patrick obtained an M.S. in Computer Science at UC Berkeley. His research focused on low latency scheduling for large-scale analytics workloads. He holds a B.S.E in Computer Science from Princeton University.

  • Spark DataFrames: Simple and Fast Analysis of Structured Data

    Presented by Michael Armbrust. This session will provide an introductory and technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or other external data sources.
    To join the session, please register at BrightTalk: https://www.brighttalk.com/webcast/12891/166495