Spark 101

Details

Note: Spark is thoroughly covered at Data By the Bay (http://data.bythebay.io), including two talks by Databricks engineers, on Spark streaming (http://schedule.bythebay.io/event/6YJv/deep-dive-and-best-practices-of-spark-streaming) and memory management (http://schedule.bythebay.io/event/6YJu/deep-dive-spark-memory-management). You will need to register soon to attend the conference, May 15-20.

MONSTROUSLY IMPORTANT: you need to re-register for Bloomberg security at http://www.bloomberg.com/event-registration/?id=53944.

Talk 1: Automatic checkpointing for Spark
Nimbus Goehausen, Senior Software Engineer, Bloomberg

Dealing with problems that arise when running a long process over a large dataset can be one of the most time-consuming parts of development. For this reason, many data engineers and scientists save intermediate results and use them to quickly zero in on the sections that have issues, avoiding rerunning sections that work as intended. For data pipelines with several stages, managing the saving and loading of intermediate results can become almost as complicated as the core problem the developers are trying to solve. Changes may require previously saved intermediate results to be invalidated and overwritten. This process is typically manual, and it is easy for a developer to mistakenly use outdated intermediate results. These problems are even worse when multiple developers share intermediate results.

These issues can be addressed by introducing a logical signature for datasets. For each dataset, we'll compute a signature based on the identity of the input and on the logic applied. If the input and logic stay the same for some dataset between two executions, the signature will be consistent and we can safely load previously saved results. If either the input or the logic changes, then the signature will change and the dataset will be freshly computed. With these signatures, we can implement automatic checkpointing that works even among several concurrent users, as well as other useful features.
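The idea above can be sketched in a few lines of plain Python. This is a hypothetical illustration of signature-based checkpointing, not the actual Bloomberg implementation: the function names (`signature`, `checkpointed`), the use of SHA-256 over an input identifier plus a logic description, and the pickle-file cache are all assumptions made for the example.

```python
import hashlib
import os
import pickle
import tempfile

def signature(input_id: str, logic_desc: str) -> str:
    """A dataset's signature combines the identity of its input
    with a description of the logic applied to it."""
    return hashlib.sha256(f"{input_id}|{logic_desc}".encode()).hexdigest()

def checkpointed(input_id, logic_desc, compute, cache_dir):
    """Load a saved result if input and logic are unchanged;
    otherwise recompute and save. Returns (result, was_cached)."""
    sig = signature(input_id, logic_desc)
    path = os.path.join(cache_dir, sig + ".pkl")
    if os.path.exists(path):          # signature match: reuse saved result
        with open(path, "rb") as f:
            return pickle.load(f), True
    result = compute()                # signature miss: compute freshly
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False

cache = tempfile.mkdtemp()
data = list(range(10))

# First run computes and saves; a second run with identical input and
# logic produces the same signature and loads the saved result.
r1, cached1 = checkpointed("data-v1", "sum-of-squares",
                           lambda: sum(x * x for x in data), cache)
r2, cached2 = checkpointed("data-v1", "sum-of-squares",
                           lambda: sum(x * x for x in data), cache)

# Changing the logic changes the signature, forcing recomputation.
r3, cached3 = checkpointed("data-v1", "sum-of-cubes",
                           lambda: sum(x ** 3 for x in data), cache)

print(r1, cached1, cached2, r3, cached3)  # → 285 False True 2025 False
```

In a real pipeline the "logic description" would need to be derived from the actual transformation code (so that edits invalidate stale checkpoints automatically), which is the hard part the talk addresses.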

Nimbus Goehausen is a senior software engineer at Bloomberg, where he works on Spark infrastructure and applications. Prior to joining Bloomberg he worked at Radius Intelligence, where he developed fuzzy business-matching pipelines using Spark and Hadoop. Having experienced many of the pains involved in developing complex big data pipelines, he's looking for ways to improve the development experience with Spark.

Nimbus has a bachelor's degree in physics and computer science from UC Berkeley, with a research focus in robotic control.

Talk 2: Spark 101
Mohammed Guller, Principal Architect, Glassbeam

Join us for a talk on Big Data and Apache Spark by Glassbeam's Principal Architect, Mohammed Guller. Spark, a fast general-purpose cluster computing framework, has set the Big Data world on fire. It has become the most active open-source big data project. This is an introductory talk for those who want to get into Big Data and learn about Spark, but don't know where to start.

The presentation will start with a discussion of Big Data, the challenges associated with it, and the benefits it provides. Mohammed will discuss how organizations are getting value out of Big Data.

He will next discuss some of the important technologies created in the last few years to handle Big Data. The Big Data space is exploding with new technologies. It seems like a new Big Data project is getting open-sourced every week.

Next, Mohammed will introduce Spark and talk about its role in the Big Data ecosystem. He will discuss why Spark is hot, why people are replacing Hadoop MapReduce with Spark, and what kind of applications really benefit from Spark. He will also discuss Spark's high-level architecture.

Finally, he will introduce the key libraries that come prepackaged with Spark. These libraries simplify a variety of analytical tasks, including interactive analytics, stream data processing, graph analytics, and machine learning.

Mohammed Guller is the principal architect at Glassbeam, where he leads the development of advanced and predictive analytics products. He is also the author of the recently published book, "Big Data Analytics with Spark." He is a Big Data and Spark expert. He is frequently invited to speak at Big Data–related conferences. He is passionate about building new products, Big Data analytics, and machine learning.

Over the last 20 years, Mohammed has successfully led the development of several innovative technology products from concept to release. Prior to joining Glassbeam, he was the founder of TrustRecs.com, which he started after working at IBM for five years. Before IBM, he worked in a number of hi-tech start-ups, leading new product development.

Mohammed has a master's degree in business administration from the University of California, Berkeley, and a master's degree in computer applications from RCC, Gujarat University, India.

SF Data and AI Engineering
Bloomberg
140 New Montgomery, 22nd Floor · San Francisco, CA