Apache Spark - Cool Magic and ML Futures

Hosted By
Nancy B.
Details

As part of the Cognitive Frameworks Festival being held in various locations throughout San Francisco from June 5th to 8th, join us for an incredible lineup of speakers on June 8th.

"PySpark Beyond Shuffling - Why it isn't Magic - but also where there is some really cool magic"
Speaker: Holden Karau - Apache Spark Committer, Spark Technology Center - IBM

Apache Spark has been one of the most popular general-purpose distributed systems of the past few years. It has APIs in Scala, Java, and Python, and more recently there have been a few different attempts to provide support for R, C#, and Julia. This talk looks at Apache Spark from a performance and scaling point of view, and at the work we need to do to be able to handle large datasets. It focuses on how the Python APIs (known as PySpark) work, and where the magic breaks down.

In essence, parts of this talk could be considered "the impact of design decisions from years ago and how to work around them." It's not all doom and gloom, though: we will explore the new APIs and the exciting new things we can do with them, with a brief detour into how to work around some of the trade-offs in the new APIs, but mostly focused on the exciting, shiny new things we can play with. A basic background in Apache Spark will probably make the talk more exciting or depressing, depending on your point of view, but for those new to Apache Spark, just enough to understand what's going on will be covered at the start. The presenter would of course encourage you to buy and read her books on the topic ("Learning Spark" & "High Performance Spark"), because which presenter doesn't do that?

"Apache SystemML: State of the Project and Future Plans"
Speaker: Frederick Reiss - Chief Architect, Spark Technology Center - IBM

Apache SystemML is a system and language that supports rapid development of custom machine learning algorithms for large-scale problems. SystemML allows data scientists to write code once in terms of high-level linear algebra operations, then automatically generate low-level parallel versions of the program that are tuned to the characteristics of the data and to different parallel execution frameworks. The system consists of two major components: an optimizer that automatically parallelizes high-level code, and a runtime that evaluates the resulting execution plans at scale on Apache Hadoop, on Apache Spark, on large multi-core systems, and, more recently, on GPUs. This talk will start by describing the history of the project. I'll explain how the original research team from IBM advanced the state of the art in automatic parallelization and scalable linear algebra to build the optimizer and runtime, and how we turned the resulting research code into Apache SystemML. I'll describe how Apache SystemML has been used to implement state-of-the-art algorithms in the field. Finally, I'll talk about recent work on enhancing the system with compressed linear algebra, automatic generation of custom linear algebra kernels, and support for deep learning.

And more... a couple of Lightning Talks:

  1. Hyperparameter Optimization - when scikit-learn meets PySpark
     Speaker: Sven Hafeneger - Software Developer - Data Science Experience, Notebooks - IBM

Spark is not only useful when you have big data problems. Even with a relatively small data set, you might still have a big computational problem. One such problem is the search for optimal parameters for ML algorithms. Normally, a data scientist has a laptop with 4 cores (8 threads), which means it will take some time to perform a grid search. If you use Spark, however, the grid search can be carried out on a cluster with a higher degree of parallelism, reducing the time needed to find optimal parameters. This leads to a more interactive workflow and more fun during the modelling phase.

  2. TBD - will post shortly.
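
To make the idea behind the hyperparameter optimization lightning talk concrete, here is a minimal sketch of a parallel grid search. It is not taken from the talk: the `evaluate` function and its toy objective are hypothetical stand-ins for fitting and cross-validating a real model, and a local thread pool stands in for a Spark cluster so the example runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical scoring function: stands in for fitting and
    cross-validating a model with the given parameters."""
    c, gamma = params
    # Toy objective whose best score is at c=1.0, gamma=0.1.
    score = -((c - 1.0) ** 2) - ((gamma - 0.1) ** 2)
    return params, score


def grid_search(grid, max_workers=4):
    """Score every parameter combination in parallel and return the best."""
    # Assumption: a local thread pool plays the role of the cluster here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate, grid))
    return max(results, key=lambda result: result[1])


# Each combination is scored independently of the others,
# which is what makes the search easy to spread across workers.
grid = list(product([0.1, 1.0, 10.0], [0.01, 0.1, 1.0]))
best_params, best_score = grid_search(grid)
print(best_params, best_score)
```

On a cluster, the same pattern could plausibly be written in PySpark as `sc.parallelize(grid).map(evaluate).collect()`, assuming `sc` is an existing `SparkContext`; more workers then means more parameter combinations scored at once.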

Please ensure you bring a picture ID. Light refreshments will be served.

Data, Cloud and AI in Silicon Valley