
Taking Spark To The Clouds and Incremental Updates in Machine Learning Models

Hosted By
Thomas L. and 2 others

Details

For PSUG's second meetup, we are inviting Qubole to come speak, as well as one of our local members, Robert Dodier. The meetup will be held in the same building, on the same floor.

Agenda:

6:00: Food/drinks arrive

6:20: Talk #1: Taking Spark to the Clouds

7:20: Questions

7:30: Talk #2: Exact and approximate incremental updates in machine learning models

8:20: Questions

8:30: chill + relax = chillax

Description

Talk #1:

Taking Spark to the Clouds

Abstract:

Spark provides significant speed boosts over competing tools thanks to its memory-based architecture. According to stats on Apache.org, Spark can “run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.” Spark is typically deployed in a dedicated data center as a next step in an organization's big data deployment strategy, to gain deeper and faster insights. However, as the advantages of big data in the cloud become more apparent and gain wider adoption, can organizations also reap the benefits of Spark as a service without sacrificing its primary benefit: speed? In other words, is Spark ready for the cloud? In this session, Jove Kuang, Solutions Architect at Qubole, will explore the benefits of separating compute from storage. He'll also explore use cases involving SparkSQL, SparkR, and other languages that can be used with Qubole's Notebook UI.

Speaker: Jove Kuang (https://www.linkedin.com/in/jovek)


Talk #2:

Exact and approximate incremental updates in machine learning models

Abstract:

Suppose we have an ML model trained on a data set. If we collect more data, it's typically assumed that to construct a model for the entire data set, we must retrain the model on all the data. But for some models, one can avoid training on the entire data set and instead combine the new data with a summary of the previous data, obtaining exactly the same result as training on the entire data set. The summary, if it exists, is called a "sufficient statistic". I'll talk about some basic models for which sufficient statistics exist (logistic regression, quadratic discriminant) and take a look at some code, based on Spark, for these models. I'll also talk about more complex models for which sufficient statistics do not exist (tree-structured models, neural networks) and describe a general approach for working around their absence.
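To make the idea concrete, here is a minimal plain-Python sketch (not the speaker's Spark code) of an exact incremental update for a 1-D Gaussian fit, the kind of class-conditional model used in quadratic discriminant analysis. The triple (count, sum, sum of squares) is a sufficient statistic: merging the summary of the old data with the summary of the new data reproduces exactly the parameters you would get by retraining on everything.

```python
def summarize(xs):
    """Sufficient statistics for a 1-D Gaussian: (n, sum x, sum x^2)."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge(s1, s2):
    """Combine two summaries; the raw data is no longer needed."""
    return tuple(a + b for a, b in zip(s1, s2))

def fit(stats):
    """Maximum-likelihood mean and variance from a summary."""
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean * mean

old_data = [1.0, 2.0, 3.0]
new_data = [4.0, 5.0]

# Incremental: summarize the old data once, then fold in the new batch.
incremental = fit(merge(summarize(old_data), summarize(new_data)))

# Batch: retrain on all the data.
batch = fit(summarize(old_data + new_data))

assert incremental == batch  # exact, not approximate
```

Tree-structured models and neural networks admit no such fixed-size summary, which is where the approximate techniques in the talk come in.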

Speaker: Robert Dodier (https://www.linkedin.com/in/robertdodier)

Portland Spark User Group
Columbia Square, 8th floor
111 SW Columbia St 8th floor · Portland, OR