
Taking Spark To The Clouds and Incremental Updates in Machine Learning Models

Hosted By
Thomas L. and 2 others

Details

For PSUG's second meetup, we are inviting Qubole to come speak, as well as one of our local members, Robert Dodier. The meetup will be held in the same building, on the same floor.

Agenda:

6:00: Food/drinks arrive

6:20: Talk #1: Taking Spark to the Clouds

7:20: Questions

7:30: Talk #2: Exact and approximate incremental updates in machine learning models

8:20: Questions

8:30: chill + relax = chillax

Description

Talk #1:

Taking Spark to the Clouds

Abstract:

Spark provides significant speed boosts over competing tools thanks to its memory-based architecture. According to stats on Apache.org, Spark can “run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.” Spark is typically deployed in a dedicated data center as a next step in an organization's big data deployment strategy, to gain deeper and faster insights. However, as the advantages of big data in the cloud become more apparent and gain wider adoption, can organizations also reap the benefits of Spark as a service without sacrificing its primary benefit: speed? In other words, is Spark ready for the cloud? In this session, Jove Kuang, Solutions Architect at Qubole, will explore the benefits of separating compute from storage. He'll also explore use cases involving SparkSQL, SparkR, and other languages that can be used with Qubole's Notebook UI.

Speaker: Jove Kuang (https://www.linkedin.com/in/jovek)


Talk #2:

Exact and approximate incremental updates in machine learning models

Abstract:

Suppose we have an ML model trained on a data set. If we collect more data, it's typically assumed that to construct a model for the entire data set, we must retrain the model on all the data. But for some models, one can avoid training on the entire data set and instead combine the new data with a summary of the previous data, obtaining exactly the same result as training on the entire data set. The summary, if it exists, is called a "sufficient statistic". I'll talk about some basic models for which sufficient statistics exist (logistic regression, quadratic discriminant) and take a look at some code, based on Spark, for these models. I'll also talk about more complex models for which sufficient statistics do not exist (tree-structured models, neural networks) and describe a general approach for working around their absence.
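To make the idea concrete, here is a minimal plain-Python sketch (not the speaker's Spark code) of an exact incremental update for a 1-D Gaussian fit, the kind of class-conditional model used in quadratic discriminant analysis. The triple (count, sum, sum of squares) is a sufficient statistic: merging the summary of the old data with the summary of the new data reproduces exactly the parameters you would get by retraining on everything.

```python
def summarize(xs):
    """Sufficient statistics for a 1-D Gaussian: (n, sum x, sum x^2)."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge(s1, s2):
    """Combine two summaries; the raw data is no longer needed."""
    return tuple(a + b for a, b in zip(s1, s2))

def fit(stats):
    """Maximum-likelihood mean and variance from a summary."""
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean * mean

old_data = [1.0, 2.0, 3.0]
new_data = [4.0, 5.0]

# Incremental: summarize the old data once, then fold in the new batch.
incremental = fit(merge(summarize(old_data), summarize(new_data)))

# Batch: retrain on all the data.
batch = fit(summarize(old_data + new_data))

assert incremental == batch  # exact, not approximate
```

Tree-structured models and neural networks admit no such fixed-size summary, which is where the approximate techniques in the talk come in.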

Speaker: Robert Dodier (https://www.linkedin.com/in/robertdodier)

Portland Spark User Group
Columbia Square, 8th floor
111 SW Columbia St 8th floor · Portland, OR