PyData SG @ Strata + Hadoop World 2016!

Name: PyData SG @ Strata + Hadoop World 2016!
Start: 2016-12-06T19:00:00+08:00
End: 2016-12-06T22:00:00+08:00
Location: Suntec Singapore Convention and Exhibition Centre Summit 2 (Level 3)

Hosted by Talha O. and Anthony Y.

PyData Singapore

Details

VENUE IS NOW SUMMIT 2. See http://www.suntecsingapore.com/plan-your-event/space/floorplan-level-3/ .

For many of us, December is the time to wind down and relax, catch up with old and new friends, indulge in good food, reflect on how the year has been, and perhaps complete a couple of Python MOOCs that you have been wanting to take.

PyData Singapore will be holding its year-end meetup at Strata + Hadoop World 2016, and we want you to be part of it!

It will be held on the evening of December 6th, and it's free and open to all.

Agenda

• 7:00pm - 7:45pm Improving PySpark Performance: Spark performance beyond the JVM - Holden Karau

Abstract: This talk covers a number of important topics for making scalable Apache Spark programs with a special focus on Python, from RDD re-use to considerations for working with key/value data, why avoiding groupByKey is important, and more. The talk also includes Python-specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python, and UDF performance. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead of Python with Spark is too high.

• 7:45pm - 8:25pm Using PySpark and MLlib - Juliet Hougland

Abstract: Spark MLlib is a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take only a few lines of code, and leverage hundreds of machines. This talk will demonstrate how to use Spark MLlib to fit an ML model that can predict which customers of a telecommunications company are likely to stop using their service. It will cover the use of Spark's DataFrames API for fast data manipulation, as well as ML Pipelines for making the model development and refinement process easier.

• 8:25pm - 9:00pm Stream-1st Architecture, Apache Flink & Other Emerging Technologies - Ellen Friedman

Abstract: There’s a revolution underway in how people work with data. Streaming data is no longer seen as a special use case, and that’s a good thing given that streaming is a better fit to the way life happens. This talk takes a look at the benefits of stream-first architecture and some of the emerging technologies that enable best practices with streaming. These include message transport with Apache Kafka or MapR Streams, and stream processing with Apache Flink. As a top-level Apache project, Flink has an active and growing international community. The Flink engine offers robust, accurate and highly scalable real-time stream processing, and it also works in batch. We’ll briefly explore how these new approaches and technologies are being put to use in real-world situations.

Bios:

Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. Holden is a co-author of numerous books on Spark including High Performance Spark (which she believes is the gift of the season for those with expense accounts) & Learning Spark. She makes frequent contributions to Spark, specializing in PySpark and Machine Learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.

https://www.linkedin.com/in/holdenkarau

Juliet Hougland answers complex business problems using statistics to tame multi-terabyte datasets. Juliet's been sought after by Cloudera’s customers as a field-facing data scientist: advising on which tools to use, teaching how to use them, recommending the best approach to bring together the right data to answer the business problem at hand, and building production machine learning models. For many years Juliet has been a contributor in the open source community, working on projects such as Apache Spark, Scalding, and Kiji. Juliet is the Head of Data Science for Engineering at Cloudera.

https://www.linkedin.com/in/jhlch

Ellen Friedman is a consultant and commentator on big data topics. Active in open source, Ellen is a committer for the Apache Drill and Apache Mahout projects. She has a PhD in biochemistry and years of experience as a research scientist, and has written about a wide range of technical topics. She is co-author of many short O’Reilly big data books including the Practical Machine Learning series, Time Series Databases, Streaming Architecture and the latest, Introduction to Apache Flink. Follow Ellen on Twitter at @Ellen_Friedman.

https://www.linkedin.com/in/ellen-friedman-a93743

Updates

• Plans for next year: looking for speakers, venue and food sponsors.

Join us on Facebook and Twitter

https://www.facebook.com/groups/pydatasg/
https://www.twitter.com/pydatasg

PyData Singapore

Engineers.SG

JetBrains

O’Reilly Media

NumFocus
