December Meetup during Strata!


Details
http://photos2.meetupstatic.com/photos/event/d/0/d/3/600_444113459.jpeg
We are holding the Spark Meetup during the Strata + Hadoop World Conference in December. A series of meetups from the rest of the Big Data/Data Science communities in Singapore will also be held that week.
We will have speakers from Databricks, IBM, Tachyon Nexus and Cloudera!
Special thanks to O'Reilly Media for the room to host this meetup.
To our members: we have a special 20% discount code, UGSGSPARK. You can purchase your tickets at http://oreil.ly/SHWSG15UG.
We have a FREE pass to give away. Come join the meetup (https://www.meetup.com/Spark-Singapore/events/226693775/) together with PyData Singapore on Nov 17 to have a chance to win, proudly sponsored by O'Reilly Media.
=============
Speaker: Reynold Xin is a committer on Apache Spark and a co-founder of Databricks. Before Databricks, he was pursuing a Ph.D. at the University of California, Berkeley AMPLab.
Title: State of Spark, and where it is going.
Abstract:
In this extended version of his keynote talk, Reynold will look back and review Spark’s growth in adoption, use cases, and development. He will then look forward and discuss both technical initiatives and the evolution of the Spark community for 2016.
2015 is the year of data science and platformization for Apache Spark. With new high-level APIs (e.g. DataFrames, machine learning pipelines, R) and extension points, Spark is accessible to a wider set of users and can plug in a myriad of data sources, algorithms, and external packages. 2015 also marks the beginning of Project Tungsten, a major revamp of Spark’s execution engine to improve its robustness and performance. In 2016, we will continue pushing the boundaries of these dimensions, making Spark even easier and more powerful.
=============
Speaker: Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Title: Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
Abstract
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
- Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
- Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
- Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
- Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
- Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
- Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
Demos:
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on GitHub at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
=============
Speaker: Bin Fan is a software engineer at Tachyon Nexus. He is one of the top committers in the open source community of the Tachyon project. Prior to Tachyon Nexus, he worked at Google building next-generation storage infrastructure and won Google's Technical Infrastructure award. Bin has published in some of the most prestigious academic conferences in Computer Science, including SIGCOMM, SOSP, and NSDI, and received his PhD in Computer Science from Carnegie Mellon University.
Title: Introduction to Memory-centric Distributed Storage System Tachyon and Its Use Case
Abstract
Memory is the key to fast Big Data processing. This has been realized by many, and frameworks such as Spark and Shark already leverage memory performance. As data sets continue to grow, storage is increasingly becoming a critical bottleneck in many workloads.
To address this need, we have developed Tachyon, a memory-centric, fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. Tachyon caches working set files in memory and enables different jobs, queries, and frameworks to access cached files at memory speed. Unlike traditional replication schemes, Tachyon achieves fault tolerance by leveraging lineage information.
Over two years of research and production deployment, Tachyon as an open-source system has gained more than 160 contributors from over 50 institutions, including IBM, Yahoo, Intel, Red Hat, Baidu, and Tachyon Nexus. We had a major release, version 0.8, in October 2015, which included numerous new features. In particular, the latest version supports mounting multiple under storage systems with transparent naming, which enables more exciting use cases for Tachyon users; remote write, which allows users to write to Tachyon through remote workers; detailed monitoring of the master and workers; and integration with YARN and Mesos. In the presentation, we will also explore several potential industry use cases enabled by the new features.
=============
Speaker: Kostas Sakellis is a Software Engineer and contributor to Apache Spark. Previously, he contributed to the extensibility effort on Cloudera Manager. Before Cloudera, Kostas did a 6-year stint at Amazon working across various teams, including platform infrastructure. Kostas has a Bachelor of Mathematics in Computer Science from the University of Waterloo.
Title: Apache Spark 101: Taming Big Data
Abstract
Apache Spark has become a very popular data processing framework due to its speed and ease of use. In this talk we will start at the beginning and give a brief technical introduction to key Spark concepts and differentiators. We will dive deep into the core architecture of Spark and discuss improvements over MapReduce. We will conclude with a few examples of Spark being used to solve real-world problems.
=============
And many more. We will post updates once they are confirmed.
Thanks to John for helping out!
