Kick off the Fall with DC Spark!


Details
Hi, all. I hope everyone has been enjoying the summer and is ready for the fall! Please come out to hear some exciting Spark talks from Capital One and SnappyData.
TALK 1
Title: Deep dive into analysis of scalable DB Solution for Data Analytics Platform.
Abstract: To implement a scalable database solution for our analytics needs, the Capital One team analyzed Cassandra and HBase, which helped us make the right choice for our use case. We would like to present that analysis, the challenges we faced, and what we learned along the way. We will take a deep dive into each of the aspects we consider important and present some of the metrics we collected.
Presenter Bio: Srinivasarao Daruna: Senior Data Engineer at Capital One. Databricks Certified Spark Developer with strong knowledge of Big Data, Spark and a wide range of Hadoop ecosystem tools.
TALK 2
Title: Explore big data at speed of thought with Spark 2.0 and SnappyData
Abstract:
Data exploration often requires running aggregation and slice-and-dice queries on data drawn from disparate sources. You may want to identify distribution patterns, outliers, etc., and aid the feature selection process as you train your predictive models. As you begin to understand your data, you want to ask ad-hoc questions through your visualization tool (which typically translates them into SQL queries), study the results, and iteratively explore the data set through more queries. Unfortunately, even when data sets fit in memory, computations over large data sets take time, breaking the train of thought and increasing time to insight. We know Spark can be fast through its in-memory parallel processing, but Spark 1.x isn’t quite there. Spark 2.0 promises to offer 10X better speed than its predecessor and ushers in some impressive improvements to interactive query performance. We first explore these advances: compiling the query plan, eliminating virtual function calls, and other improvements in the Catalyst engine. We compare the performance to other popular query processing engines by studying the Spark query plans. We then go through SnappyData (an open source project that integrates Spark with a database that offers OLTP, OLAP and stream processing in a single cluster), where we use smarter data colocation and synopsis data (e.g. stratified samples) to dramatically cut down on memory requirements as well as query latency. We explain the key concepts in summarizing data using structures like stratified samples by walking through some examples in Apache Zeppelin notebooks (an open source visualization tool for Spark) and demonstrate how you can explore massive data sets with just your laptop's resources while achieving remarkable speeds.
Presenter Bio: Jags is a founder and the CTO of SnappyData. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served on the company's extended leadership team. At Pivotal and previously at VMware, he led the technology direction for GemFire and other distributed in-memory products.
PARKING:
Plenty of free parking is available on-site.
