Past Meetup

Big Data Application Meetup 09/14: Cask Tracker, Ampool and SnappyData!

125 people went

Details

Shoutout to Ampool (http://www.ampool.io/) and Cask (http://cask.co) for kindly sponsoring and hosting this meetup!

Cask will also be giving away a BB-8 Droid (http://www.sphero.com/starwars/bb8). Enter the raffle on the day of the event for a chance to win.

AGENDA

6:00 - 6:30 - Socialize over food and beer(s)

6:30 - 8:00 - Talks

TALKS

Talk #1: “Who Moved my Data? - Why tracking changes and sources of data is critical to your data lake success” - by Russ Savage, Cask

Talk #2: One size doesn’t fit all: making a case for Federated Data Science using Ampool - by Nitin Lamba, Suhas Gogate, Ampool

Talk #3: Analyze Ad impressions at speed of thought using Spark 2.0 and SnappyData - by Jags Ramnarayan, SnappyData

ABSTRACTS

Talk #1: “Who Moved my Data? - Why tracking changes and sources of data is critical to your data lake success” - by Russ Savage, Cask

As data lakes grow and more users begin exploring that data and including it in their everyday analysis, keeping track of where the data came from becomes critical. Understanding how a dataset was generated and who is using it allows users and companies to ensure their analysis draws on the most accurate and up-to-date information. In this talk, we will explore the different techniques available for keeping track of the data in your data lake and demonstrate how we at Cask approached and attempted to mitigate this issue.

Talk #2: One size doesn’t fit all: making a case for Federated Data Science using Ampool - by Nitin Lamba, Suhas Gogate, Ampool

Anomaly detection is a very common pattern, used not only in financial transactions but also in finding abnormal behavior in health monitoring and IoT. Even more common is the use of multiple analytical tools in data science (Python, R, Apache Spark, to name a few), especially in large multi-tenant environments. Enterprises spend a lot of time moving & copying data to cater to these needs. Instead of having disparate back-end systems feed these tools, a simpler approach is to separate the concerns of compute and fast data serving.

In this talk, we will walk through such an anomaly detection use case, where an in-memory data service layer serves hot, high-value data to different tools from a single, scalable cluster. This not only reduces data copies but also mitigates operational complexity (fewer moving parts). We illustrate how a single data flow can use these multiple engines, making timely, actionable insights a reality, and run concurrent analytics workloads at in-memory speeds.

Talk #3: Analyze Ad impressions at speed of thought using Spark 2.0 and SnappyData - by Jags Ramnarayan, SnappyData

In Ad Analytics you have to deal with consolidated ad impression streams from many sites, cleanse them, manage the deluge by pre-aggregating and tracking metrics per minute, store all recent data in an in-memory store along with history in a data lake, and permit interactive analytic queries on this constantly growing data.

Rather than stitching together multiple clusters as proposed in the Lambda architecture, we walk through a design where everything is achieved in a single, horizontally scalable Spark 2.0 cluster: stream ingestion (parallel ingest, continuous stream analytics), storing into an in-memory store, overflowing to Hadoop, and interactive analytic queries that combine history with streams. The result is a design that is simpler and a lot more efficient.

We cover how the new Spark 2.0 enhancements make continuous analytics very simple, and also discuss how deeply integrating a transactional+analytics in-memory database, fully collocated with Spark executors, offers significant benefits: Spark is now capable of managing mutable, transactionally consistent data and indexes, and can run concurrent analytics queries at in-memory speeds.

SPEAKER BIOS

• Russ Savage leads the application engineering team at Cask, focusing on building end-to-end big data applications using the Cask Data Application Platform (CDAP). He believes that the true value of Hadoop and other big data technologies is only unlocked when they are used to solve problems and provide value to the business. He previously worked at Elastic as a solutions architect, building tools that combined the many internal data sources to provide new insights to the company.

• Nitin Lamba leads product management at Ampool, a company he co-founded last year. Prior to Ampool, he worked at a robotics company that builds ocean drones using a real-time Java platform. Before that industrial IoT start-up, he had been with Pivotal for over a year, leading in-memory data grid and monitoring/management of Data Fabric products.

• Suhas Gogate has over 22 years of experience in building distributed computing systems & database internals. He worked on Big Data and Hadoop platforms for several years at Yahoo!, Netflix, Hortonworks, and EMC-Greenplum/Pivotal from the early stages of Hadoop development and contributed various innovative features to Hadoop. He has also given several talks & training sessions on advanced Hadoop technologies at various meetups and conferences.

Suhas was one of the founding members at Hortonworks and is also the founder and a PMC member of the Apache Ambari project, an open source install, management and monitoring solution for Hadoop.

• Jags Ramnarayan is a founder and the CTO of SnappyData. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served on the extended leadership team of the company. At Pivotal and previously at VMware, he led the technology direction for GemFire and other distributed in-memory products.

ARRIVAL AND PARKING

Cask HQ is a few minutes' walk from the California Avenue Caltrain Station.

Also, Cask HQ has its own parking lot, but it will certainly not accommodate all guests. Please use parking lots available nearby: