Big Data Application Meetup 06/15

157 people went

AGENDA

6:00 - 6:30 - Socialize over food and beer(s)

6:30 - 8:00 - Talks

TALKS

Talk #1: Building automated data pipelines fast and code-free with Cask Hydrator, by Gokul Gunasekaran from Cask

Talk #2: Unified access framework for distributed data systems on HDFS, by Shivram Mani from Pivotal

Talk #3: Practical TensorFlow, by Illia Polosukhin from Google

ABSTRACTS

Talk #1: Building automated data pipelines fast and code-free with Cask Hydrator, by Gokul Gunasekaran from Cask

Cask Hydrator is an extension to the open source Cask Data Application Platform (CDAP) that simplifies developing and operating real-time and batch data pipelines on Hadoop. Hydrator’s web-based drag-and-drop UI lets users quickly build Hadoop-scalable, distro-agnostic data pipelines without writing any code.

Powered by CDAP (http://cdap.io), Hydrator provides ease of operability through metadata, lineage, metrics, and log collection in a single location. In this talk, we will build data pipelines with real-life applications that pull in data from multiple sources, train and use a machine learning model to classify data using Spark MLlib, and write data to different sinks. We will also delve under the covers to see how these data pipelines are transformed into a series of MapReduce/Spark jobs, and touch upon some interesting challenges we had to tackle while developing Hydrator.
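The source → transforms → sink shape that Hydrator composes visually can be sketched in a few lines of plain Python. This is a toy illustration of the pipeline concept only; the class and method names are made up for this sketch and are not CDAP's actual API.

```python
# Toy sketch of a source -> transforms -> sink data pipeline.
# Names are illustrative placeholders, not CDAP/Hydrator APIs.

class Pipeline:
    def __init__(self, source, transforms, sink):
        self.source = source          # callable yielding input records
        self.transforms = transforms  # list of record -> record callables
        self.sink = sink              # list collecting output records

    def run(self):
        for record in self.source():
            for transform in self.transforms:
                record = transform(record)   # apply each stage in order
            self.sink.append(record)
        return self.sink

# Example: read numbers, scale then shift them, write to an in-memory sink.
sink = []
pipeline = Pipeline(
    source=lambda: range(5),
    transforms=[lambda r: r * 10, lambda r: r + 1],
    sink=sink,
)
pipeline.run()
print(sink)  # [1, 11, 21, 31, 41]
```

In Hydrator the same composition is done by dragging stages onto a canvas, and the resulting pipeline is compiled down to MapReduce or Spark jobs rather than a simple loop.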

Talk #2: Unified access framework for distributed data systems on HDFS, by Shivram Mani from Pivotal

We live in a world of multiple storage systems, each optimized for a different data model. This makes it challenging to run aggregate analysis across the varied storage engines on HDFS.

PXF provides a unified, extensible framework for solving precisely this problem. Its pluggable architecture makes it convenient to add plugins for custom data sources. Existing plugins support loading and querying data stored in HDFS, HBase, and Hive, across a wide range of formats including Text, Avro, Sequence, Hive RCFile, ORC, and Parquet.

Example use cases include using statistical and analytical functions along with filter pushdown from Postgres or Apache HAWQ on HDFS, HBase, and Hive data, joining in-database dimensions with HBase facts, leveraging analytical capabilities on Hadoop data files, and fast ingest of data into HAWQ for in-database processing and analytics.
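In HAWQ, PXF is typically reached through an external table whose LOCATION uses a pxf:// URL naming the HDFS path and a profile. The sketch below assembles that kind of DDL as a string; the host, port, path, profile name, and format clause are illustrative placeholders, and the exact options for each plugin should be taken from the PXF documentation.

```python
# Hedged sketch: build a HAWQ-style external-table DDL for querying
# HDFS data through PXF. All specifics (host, port, profile, format
# clause) are placeholder assumptions, not verified PXF syntax.

def pxf_external_table_ddl(table, columns, hdfs_path, profile,
                           host="namenode", port=51200):
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    location = f"pxf://{host}:{port}{hdfs_path}?PROFILE={profile}"
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"LOCATION ('{location}') "
        "FORMAT 'TEXT' (DELIMITER ',');"
    )

ddl = pxf_external_table_ddl(
    table="ext_sales",
    columns=[("id", "int"), ("total", "float8")],
    hdfs_path="/data/sales",
    profile="HdfsTextSimple",
)
print(ddl)
```

Once such a table is defined, the SQL engine can join it against in-database tables and push filters down to the underlying storage plugin.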

PXF is an open source project that is currently used by Apache HAWQ, and is in the process of being integrated with other SQL engines.

Talk #3: Practical TensorFlow, by Illia Polosukhin from Google

Deep Learning has unlocked new types of interactions, products, and understanding, especially in the last 5 years. The field is moving very quickly, and getting the latest innovations into released products is a challenge. TensorFlow is growing into a platform that lets researchers and industry collaborate, as it provides tools for both experimentation and deep learning in production. This talk will cover some recent developments in Deep Learning and what kinds of experiences entrepreneurs and hackers can build using TensorFlow.

SPEAKER BIOS

• Gokul Gunasekaran is a software engineer at Cask Data, where he works on the open source Cask Data Application Platform, which enables rapid development, deployment, operationalization, and management of Hadoop big data applications. He is also a committer on Apache Tephra™ (incubating), which brings ACID transactions to NoSQL databases such as HBase. Before Cask, he was at Oracle/Sun Microsystems working on the Data Analytics Accelerator (DAX) to improve database performance on SPARC processors. He is an alumnus of BITS Pilani and Stanford University.

• Shivram Mani is a committer on Apache HAWQ and a long-time distributed systems enthusiast. He holds a Master’s in Computer Science and interned with Google’s monetization engine. After his internship he worked on Yahoo’s web search federation platform and on the search analytics team, which was one of the first pilot users of Hadoop. As a staff engineer at Greenplum and Pivotal he worked on a unified storage system for Greenplum Hadoop, and he is currently working on PXF and HAWQ.

• Illia Polosukhin is a researcher at Google, where he leads a team working on Natural Language Understanding research using Deep Learning. His team also develops better tools within TensorFlow, such as Scikit Flow and TensorFlow Learn.

ARRIVAL AND PARKING

Cask HQ is a few minutes’ walk from the California Avenue Caltrain Station.

Cask HQ also has its own parking lot, but it will not accommodate all guests. Please use the parking lots available nearby: