Past Meetup

HS Munich: Tensorflow and Hadoop, Accelerating HBase, Hive vs. Spark


48 people went

Details

This event will be held during the Hadoop Summit in Munich (not in Barcelona, as usual). This is a FREE event; you don't need a pass.

The event will be held in room 11a

Confirmed talks:

• Tensorflow and Hadoop: Experiences and Opportunities

• Accelerating HBase with NVMe and Bucket Cache

• Using BigBench to compare Hive and Spark versions and features

Agenda for Tuesday 4th:

18:00 - Arrive and meet members.

18:05 - Talk 1: Tensorflow and Hadoop: Experiences and Opportunities

18:40 - Talk 2: Accelerating HBase with NVMe and Bucket Cache

19:10 - Talk 3: Using BigBench to compare Hive and Spark

20:00 - We go for beers and food (optional)

Talks and speakers info:

Talk 1: Tensorflow and Hadoop: Experiences and Opportunities

According to Andrej Karpathy, there are four main factors holding back AI: Compute, Data, Algorithms, and Infrastructure. In this talk, we will show how we attack the Data and Infrastructure challenges for Deep Learning. Specifically, we will show how we integrated Tensorflow with the world's most scalable and human-friendly distribution of Hadoop, Hops ( http://www.hops.io ). Hops is a new European distribution of Hadoop with a distributed metadata architecture and 16X the performance of HDFS. Hops also includes a human-friendly UI, called Hopsworks, with support for the Apache Zeppelin Notebook. We will show how users can run Tensorflow programs in Apache Zeppelin on huge datasets in Hops Hadoop. Moreover, we will show how Hopsworks makes discovering and downloading huge datasets a piece of cake with custom peer-to-peer sharing of datasets between Hopsworks clusters. A new user can, within minutes, install Hopsworks, discover important curated datasets, and download them to train deep neural networks using Tensorflow. Hops is the first Hadoop distribution to support Tensorflow.

Talk 2: Accelerating HBase with NVMe and Bucket Cache

The Non-Volatile Memory Express (NVMe) standard promises an order of magnitude faster storage than regular SSDs, while at the same time being more economical than regular RAM in terms of TB/$. This talk evaluates the use cases and benefits of NVMe drives in Big Data clusters with HBase and Hadoop HDFS.

First, we benchmark the different drives using system level tools (FIO) to get maximum expected values for each different device type and set expectations. Second, we explore the different options and use cases of HBase storage and benchmark the different setups. And finally, we evaluate the speedups obtained by the NVMe technology for the different Big Data use cases from the YCSB benchmark.

Talk 3: Using BigBench to compare Hive and Spark versions and features

BigBench is the brand new standard for benchmarking and testing Big Data systems. This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for both Hive and Spark, in their respective 1 and 2 versions, under different configurations. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis.

Speakers:

Talk 1: Jim Dowling is an Associate Professor at the School of Information and Communications Technology in the Department of Software and Computer Systems at KTH Royal Institute of Technology as well as a Senior Researcher at SICS - Swedish ICT. He received his Ph.D. in Computer Science from Trinity College Dublin, Ireland (2005) and his docenture from KTH - Royal Institute of Technology (2013). He is a distributed systems researcher and his research interests are in the area of large-scale distributed computer systems. He is the coordinator of the EU FP7 BiobankCloud project ( http://www.biobankcloud.eu ) that is developing Big Data support for Biobanking and Next-Generation Sequencing data. He is lead architect for the Hadoop Open Platform ( http://www.hops.io ), a next-generation Hadoop distribution.

Talk 2: Nicolas Poggi (@ni_po, https://twitter.com/ni_po) is an IT researcher with a focus on the performance and scalability of data-intensive applications and infrastructures. He is currently leading a research project on upcoming architectures for Big Data at the Barcelona Supercomputing Center (BSC) and Microsoft Research joint center. Nicolas received his PhD in Distributed Systems and Computer Architecture at UPC/BarcelonaTech, where he is part of the HPC and the Data Centric Computing research groups. He has also been a Research Scholar at IBM Watson, working on Big Data and system performance topics. Nicolas can usually be found speaking at and organizing local IT meetup events.

Talk 3: Alejandro Montero, Research Engineer at Barcelona Supercomputing Center.