Past Meetup

Hadoop Summit Dublin: distro and ALOJA Big Data benchmarking talks

24 people went


This event will be held during the Hadoop Summit in Dublin (not in Barcelona, as usual). The talks are FREE; you don't need any pass.

The room is on the second level of the CCD (room name: Ecocem).

Confirmed talks:

• Hops: Distributed Metadata for Hadoop

• Automating Big Data Benchmarking with ALOJA. The talk is divided into 3 parts: benchmarking, deployment, and analytics (by 3 speakers).

Agenda for Tuesday 12:

18:00 - Arrive and meet members. There will be beers!

18:15 - Talk1: Hops: Distributed Metadata for Hadoop

19:20 - Talk2: Automating Big Data Benchmarking with ALOJA

20:00 - We go for beers and food (optional)

Talks and speakers info:

Talk1: Hops: Distributed Metadata for Hadoop

Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for both the NameNode and the ResourceManager using a pluggable NewSQL distributed database backend. Hops provides a metadata architecture that is both highly available and scalable, with fail-over for both NameNodes and ResourceManagers in a few seconds. In Hops, HDFS supports multiple stateless NameNodes (including a leader NameNode), and YARN supports a single ResourceManager (as the scheduler) along with multiple ResourceTrackers for scaling out communication with NodeManagers. We provide a generic API for NewSQL database backends, but our first release supports NDB (MySQL Cluster), an in-memory, distributed database that scales to 48 nodes and many TBs in size. We discuss potential new directions for Hadoop when metadata is available for tinkering in a mature relational database. Among these are multi-tenancy at the HDFS level using privileges stored in the database, extensible metadata for files and directories, and free-text search of HDFS metadata and extended metadata by exporting to ElasticSearch. For a workload from Spotify, Hops HDFS scales to handle 3 times the throughput of Apache HDFS, and Hops YARN scales to handle clusters twice the size of those handled by Apache YARN.

About the speaker: Jim Dowling is an Associate Professor at the School of Information and Communications Technology in the Department of Software and Computer Systems at KTH Royal Institute of Technology, as well as a Senior Researcher at SICS - Swedish ICT. He received his Ph.D. in Computer Science from Trinity College Dublin, Ireland (2005) and his docentship from KTH Royal Institute of Technology (2013). He is a distributed systems researcher whose interests lie in large-scale distributed computer systems. He is the coordinator of the EU FP7 BiobankCloud project, which is developing Big Data support for Biobanking and Next-Generation Sequencing data. He is lead architect for the Hadoop Open Platform, a next-generation Hadoop distribution.

Talk2: Automating Big Data Benchmarking with ALOJA

Optimizing Big Data execution environments often requires extensive benchmarking, manual fine-tuning of configuration parameters according to the underlying hardware, and hours of analyzing results. This workshop will give hands-on experience with the different aspects of fully automating Big Data benchmarking and analysis of Hadoop and ecosystem applications using open source tools. The goal is to save tedious hours of manual data processing, work more efficiently, and at the same time get the most value out of Big Data infrastructures.

The talk is divided into 3 parts (by 3 speakers from the team):

• Benchmarking intro and DEMO

• Deployment basics and DEMO

• Performance Analytics with Machine Learning

Tools and results for the workshop come from the ALOJA open source project, an initiative of the Barcelona Supercomputing Center and Microsoft Research. ALOJA provides tools to automate the benchmarking-to-knowledge process, as well as an online service to explore over 50k ready results featuring different applications, software configurations, data sizes, and more than 100 deployment options. Using a combination of slides and online demos, the talk will first guide Big Data practitioners through the benchmark repository, where users can quickly search for already performed benchmarks that resemble their infrastructures, and then show how to implement new benchmarks in the system or run custom jobs. The talk will end by briefly presenting the research Predictive Analytics features for modeling applications and predicting the best deployment configurations to further automate the optimization process.

About the speaker: Nicolas Poggi (@ni_po) is an IT researcher focusing on the performance and scalability of data-intensive applications and infrastructures. He is currently leading a research project on upcoming architectures for Big Data at the Barcelona Supercomputing Center (BSC) and Microsoft Research joint center. Nicolas received his PhD in Distributed Systems and Computer Architecture at UPC/BarcelonaTech, where he is part of the HPC and Data Centric Computing research groups. He has also been a Research Scholar at IBM Watson, working on Big Data and system performance topics. Nicolas can usually be found speaking at and organizing local IT meetup events.