Past Meetup

23. High performance data flow & Multi-tenant Hadoop-as-a-Service

210 people went

• 17.45: drink, socialize

• 18.00: first talk: High performance data flow with a GUI, and guts

Speaker: Simon Ball - Principal Solutions Engineer at Hortonworks.

Simon is a Principal Solutions Engineer at Hortonworks, where he helps clients do Hadoop. He is a certified Spark and Hadoop developer. Previously, he worked in the data-intensive worlds of hedge funds and financial trading, ERP and e-commerce, as well as designing and running nationwide networks and websites.

In the course of those roles, he’s designed and built several organisation-wide data and networking infrastructures, headed up research and development teams, and designed (and implemented) numerous digital products and high-traffic transactional websites.

For a change of technical pace, he hacks on Spark and Machine Learning, and tries to teach his home automation system to behave itself.

Abstract: Apache NiFi has seen it all. It worked for the NSA, after all. What it brings to the Hadoop ecosystem is a series of data flow and ingest patterns, a GUI, and a lot of security and record-level data provenance.

This is a look under the covers of Apache NiFi and its innovations around content and provenance repositories. The focus is on how NiFi achieves its throughput and performance, with a deep dive into the internal data structures and code that allow you to make tradeoffs between latency and throughput, or resilience and speed, in real time.

We will also pull apart some of the key processors that make up NiFi data flows and examine the clues they offer for writing high-performance data flows on top of the NiFi framework.

• 18.45: eat, drink, socialize (more)

• 19.00: second talk: A Swedish First: Multi-tenant Hadoop-as-a-Service

Speaker: Jim Dowling - Distributed systems researcher at SICS Swedish ICT and an Associate professor (docent) at KTH.

Dr. Jim Dowling is a distributed systems researcher at SICS Swedish ICT and an Associate professor (docent) at KTH. He has a strong interest in applying mechanisms from complex systems and self-organization to build better computer systems. In general, his research has concerned improving fundamental properties of software systems, including reliability, availability, persistence, security, scalability and performance.

Abstract: Hadoop Open Platform-as-a-Service (Hops) is a new distribution of Apache Hadoop that is based on a next-generation, scale-out architecture for HDFS and YARN metadata. Since February 2016, Hops has been provided as software-as-a-service for researchers and companies in Sweden from the Swedish ICT SICS Data Center.

One of the goals of Hops is to make Hadoop easier to use for researchers who may not be data engineers. To this end, we have developed a new user interface to Hops, called HopsWorks, that supports true multi-tenancy in Hadoop. That is, researchers and companies can securely share the same Hadoop cluster resources. This contrasts with existing models for multi-tenancy in Hadoop that limit organizations to running separate Hadoop clusters on virtualized or containerized platforms. Our model for multi-tenancy is based around projects. Users can create projects, manage CPU quotas and disk quotas for projects, control membership of projects, and securely share data between projects. Users in HopsWorks can also make use of data analytics frameworks such as Apache Spark, Apache Flink, and MapReduce, as well as user-interface-driven services such as Apache Zeppelin for interactive analytics and ElasticSearch for free-text search for files and directories in HDFS, as well as extended metadata. Hops and HopsWorks are open-source projects and have been developed at the Distributed Computing Lab, a collaboration between Swedish ICT SICS and KTH.

• 19.45: drink, socialize (even more)

Follow SHUG on Twitter!