Past Meetup

SHUG 9. Dealing with Dirty Data (+) Text analytics in IBM's big data solutions


118 people went



We are happy to invite you to the 9th meeting of Stockholm Hadoop User Group! We will have two talks!

Please find the details below:

Title: Dirty data: dealing with substantial volume external sources

Speaker: Friso van Vollenhoven


At a large European publisher, we faced the challenge of receiving external data from 10+ different sources, adding up to tens of GBs per day. On top of that, file formats sometimes change without notification, connections go bad, and files go missing. All this while trying to ensure that at least the amount of data is close to correct, in an environment where the 'correct' amount of data for a source is often a hard-to-predict number somewhere between 20M and 50M records for a given day.

We built an extract-and-load pipeline to get data into Hadoop and expose it via Hive tables. It includes scheduling, reporting, monitoring, transforming and, above all, the ability to respond to changes very quickly. After all, responding to a file format change within the same day or adding a new source in a day are very reasonable user requests (right?). We focused on developer friendliness and relied on a fully open source stack, using Hadoop, Hive, Jenkins, various scripting languages and more. This talk covers the setup and our lessons learned.

In our quest for data quality, we also did work on attempting to predict the expected data volumes, based on seasonality and weather information, in order to proactively alert when a data import appears to fall short of the expected volume. I will include these results in the talk.
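As a rough illustration of this kind of volume check (not the method presented in the talk), the sketch below uses only a per-weekday median as the seasonal baseline and ignores weather entirely. The function names, the 20% tolerance, and the record counts are all made up for the example.

```python
from statistics import median


def expected_volume(history, weekday):
    """Median record count over past days that share the given weekday."""
    same_day = [count for day, count in history if day == weekday]
    if not same_day:
        raise ValueError(f"no history for weekday {weekday}")
    return median(same_day)


def falls_short(actual, expected, tolerance=0.2):
    """True when the import is more than `tolerance` below the expectation."""
    return actual < expected * (1 - tolerance)


# Hypothetical history: (weekday, record count) pairs, weekday 0 = Monday.
history = [
    (0, 30_000_000),
    (0, 34_000_000),
    (0, 32_000_000),
    (5, 48_000_000),
    (5, 46_000_000),
]

baseline = expected_volume(history, 0)       # Monday baseline
alert = falls_short(24_000_000, baseline)    # today's Monday import
```

A median is used rather than a mean so that a single broken import in the history does not drag the baseline down.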

Bio: Friso is a developer who has lately been setting up and using Hadoop a lot for a living. Also, he is a trainer teaching the Cloudera Hadoop developer classes and (co-)organizer of the Dutch Hadoop community meetup (NL-HUG) and the Dutch NoSQL NL meetup.


Title: Text analytics capabilities in IBM's big data solutions

Speaker: Claus Samuelsen

A large part of the data stored in Hadoop is text-based, so an efficient text analytics capability is often required. IBM's System-T is a Natural Language Processing (NLP) system with an easy-to-use high-level programming language, AQL, which is executed as MapReduce jobs in Hadoop.
In this talk I will present the Annotation Query Language (AQL), the development environment, and how to execute the code in Hadoop. I will show different usage examples, e.g. how to build a sentiment analysis solution.

Bio: Claus Samuelsen is a Technical Sales Professional working for IBM with big data solutions. Claus has worked on several Hadoop projects across Europe.


Additional information

RSVP to the meetup

Please RSVP to this meetup, since we need to put everybody on a guest list for entering the Spotify office. The event will be held in the cafeteria of the Spotify office, so don’t go to the normal entrance but to the 11th floor.

Pizza and drinks

Thanks to Spotify, pizza and beverages will be available for the participants during the meetup. This is another reason to RSVP if you plan to come: it helps us estimate the number of pizzas and drinks based on declared attendance.

The entrance

The door will be open between 17:45 and 18:15.* Because of fire regulations, we need to keep a list of everybody in the building, so please make sure that you get your name ticked off the list at the entrance or, in the case of a +1, that the person at the door adds your name to the list.

*Unfortunately, we cannot leave the door open all the time (company security policy), nor can we have someone constantly watching for guests arriving late. If you need to come later, please let us know in the comments below, so that somebody can come to the door and open it at an agreed time.

See you soon!