Details

We're back in action for an April Orlando Data Science Meetup!

Mastering Map Reduce

Speaker: Scott Crespo

Today, awesome tools like Hive, Pig, and Spark take care of 99% of your parallel processing tasks by making the parallelism of the problem mostly invisible to the developer.

Nonetheless, the remaining 1% of problems are the most fun! They also tend to be the most critical components in a system, where performance and control are key.

Scott will cover various MR design patterns, common optimizations, and sweet data structures to boot. The examples are written in Python and Java, but anyone with a basic knowledge of programming and Hadoop will be able to follow along.

Important Updates

1. Introduction from Susan Scrupski of Big Mountain Data

Susan has been very active in using data for social change, and she'll be giving an introduction to her organization, Big Mountain Data, and its latest initiatives. http://www.bigmountaindata.com/

2. Map Reduce Code Examples Available!

We'll be focusing on how to create custom data types with Hadoop Map Reduce, which can result in sizable performance gains and more reusable code. As an example, I've implemented a custom Trigram Counter in Java and a prototype in Python. We'll be going over the code in detail during the presentation.
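To give a flavor of what a custom data type looks like, here is a minimal Python sketch. It is not the code from the repository: the `Trigram` class and its `serialize`/`deserialize` methods are hypothetical names chosen for illustration, loosely analogous to the `write` and `readFields` methods a Hadoop `Writable` implements in Java. It assumes words contain no whitespace.

```python
from typing import NamedTuple


class Trigram(NamedTuple):
    """Hypothetical composite key: three consecutive words."""
    first: str
    second: str
    third: str

    def serialize(self) -> str:
        # Write the key as one line of text (assumes words have no spaces).
        return " ".join(self)

    @classmethod
    def deserialize(cls, line: str) -> "Trigram":
        # Rebuild the key from the text form produced by serialize().
        first, second, third = line.split()
        return cls(first, second, third)


key = Trigram("the", "quick", "brown")
# Round trip: the key survives being written out and read back.
assert Trigram.deserialize(key.serialize()) == key
```

Because a `NamedTuple` compares field by field, keys of this type also sort in a predictable order, which is the property a framework relies on when grouping records by key.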

You can get the code on my GitHub: http://www.github.com/scottcrespo/ngrams

The code currently works properly on my local machine running Hadoop 2.6.0 and Java 1.6. Leading up to the presentation, I'll be continuing to add documentation/annotations so it's easier to follow.

Trigrams (a kind of NGram) are a central component of statistical natural language processing. If you're not familiar with Trigrams and NGrams, check out this link: http://en.wikipedia.org/wiki/N-gram
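For a quick feel of the problem, here is a minimal, single-process Python sketch of trigram counting written as a map step and a reduce step. It is an illustration only, not the code from the repository: the function names are hypothetical, and it assumes words are simply whitespace-separated and lowercased.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Gram = Tuple[str, str, str]


def map_trigrams(line: str) -> Iterator[Tuple[Gram, int]]:
    """Map step: emit (trigram, 1) for every three consecutive words."""
    words = line.lower().split()
    for i in range(len(words) - 2):
        yield (words[i], words[i + 1], words[i + 2]), 1


def reduce_counts(pairs: Iterable[Tuple[Gram, int]]) -> Dict[Gram, int]:
    """Group by trigram and sum the counts (shuffle and reduce in one pass)."""
    totals: Dict[Gram, int] = defaultdict(int)
    for gram, count in pairs:
        totals[gram] += count
    return dict(totals)


def count_trigrams(lines: Iterable[str]) -> Dict[Gram, int]:
    """Run the map step over every line, then reduce all emitted pairs."""
    return reduce_counts(pair for line in lines for pair in map_trigrams(line))


counts = count_trigrams(["the cat sat on the mat", "the cat sat down"])
print(counts[("the", "cat", "sat")])  # prints 2
```

On a real cluster the map and reduce steps run on different machines and the framework does the grouping between them; here a dictionary stands in for that shuffle.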
