Time: Saturday 23.
The idea is to do hands-on programming with big data analysis tools such as Hadoop and Scala, including Scalding.
However, we will focus on the algorithmic side of map-reduce rather than on the nitty-gritty of Hadoop and Scalding. We'll look at algorithms starting from word count and simple statistical analysis (mean, standard deviation, ...), then graph algorithms and social network analysis, and machine learning.
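To give a feel for the "simple statistical analysis" mentioned above, here is a minimal sketch of mean and standard deviation written in a map-reduce style. It is plain Java with no Hadoop dependency, so it runs without a cluster. The class and method names (StatsSketch, Partial, summarize) are made up for this illustration and are not part of the lab material. Each value is mapped to a small partial aggregate (count, sum, sum of squares), and partial aggregates are merged in the reduce step.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: mean and standard deviation in a map-reduce style (hypothetical names). */
class StatsSketch {

    /** Partial aggregate produced by the map step and merged by the reduce step. */
    static final class Partial {
        final long count;
        final double sum;
        final double sumSq;

        Partial(long count, double sum, double sumSq) {
            this.count = count;
            this.sum = sum;
            this.sumSq = sumSq;
        }

        /** Reduce step: associative and commutative, so it could also act as a combiner. */
        Partial merge(Partial other) {
            return new Partial(count + other.count, sum + other.sum, sumSq + other.sumSq);
        }

        double mean() {
            return sum / count;
        }

        /** Population standard deviation, computed as sqrt(E[x^2] - E[x]^2). */
        double stdDev() {
            double m = mean();
            // Clamp at zero to guard against tiny negative values from rounding.
            return Math.sqrt(Math.max(0.0, sumSq / count - m * m));
        }
    }

    /** Map step: one input value becomes one partial aggregate. */
    static Partial map(double x) {
        return new Partial(1, x, x * x);
    }

    /** Runs map then reduce over an in-memory list, standing in for one data partition. */
    static Partial summarize(List<Double> values) {
        Partial acc = new Partial(0, 0.0, 0.0);
        for (double x : values) {
            acc = acc.merge(map(x));
        }
        return acc;
    }

    public static void main(String[] args) {
        Partial p = summarize(Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0));
        System.out.println("mean=" + p.mean() + " stdDev=" + p.stdDev()); // mean=5.0 stdDev=2.0
    }
}
```

Because merge only adds three numbers, partitions can be summarized independently and combined in any order. Note that the sum-of-squares formula can lose precision on large or badly scaled data; it is used here only because it is the simplest to express as a reduce.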
Note that we are unlikely to have experts in those fields; we will just have some facilitators, so don't expect anything very precise. Bear with us on this. The idea is really to try to implement some algorithms using map-reduce, however imprecise the result is.
By the end of the lab, participants will have coded non-trivial map-reduce algorithms. Non-trivial means iterative map-reduce algorithms, or non-trivial problems such as classification and clustering from machine learning.
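As an illustration of what "iterative map-reduce" can look like, here is a minimal sketch of k-means on one-dimensional points. It is plain Java with no Hadoop, and the names (KMeansSketch, iterate, run) are hypothetical. Each round is one map-reduce pass: the map step assigns a point to its nearest centroid, the reduce step averages the points of each cluster, and a driver loop repeats the pass until the centroids stop changing.

```java
import java.util.Arrays;

/** Sketch: 1-D k-means as repeated map-reduce rounds (hypothetical names). */
class KMeansSketch {

    /** Map step: key of a point = index of its nearest centroid (ties go to the lower index). */
    static int nearest(double x, double[] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (Math.abs(x - centroids[i]) < Math.abs(x - centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    /** One map-reduce round: assign points to clusters, then average each cluster. */
    static double[] iterate(double[] points, double[] centroids) {
        double[] sum = new double[centroids.length];
        int[] count = new int[centroids.length];
        for (double x : points) {
            int c = nearest(x, centroids); // map: emit (c, x)
            sum[c] += x;                   // reduce: accumulate per key
            count[c]++;
        }
        double[] next = new double[centroids.length];
        for (int i = 0; i < next.length; i++) {
            // An empty cluster keeps its previous centroid (a simplifying assumption).
            next[i] = count[i] == 0 ? centroids[i] : sum[i] / count[i];
        }
        return next;
    }

    /** Driver: repeat rounds until the centroids no longer change or maxRounds is reached. */
    static double[] run(double[] points, double[] initial, int maxRounds) {
        double[] current = initial.clone();
        for (int round = 0; round < maxRounds; round++) {
            double[] next = iterate(points, current);
            if (Arrays.equals(next, current)) {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] result = run(points, new double[] {0, 5}, 10);
        System.out.println(Arrays.toString(result)); // [2.0, 11.0]
    }
}
```

On a real cluster each call to iterate would be a separate job, with the current centroids shipped to every mapper; the loop in run plays the role of the driver.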
PROGRAMMING LANGUAGES & LIBRARY
- We're going to code in Scala and Java. A typical application would have the driver written in Scala, while the mapper and reducer are still written in Java.
- The Hadoop distribution used is Cloudera CDH4.
- We may also play with Scalding or Scoobi (Scala APIs for Hadoop). You can just take the latest version of those.
Five algorithms are targeted, but the list is not exhaustive: if you want to go beyond it during the session, feel free to go ahead.
- Word count, for warming up. This can be expanded to bigram count or n-gram count.
- Naive Bayes.
- Logistic regression.
- K-means clustering and beyond.
- Shortest path / triangle calculation.
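For the warm-up item in the list above, here is a minimal sketch of word count and its n-gram extension. It is plain Java, so it can be tried before any Hadoop installation; the names (WordCountSketch, mapNGrams, count) and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions made for this illustration. The map step emits one key per n-gram of a line and the reduce step sums the occurrences of each key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: word count and n-gram count in a map-reduce style (hypothetical names). */
class WordCountSketch {

    /** Splits a line into lowercase tokens, dropping punctuation and empty strings. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        for (String t : line.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Map step: emits one key per n-gram in the line (n >= 1 is assumed; n = 1 is word count). */
    static List<String> mapNGrams(String line, int n) {
        List<String> tokens = tokenize(line);
        List<String> keys = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            StringBuilder key = new StringBuilder(tokens.get(i));
            for (int j = 1; j < n; j++) {
                key.append(' ').append(tokens.get(i + j));
            }
            keys.add(key.toString());
        }
        return keys;
    }

    /** Reduce step: sums the occurrences of each key across all lines. */
    static Map<String, Integer> count(List<String> lines, int n) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            for (String key : mapNGrams(line, n)) {
                Integer old = counts.get(key);
                counts.put(key, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "To be is to do");
        System.out.println(count(lines, 1)); // {be=3, do=1, is=1, not=1, or=1, to=4}
        System.out.println(count(lines, 2).get("to be")); // 3
    }
}
```

In this sketch n-grams never span two lines, because each line is mapped independently, which mirrors how a mapper sees one record at a time.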
Each problem will be introduced in a presentation. Each presentation goes into enough detail for participants to translate it into code.
- Basic Linux shell knowledge.
- Basic Java knowledge.
- macOS/Linux laptop with JVM 6 and ssh installed.
- 2GB RAM, wifi network card.
- A few GB of free disk space.
Windows support is possible, but macOS/Linux is better. If you plan to use Windows (nobody is perfect!), you need 4GB RAM and must be ready to install VirtualBox to run a Linux VM.
We will have some small data sets to play with for some algorithms, and bigger data sets to get a feel for how the algorithms run on more realistic data.
Free of charge, of course.
We're looking for sponsors for the pizzas. If we don't manage to find sponsors, we will need to order pizzas and share the cost (to be settled on the day).
We will have one Amazon EC2 cluster shared by everybody. If you want to run your own Amazon EC2 cluster, you're very welcome to.
Here is the confirmed schedule:
09.30 - 10.00 Opening
10.00 - 11.00 Map Reduce Refresher and 1st Algorithms: Word count and Beyond (Paul de Schacht)
11.00 - 12.30 Installation & implementation of word count.
12.30 - 13.45 Pizzas.
13.45 - 14.45 Classification Algorithms (Mario Pastorelli)
15.00 - 16.00 Clustering (Nicolas Maillot)
16.05 - 16.25 Just Enough Scala to Survive (Tobo Atchou)
16.00 - 20.00 Machine Learning coding + pizza starting at 18.30
20.00 - 20.45 Introduction to Map Reduce Graph Programming (shortest path and minimum spanning tree) (Anwar Rizal)
20.45 - 21.00 Conclusion (Anwar Rizal).
21.00 - 22.00 Bootstrapping graph algorithms, free, home, coffee, finishing pizzas ...