Large Scale Image Classification and Apache Spark for applied machine learning


Details
Ready for the third meetup in our series? This time we are gathering in Club Dauphine, which is kindly offered to us by eBay Classifieds Group / Marktplaats (https://www.marktplaats.nl/i/help/over-marktplaats/werk-bij.dot), who will also be sponsoring our food and beverages for the evening.
We will have two talks again. We are still in the process of confirming the first talk, but right now it looks like it could be an introduction to using Apache Spark (http://spark.apache.org/) for applied machine learning. For the second talk, we have confirmed Thomas Mensink, who works on machine learning and computer vision at the University of Amsterdam.
Agenda
• 18.00: Arrive, socialise, have a drink and eat
• 18.50: Short introduction by your humble organizers
• 19.00: Talk 1, by yours truly (i.e. Friso), organiser @ The Amsterdam Applied Machine Learning meetup group
Using Apache Spark for applied machine learning and other data tasks
The open source Apache Hadoop stack, including its MapReduce batch processing framework, has over the past few years more or less become the de facto standard for large-scale data processing in commercial organisations. One of the drawbacks of this solution is that it is most efficient for single-pass batch processing, because a MapReduce program synchronises to disk one or more times. For many machine learning and other data-driven applications this is a major performance bottleneck, as many algorithms in applied machine learning require multiple passes over the same data. The Apache Spark project aims to address this problem by using aggregate cluster memory to store datasets, which allows repeated passes over the data without re-reading it from disk and makes it better suited to iterative algorithms and exploratory analysis.
In this talk we'll take a look at Apache Spark for exploratory analysis using Python and IPython notebook integration, as well as at implementing an iterative machine learning algorithm using the native Scala API for Apache Spark. (Note: I went beyond the built-in MLlib and implemented something from scratch.)
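To make the "multiple passes over the same data" point concrete, here is a minimal sketch in plain Python. It is not Spark code and not the algorithm from the talk: the toy dataset, the function names (`load_dataset`, `gradient`, `fit`) and the choice of gradient descent on a straight-line fit are all assumptions made for illustration. The dataset is loaded once, kept in memory, and then scanned in full on every iteration, which is the access pattern that in-memory datasets are meant to serve.

```python
# Plain-Python stand-in for an iterative algorithm; no Spark involved.
# All names and the toy data below are hypothetical.

def load_dataset():
    # Toy data: ten points on the line y = 2x + 1.
    return [(float(x), 2.0 * x + 1.0) for x in range(10)]

def gradient(data, w, b):
    # One full pass over the data: gradient of the mean squared error.
    n = len(data)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in data) / n
    return gw, gb

def fit(data, iterations=3000, learning_rate=0.01):
    w, b = 0.0, 0.0
    passes = 0
    for _ in range(iterations):
        gw, gb = gradient(data, w, b)  # scans the whole dataset again
        passes += 1
        w -= learning_rate * gw
        b -= learning_rate * gb
    return w, b, passes

data = load_dataset()          # read once, then reused on every pass
w, b, passes = fit(data)
print(round(w, 2), round(b, 2), passes)
```

If every one of those passes had to go back to disk, as in a chain of MapReduce jobs, the cost of reading the data would be paid thousands of times rather than once.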
• 19.45: short break
• 20.00: Talk 2, by Thomas Mensink, postdoctoral researcher at the University of Amsterdam
Large Scale Image Classification and Generalizing to New Classes
In this talk I'll present recent research on large-scale image classification and on how to learn classifiers for new classes at negligible cost.
First, I'll give a brief overview of the Fisher Vector (FV) image representation. The FV framework can be seen as a generalization of the popular Bag-of-Visual-Words approach that takes into account more statistics about the distribution of the local descriptors in the image. This representation has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization.
Second, I'll discuss distance-based classifiers, such as k-Nearest Neighbours (k-NN) and Nearest Class Mean (NCM), since these methods can incorporate new classes and training images continuously over time at negligible cost. This is not possible with the popular one-vs-rest SVM approach, but is essential when dealing with real-life open-ended datasets. For the NCM classifier, which assigns an image to the class with the closest mean, we introduce a new metric learning approach based on multi-class logistic discrimination. During training we enforce that an image from a class is closer to its own class mean than to any other class mean in the projected space. Experiments on the ImageNet 2010 challenge dataset, which contains over 1 million training images across a thousand classes, show that, surprisingly, the NCM classifier compares favorably to the non-linear k-NN classifier. Moreover, the NCM performance is comparable to that of linear SVMs, which obtain current state-of-the-art performance. Experimentally we also study the generalization performance on classes that were not used to learn the metric, and obtain surprisingly good results.
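As a rough illustration of why a Nearest Class Mean classifier can take on new classes so cheaply, here is a minimal plain-Python sketch. It uses ordinary squared Euclidean distance on raw feature vectors, not the learned metric described in the abstract, and the class name `NearestClassMean`, its methods and the toy two-dimensional features are all hypothetical. Adding a class only means storing one more running sum and count; nothing is retrained.

```python
# Minimal NCM sketch with plain Euclidean distance (no metric learning).
# Class, method names and data are hypothetical, for illustration only.

class NearestClassMean:
    def __init__(self):
        self.sums = {}    # label -> per-dimension sum of feature vectors
        self.counts = {}  # label -> number of examples seen

    def add_example(self, label, features):
        # Works for known and brand-new labels alike: update a running sum.
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
            self.counts[label] = 0
        self.sums[label] = [s + f for s, f in zip(self.sums[label], features)]
        self.counts[label] += 1

    def mean(self, label):
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def predict(self, features):
        # Assign the class whose mean is closest to the query vector.
        def squared_distance(label):
            return sum((f - m) ** 2 for f, m in zip(features, self.mean(label)))
        return min(self.sums, key=squared_distance)

ncm = NearestClassMean()
training = [("cat", [1.0, 0.0]), ("cat", [0.8, 0.2]),
            ("dog", [0.0, 1.0]), ("dog", [0.2, 0.8])]
for label, features in training:
    ncm.add_example(label, features)

query = [5.0, 4.0]
before = ncm.predict(query)          # only "cat" and "dog" are known
ncm.add_example("bird", [5.0, 4.2])  # new class, no retraining step
after = ncm.predict(query)
print(before, after)
```

A one-vs-rest SVM would need a new classifier trained against all existing data to achieve the same thing, which is the contrast the abstract draws.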
• 20.45: more drinks and social talks
• 21.30 or whenever the bar closes: everybody out! (out of the room, that is; the bar in Dauphine itself will be happy to serve you)
