ParisDataGeeks April Second Time @Criteo


Details
The next edition of ParisDataGeeks is almost upon us. We would like to thank Criteo for welcoming us again!
This time we are going for a slower pace with only three presentations.
TL;DR
Ted Dunning - MapR : Anomaly Detection (English)
Guillaume Pitel - Exensa : Apache Spark : practical feedback from implementing a data analysis workflow (French)
Sofian Djamaa and Rémy Pecqueur - Criteo : Parquet, a column-oriented storage format for Hadoop: theory and practice (French)
Complete Abstracts :
1- Ted Dunning - MapR : Anomaly Detection
Ted Dunning has been involved with a number of startups with the latest being MapR Technologies where he is Chief Application Architect working on advanced Hadoop-related technologies. He is also a PMC member for the Apache Zookeeper and Mahout projects. Opinionated about software and data-mining and passionate about open source, he is an active participant of Hadoop and related communities and loves helping projects get going with new technologies.
The basic ideas of anomaly detection are simple: you build a model and you look for data points that don't match that model. Building a practical anomaly detection system, however, requires dealing with practical details, starting with algorithm selection and continuing through data flow architecture, anomaly alerting, user interfaces, and visualizations. We will describe the major classes of anomaly detection systems and show how to build anomaly detection systems for:
a) rate shifts to determine when events such as web traffic, purchases or process progress beacons shift rate
b) topic spotting to determine when new topics appear in a content stream such as Twitter
c) network flow anomalies to determine when systems with defined inputs and outputs act strangely.
While describing how to solve these problems, we will describe how clustering, dimensionality reduction, and density estimation can be used in systems that adapt and learn about their environment and how these systems can tell you when something has changed.
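To make the rate-shift case (a) concrete, here is a minimal sketch (not from the talk, and far simpler than a production detector) that models event counts per interval with a rolling mean and standard deviation and flags intervals whose z-score exceeds a threshold. The function name and parameters are illustrative only.

```python
import statistics

def rate_anomalies(counts, window=20, threshold=3.0):
    """Flag intervals whose event count deviates strongly from the
    recent baseline. This is a toy z-score test; real systems use
    richer models (e.g. Poisson rate models) as discussed in the talk."""
    anomalies = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mean = statistics.mean(baseline)
        # Guard against a zero-variance baseline.
        stdev = statistics.pstdev(baseline) or 1.0
        z = (counts[i] - mean) / stdev
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies

# Steady traffic of ~100 events per interval, then a sudden rate shift.
traffic = [100, 102, 99, 101, 100, 98, 103, 100, 99, 101,
           100, 102, 101, 99, 100, 101, 98, 100, 102, 100,
           100, 250]
print(rate_anomalies(traffic))  # [21] -- the spike is flagged
```

The same shape of detector applies to purchases or process progress beacons: only the counting step changes, not the model-and-compare loop.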
This talk will reprise the content of my Strata presentation, but will include extra material that shows how compression equals truth and how anomaly detection can make databases faster, among other sundry philosophical truths.
2 - Guillaume Pitel - Exensa : Apache Spark : practical feedback from implementing a data analysis workflow
Within a few months, we have rewritten the complete workflow of a data analysis engine: eXenGine. We will share our feedback on using Apache Spark to implement a proprietary matrix factorization method and to analyze Wikipedia's textual content, links, and metadata. The focus will be on the things we liked about Spark, as well as the little quirks and flaws we ran into while processing 50 GB of raw text on a small cluster.
3 - Sofian Djamaa and Rémy Pecqueur - Criteo will tell us about their recent contribution to the Parquet storage format for Hadoop.
Parquet, a column-oriented storage format for Hadoop: theory and practice
Criteo stores petabytes of data in its Hadoop cluster, analyzed daily by business and operations teams through tools such as Cascading and Hive. Since querying these petabytes can take several minutes, we wanted to improve performance in order to give the business faster answers. Until recently, the storage format in use was 100% RCFile. We will discuss the details of the migration to Parquet, backed by comparative figures on memory and CPU time consumption.
Parquet is a column-oriented file format for Hadoop developed by Cloudera and Twitter, with contributions from Criteo.
The performance and compression benefits of using a column-oriented format for storing and processing large volumes of data are well documented in the academic literature, as well as in several open-source and commercial solutions such as HBase and Vertica.
Parquet applies these principles to Hadoop storage: nested data structures (schemas), efficient column encoding and compression, and compatibility with a range of applications (MapReduce, Hive, Cascading...).
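As a toy illustration of why column-oriented layouts compress so well (a sketch of the general idea, not Parquet's actual on-disk format), dictionary encoding replaces a low-cardinality column's repeated values with small integer codes:

```python
def dictionary_encode(column):
    """Replace repeated column values with integer codes plus a
    dictionary -- one of the column encodings Parquet applies.
    Toy sketch only; Parquet's wire format differs."""
    dictionary = []
    codes = []
    index = {}
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

# A low-cardinality column (e.g. a country code per impression)
# becomes a tiny dictionary plus compact, highly compressible codes.
country = ["FR", "FR", "US", "FR", "DE", "US", "FR"]
dictionary, codes = dictionary_encode(country)
print(dictionary)  # ['FR', 'US', 'DE']
print(codes)       # [0, 0, 1, 0, 2, 1, 0]
```

Storing each column contiguously is what makes such encodings pay off: values in one column are far more self-similar than values across a whole row.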
We know it is frustrating for many of you who cannot come because of the limited space we have... so we have listened, and we are proud and happy to announce that we have partnered with InfoQ (http://www.infoq.com/fr/) to bring you full videos of the sessions, plus exclusive interviews, on the web!
Our partner dotScale - The Tech Conference to supersize your apps! - a world-class event held in Paris at the incredible "Théâtre de Paris", is offering DataGeeks participants a special 15% discount! Among others, you can listen to Paul Mockapetris (of DNS fame), Jeremy Edberg (Netflix), Matthew Ahrens (ZFS, OpenZFS), Thomas S. Hatch (SaltStack), and Mitchell Hashimoto (Vagrant, Packer & Serf). Go to http://dotscale.eu to register and use DATAGEEKS as your promotion code!
