I'm a huge fan of Daniel Lemire's work on compressed bitmaps, sorted sequences of integers, and other data structures. I'd say half the papers I have on my desk at work right now have his name on them. I've invited him to speak at PWLMTL on his own work, rather than on a paper he loved, because I love his papers: I find them to be practical and accessible to the working developer in a way that few academic works manage to be.
He's going to speak about compressed bitmaps, in particular the Roaring bitmap approach he first published in this paper (http://r-libre.teluq.ca/602/1/RoaringBitmap.pdf) (cited below).
The bitmap, especially in compressed form, is a very versatile data structure despite its apparent simplicity, and I invite all developers to consider it for many purposes.
Just recently, I used compressed bitmaps in an application which needed to represent large sets of IP addresses efficiently. In the last few years, I've used them in the implementation of a database system, an IMAP server, and a pubsub messaging system. (Not to mention their ubiquity, uncompressed, in allocators of all sorts.)
AdGear will be hosting us again for this meetup, and beer and pizza will be provided; feel free to come around 6 to eat and chat before the talk!
An abstract for the talk follows:
Better bitmap performance with Roaring bitmaps
Bitmaps are used to implement fast set operations in software. They are frequently found in databases and search engines.
Without compression, bitmaps scale poorly, so they are often compressed. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). For example, Oracle relies on BBC bitmap compression while the version control system Git and Apache Hive rely on EWAH compression.
We can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has been adopted by several production platforms (e.g., Apache Lucene/Solr/Elastic, Apache Spark, eBay's Apache Kylin and Metamarkets' Druid).
Overall, our implementation of Roaring can be several times faster (up to two orders of magnitude) than the implementations of traditional RLE-based alternatives (WAH, Concise, EWAH) while compressing better. We review the design choices and optimizations that make these good results possible.
One interesting feature of this work is that it has been out in the open from the start. Roaring bitmaps is the result of a far ranging collaborative effort. We believed that it played a role in its success.
If time allows, ongoing and future development efforts will be discussed.
About the speaker
Daniel Lemire is a computer science professor at the Université du Québec. He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including more than 25 journal articles. He has held competitive research grants for the last 15 years. He has been an expert on several committees with funding agencies (NSERC and FQRNT). He has served as program committee member on leading computer science conferences (e.g., ACM CIKM, ACM WSDM, ACM SIGIR, ACM RecSys). His research interests include databases, information retrieval and high-performance programming. He blogs regularly on computer science at http://lemire.me/blog/.
Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin, Better bitmap performance with Roaring bitmaps, Software: Practice and Experience (to appear) http://onlinelibrary.wiley.com/doi/10.1002/spe.2325/abstract (available from the publisher under open access)
Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser, Consistently faster and smaller compressed bitmaps with Roaring (working paper)