We're thrilled to have Jeff Larson, Data Editor at ProPublica, presenting On the resemblance and containment of documents (http://gatekeeper.dec.com/ftp/pub/dec/SRC/publications/broder/positano-final-wpnums.pdf) by Andrei Z. Broder.
Increasingly, journalists are dealing with ever larger document dumps, and to find interesting stories in these troves they have to cluster the documents to separate the wheat from the chaff. The size of these dumps often means that traditional algorithms are either too complex and slow, or rely on a priori constants such as the number of clusters to search for.
Jeff Larson will present a novel algorithm called minhashing, which was invented at AltaVista to loosely cluster similar documents. The paper "On the resemblance and containment of documents" relies on hash collisions to create document fingerprints and shows that documents can be clustered in linear time without knowledge of the entire document corpus. This algorithm has been a key tool in some of ProPublica's biggest investigations, and has allowed reporters to shine light on topics such as political astroturfing and international money laundering.
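For readers who want a feel for the idea before the talk, here is a minimal Python sketch of minhash fingerprints. It is an illustration under stated assumptions, not the paper's exact construction or ProPublica's code: it uses 3-word shingles, and it approximates random permutations with seeded BLAKE2b hashes. The function names (`shingles`, `minhash_signature`, `estimated_resemblance`) and the parameters `k` and `num_hashes` are hypothetical choices made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed stands in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=128, k=3):
    """Fingerprint a document: for each seed, keep the smallest shingle hash.

    Only this one document is needed, so fingerprints can be computed
    independently of the rest of the corpus.
    """
    doc_shingles = shingles(text, k)
    if not doc_shingles:
        raise ValueError("document has no shingles")
    return [
        min(_seeded_hash(s, seed) for s in doc_shingles)
        for seed in range(num_hashes)
    ]


def estimated_resemblance(sig_a, sig_b):
    """Fraction of positions where two signatures collide.

    This estimates the Jaccard similarity of the underlying shingle sets.
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_resemblance(text_a, text_b, k=3):
    """Exact Jaccard similarity of the shingle sets, for comparison."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    return len(a & b) / len(a | b)
```

Two documents with identical text produce identical signatures, so `estimated_resemblance` returns 1.0 for them. For near-duplicates, the estimate should land close to `exact_resemblance`, and it tightens as `num_hashes` grows. Grouping documents whose signatures collide in many positions is one way to get the loose clustering described above.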
Jeff Larson (@thejefflarson (https://twitter.com/thejefflarson)) is ProPublica's data editor and winner of the 2011 Livingston Award for the series "Redistricting: How Powerful Interests are Drawing You Out of a Vote (http://www.propublica.org/series/redistricting)." He was on the team reporting on the Snowden files in 2013 with the Guardian and the New York Times, and was the lead reporter behind the NSA stories "The NSA’s Secret Campaign to Crack, Undermine Internet Security (http://www.propublica.org/article/the-nsas-secret-campaign-to-crack-undermine-internet-encryption)" and "Spy Agencies Probe Angry Birds and Other Apps for Personal Data (http://www.propublica.org/article/spy-agencies-probe-angry-birds-and-other-apps-for-personal-data)."
Doors open at 7 pm; the presentation will begin at 7:30 pm; and, yes, there will be beer, water, and pizza.
After Jeff presents the paper, we will open up the floor to discussion and questions.
We hope that you'll read the paper before the meetup, but don't stress if you can't. If you have any questions, thoughts, or related information, please visit our *GitHub thread (https://github.com/papers-we-love/papers-we-love/issues/197)* on the matter.
Additionally, if you have any papers you want to add to the repository above (papers that you love!), please send us a pull request (https://github.com/papers-we-love/papers-we-love/pulls). Also, if you have any ideas/questions about this meetup or the Papers-We-Love org, just open up an issue.
December's meetup is sponsored by KISSPatent (http://kisspatent.com/) and The Ladders (https://www.theladders.com/) (@TheLaddersDev (https://twitter.com/theladdersdev)).