This month we have Jonathan Stray presenting "Natural Language Processing for Investigative Journalism."
Journalists frequently have far too many documents to read manually, whether it's a 10,000-page response to a Freedom of Information request or 250,000 leaked diplomatic cables. We've spent the last three years applying NLP and visualization techniques to this problem, building a system called Overview, which has now been used by journalists all over the world. In this talk I'll show you exactly how Overview's language processing pipeline works.
I'll also talk about how we decided which algorithms to use and how to present the results to the user. Topic modeling is a powerful technique, but all such algorithms are derived by optimizing for statistical properties, not for fitness to end-user tasks. We developed Overview through extensive collaboration with journalists and careful user testing, and the experience has taught us a great deal about the problem of making NLP results interpretable to users.
Since Overview is open source, you can leverage our work to build your own user-friendly NLP applications with our plugin API.
Jonathan Stray is a computer scientist and a journalist. He is a Fellow at the Tow Center for Digital Journalism at Columbia University, where he teaches computational journalism. He leads the Overview Project, an open source visualization system to help investigative journalists make sense of very large document sets. He has worked as an editor at the Associated Press, a freelance reporter in Hong Kong, and an algorithm designer for Adobe Systems.