Joint Meetup with Statistical Programming DC and Data Wranglers DC


Please RSVP at the Statistical Programming DC event page:

Our July 2014 meetup will be a joint event co-hosted by Statistical Programming DC and Data Wranglers DC. All three groups share a common interest in NLP. We will be talking about the nuts and bolts of natural language processing using both Python/NLTK and R.

Since we are expecting a large crowd for this three-way joint event, we'll be meeting at a special venue: Funger Hall at George Washington University.


A Smattering of NLP in Python

Back in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.

This talk will cover a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. These components will then be assembled to build a very basic document summarization program -- live on stage! Source code for all of the examples in this presentation will be available on GitHub.
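For readers who want a feel for how frequency analysis can drive a very basic summarizer, here is a minimal sketch. It is not the speaker's code and it deliberately avoids NLTK so that it runs with only the Python standard library: the regex-based sentence splitter, the regex tokenizer, and the tiny `STOPWORDS` set are all simplifying assumptions standing in for the more robust components a library like NLTK provides. The function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real toolkits ship fuller ones).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on",
             "for", "with", "that", "this", "are", "was", "as"}


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n=2):
    """Return the n highest-scoring sentences, kept in their original order.

    A sentence's score is the average document-wide frequency of its tokens.
    """
    sentences = split_sentences(text)
    freqs = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freqs[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


if __name__ == "__main__":
    sample = "Cats chase mice. Cats like cats. Dogs bark."
    print(summarize(sample, n=1))
```

Averaging the token frequencies, rather than summing them, keeps long sentences from winning purely on length. Stemming or lemmatization would slot in inside `tokenize`, so that forms such as "cat" and "cats" count as the same term.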

About the Speaker: Charlie Greenbacker is Director of Data Science at Altamira Technologies Corporation and co-organizer of the DC NLP meetup group.


NLP basics in R

R is known to be a powerful language for statistics, but it also has functionality for many NLP tasks and language models. R has established frameworks for working with text, constructing document-term matrices, and applying other common linguistic methods. However, R does have two limitations: it requires all data in the workspace to be held in RAM, and its abilities can be weak outside of statistical applications. Tommy will give a brief overview of the packages and frameworks available in R that are oriented towards NLP. He will then give more detailed examples and demonstrations of how to construct a document-term matrix efficiently and then apply R's true muscle, statistical analysis, to the constructed data.
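As a rough illustration of what a document-term matrix is, the sketch below builds one with one row per document and one column per vocabulary term. It is written in Python with only the standard library, not in R, and it is not taken from the talk: the function name `document_term_matrix` and the regex tokenizer are assumptions, and the dense list-of-lists layout is chosen for readability rather than efficiency.

```python
import re
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    # One Counter of lowercase alphabetic tokens per document.
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    # The vocabulary is the sorted union of every term seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing terms, so absent words fill in as zeros.
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


if __name__ == "__main__":
    vocab, dtm = document_term_matrix(["R is fun", "Python is fun fun"])
    print(vocab)
    for row in dtm:
        print(row)
```

Most entries in a real document-term matrix are zero, which is why practical implementations tend to use a sparse representation instead of the dense rows shown here. That matters especially when all data has to fit in RAM.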

About the Speaker: Tommy Jones is a Research Associate Statistician at the Institute for Defense Analyses - Science and Technology Policy Institute.

