We will be talking about the nuts and bolts of natural language processing using both Python/NLTK and R. Charlie Greenbacker (Altamira Technologies, DC NLP) and Tommy Jones (IDA) will lead the Python and R bits, respectively. Abstracts are below.
This will be a joint meetup hosted by SPDC and co-sponsored by DC NLP (http://www.meetup.com/DC-NLP/) and Data Wranglers DC (http://www.meetup.com/Data-Wranglers-DC/), since all three groups share an interest in NLP. Thanks to Charlie and Robert for agreeing to this.
A Smattering of NLP in Python
Back in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.
This talk will cover a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. These components will then be assembled to build a very basic document summarization program -- live on stage! Source code for all of the examples in this presentation will be available on GitHub.
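To give a feel for how frequency analysis can drive a very basic summarizer, here is a minimal sketch. It is not the speaker's code and deliberately uses only the Python standard library rather than NLTK, so the tiny stopword list, the naive sentence splitter, and the function names (split_sentences, tokenize, summarize) are all illustrative assumptions. The idea: count how often each non-stopword appears, score each sentence by the summed counts of its words, and keep the top-scoring sentences in their original order.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real toolkits ship fuller ones).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "on", "for", "are", "with"}


def split_sentences(text):
    """Naive sentence segmentation: split on whitespace that follows ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text, n=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # A sentence scores the sum of the document-wide counts of its words.
        return sum(freqs[w] for w in tokenize(sentence))

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]),
                 reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    doc = "Cats chase mice. Dogs chase cats and mice. Birds sing."
    print(summarize(doc, n=1))
```

Summing raw counts favors long sentences; a more careful version might normalize by sentence length or apply stemming before counting, which is where a library like NLTK would come in.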
About the Speaker: Charlie Greenbacker (https://twitter.com/greenbacker) is Director of Data Science at Altamira Technologies Corporation (https://www.altamiracorp.com/) and co-organizer of the DC NLP meetup group (http://dcnlp.org/).
NLP basics in R
R is known to be a powerful language for statistics, but it also has functionality for many NLP tasks and language models. R has established frameworks for working with text, constructing document-term matrices, and applying other common linguistic methods. However, R does have two limitations: it requires all data in the workspace to be held in RAM, and its abilities can be weak outside of statistical applications. Tommy will give a brief overview of the R packages and frameworks that are oriented towards NLP. He will then give more detailed examples and demonstrations of how to construct a document-term matrix efficiently and then leverage R's true muscle, statistical analysis, on the constructed data.
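For readers who have not met a document-term matrix before, here is a rough sketch of the data structure. The talk itself works in R; this example is in Python only to keep the examples in this announcement in one language, and it is not the speaker's method. The function names (build_dtm, to_dense) and the simple lowercase alphabetic tokenizer are assumptions. Each row is a document, each column is a vocabulary term, and each cell is a count. Because most cells are zero, rows are stored sparsely as {column index: count}, which matters when everything has to fit in RAM.

```python
import re
from collections import Counter


def build_dtm(docs):
    """Build a sparse document-term matrix.

    Returns (vocab, rows): vocab is a sorted list of terms, and rows[i]
    maps column index -> count for document i (zero cells are omitted).
    """
    # Per-document term counts, using a simple lowercase alphabetic tokenizer.
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    # The vocabulary is the sorted union of all terms seen in any document.
    vocab = sorted(set().union(*counts))
    index = {term: j for j, term in enumerate(vocab)}
    rows = [{index[term]: c for term, c in cnt.items()} for cnt in counts]
    return vocab, rows


def to_dense(vocab, rows):
    """Expand the sparse rows into a full list-of-lists matrix."""
    return [[row.get(j, 0) for j in range(len(vocab))] for row in rows]


if __name__ == "__main__":
    vocab, rows = build_dtm(["the cat sat", "the cat saw the dog"])
    print(vocab)
    for dense_row in to_dense(vocab, rows):
        print(dense_row)
```

Once the matrix exists, statistical methods can be applied to it directly, for example summing columns to get corpus-wide term frequencies.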
About the Speaker: Tommy Jones (https://twitter.com/thos_jones) is Research Associate Statistician at the Institute for Defense Analyses - Science and Technology Policy Institute (https://www.ida.org/stpi.php).