Natural Language Processing with Python/NLTK Workshop

Hosted by Seattle Data Geeks


Benjamin Bengfort, Head Faculty for District Datalabs and co-author of the upcoming O'Reilly publication Data Analytics with Hadoop: An Introduction for Data Scientists, is coming to speak at Data Day Seattle. We asked him, while he is in town, if he would take a day to offer his Natural Language Processing with Python Workshop. Fortunately, he agreed. Don't miss this opportunity!

Full details and registration at:


Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world: unstructured data that by its very nature carries latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we'll explore how to do that with Python and the Natural Language Toolkit (NLTK).

NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.


In this course we will begin by exploring NLTK through the corpora that ship with it, which gives a feel for the features and functionality NLTK offers. This will take up the first part of the course. However, most NLP practitioners want to work on their own corpora, so during the second half of the course we will focus on building a language-aware data product from a specific corpus: a topic identification and document clustering algorithm built from a web crawl of blog sites. The clustering algorithm will use a simple Lesk K-Means clustering to start, and will then be improved with an LDA analysis.


The course is made up of the following one-hour modules.

Part One: Using NLTK

Introduction to NLTK: code + resources = magic
The counting of things: concordances, frequency distributions, tokenization
Tagging and parsing: PoS tagging, NERC, syntactic parsing
Classifying text: sentiment analysis, document classification
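To give a flavor of the "counting of things" module, here is a minimal, dependency-free sketch of tokenization, a frequency distribution, and a concordance. It is an illustration only, not the workshop's material: the sample text and the `tokenize` and `concordance` helpers are assumptions made up for this example. In the workshop itself, NLTK tools such as `FreqDist` (a subclass of Python's `Counter`) would presumably play this role.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for this illustration
TEXT = (
    "The quick brown fox jumps over the lazy dog. "
    "The dog barks, and the fox runs away."
)


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def concordance(tokens, target, width=2):
    """Return each occurrence of `target` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok.upper()] + right))
    return lines


tokens = tokenize(TEXT)
freq = Counter(tokens)  # frequency distribution: token -> count

print(freq.most_common(3))
for line in concordance(tokens, "fox"):
    print(line)
```

The concordance shows every occurrence of a word in its immediate context, which is often the quickest way to see how a term is actually used in a corpus.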

Part Two: Building an NLP Data Product

Using the NLTK API to wrap a custom corpus
Word vectors for K-Means clustering
LDA for topic analysis
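As a rough illustration of the clustering step in Part Two, the sketch below groups a few toy documents with K-Means over simple term-count vectors. Everything here is an assumption made for the example: the documents, the bag-of-words vectorization, and the fixed starting centroids are invented, and the workshop may build its word vectors differently. NLTK also ships a `KMeansClusterer` in `nltk.cluster`; this version uses only the standard library so it runs anywhere.

```python
import math
from collections import Counter

# Hypothetical toy documents, invented for this illustration
DOCS = [
    "python code python library",
    "library code python",
    "recipe kitchen tomato",
    "kitchen tomato recipe bbq",
]


def vectorize(docs):
    """Turn each document into a term-count vector over a shared, sorted vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([float(counts[word]) for word in vocab])
    return vocab, vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, centroids, iterations=10):
    """Lloyd's algorithm: assign each vector to its nearest centroid, then re-average."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda k: distance(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == k]
            if members:  # keep the previous centroid if a cluster ends up empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


vocab, vectors = vectorize(DOCS)
# Seed with the first and third documents so the run is deterministic
labels = kmeans(vectors, [vectors[0], vectors[2]])
print(labels)
```

Seeding the centroids by hand keeps the example reproducible; in practice the starting points are usually chosen at random and the run repeated several times.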

Notably not mentioned: morphology, n-gram language models, search, raw text preprocessing, word sense disambiguation, pronoun resolution, language generation, machine translation, textual entailment, question and answer systems, summarization, etc.

After taking this workshop, students will be able to create a Python module that wraps their own corpora and begin to leverage NLTK tools against it. They will also have an understanding of the features and functionality of NLTK, and a working knowledge of how to architect applications that use NLP. Finally, students who complete this course will have built an information extraction system that performs topic analyses on a corpus of documents.
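To make "a Python module that wraps their own corpora" a little more concrete, here is a small standard-library sketch of a corpus wrapper over a directory of text files. The class name `TextDirectoryCorpus` and the sample files are hypothetical. The method names `fileids`, `raw`, and `words` mirror those on NLTK's corpus readers (for example `PlaintextCorpusReader`), which is what the workshop would presumably use rather than a hand-rolled class like this one.

```python
import os
import re
import tempfile


class TextDirectoryCorpus:
    """Hypothetical minimal corpus wrapper over a directory of .txt files."""

    def __init__(self, root):
        self.root = root

    def fileids(self):
        """List the text files in the corpus, in sorted order."""
        return sorted(f for f in os.listdir(self.root) if f.endswith(".txt"))

    def raw(self, fileid):
        """Return the unprocessed contents of one file."""
        with open(os.path.join(self.root, fileid), encoding="utf-8") as handle:
            return handle.read()

    def words(self, fileid=None):
        """Return word and punctuation tokens for one file, or for the whole corpus."""
        ids = [fileid] if fileid else self.fileids()
        tokens = []
        for fid in ids:
            tokens.extend(re.findall(r"\w+|[^\w\s]", self.raw(fid)))
        return tokens


# Usage: build a throwaway two-file corpus and read it back
with tempfile.TemporaryDirectory() as root:
    samples = [("a.txt", "Hello, world!"), ("b.txt", "Blogs about data.")]
    for name, text in samples:
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    corpus = TextDirectoryCorpus(root)
    file_ids = corpus.fileids()
    hello_words = corpus.words("a.txt")
    all_words = corpus.words()

print(file_ids, hello_words, len(all_words))
```

Keeping file access behind a small interface like this means downstream code (counting, tagging, clustering) does not need to know where the documents live on disk.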


This is an intermediate Python course as well as an intermediate Data Science course. Students are expected to have more than a beginner's knowledge of both Python and software development, as well as of the analytical and mathematical techniques used in Data Science. In particular, students are required to have the following knowledge and preparation before the course:

Python installed on their system
Knowledge of how to write and execute Python programs
Understanding of how to use the command line
NLTK installed along with all corpora and NLTK Data
Knowledge of the English language (adjectives, verbs, nouns, etc.)
Basic probability and statistical knowledge


Benjamin Bengfort is a Data Scientist who lives inside the beltway but ignores politics (the normal business of DC), favoring technology instead. He is currently working to finish his PhD at the University of Maryland, where he studies machine learning and distributed computing. His focus is on highly consistent local distributed storage and visual diagnostics for data modeling. The lab next door does have robots and, much to his chagrin, they seem to constantly arm said robots with knives and tools, presumably to pursue culinary accolades. Having seen a robot attempt to slice a tomato, Benjamin prefers his own adventures in the kitchen, where he specializes in fusion French and Guyanese cuisine as well as BBQ of all types. A professional programmer by trade and a Data Scientist by vocation, Benjamin writes on a diverse range of subjects, from Natural Language Processing to Data Science with Python to analytics with Hadoop and Spark.
