Building A Gigaword Corpus: Data Ingestion, Management, and Processing for NLP with Python

Abstract: Building A Gigaword Corpus: Data Ingestion, Management, and Processing for NLP with Python

As the applications we build are increasingly driven by text, handling data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs, the lows, and the new Python libraries we built to ingest and preprocess text for machine learning.

In the talk, I'll explain why building your own corpus is important for constructing language-aware data products and describe the problems you're likely to encounter when doing so. You'll hear about some of the mistakes we made and the lessons we learned along the way, and find out how to use Python packages and best practices to support your own custom corpus ingestion, management, loading, and preprocessing.
The talk is geared towards application developers who want to integrate text analytics features into their software, and Python programmers who have tinkered with NLP and machine learning and are interested in leveraging these tools with a custom corpus.

Bio: Dr. Rebecca Bilbro
Dr. Rebecca Bilbro is Lead Data Scientist at Bytecubed, where she builds data solutions for government and commercial clients using open source machine learning tools. Previously, she served as a data scientist at the Department of Commerce and the Department of Labor. Her PhD at the University of Illinois, Urbana-Champaign, and subsequent research with District Data Labs led to the development of the Yellowbrick Project, which is a new Python library for visual machine learning diagnostics. She is also coauthor of the forthcoming Applied Text Analytics with Python.

GWU Data Science Program

