Computational Linguistics, Machine Learning, and Text Mining with Groovy (aka Java++)
This talk is a brief introduction to computational linguistics by way of some projects I've worked on over the last year. They demonstrate Information Extraction, Lexical Analysis (the linguistics kind rather than the compiler kind), an Ensemble Method for Statistical Parsing, and Corpus Construction (which includes parsing some English extracted from Javadoc). In each case I'll give a very brief description of the motivating problem and then dive into code that deals with some part of it. We'll see how Groovy makes the most of Java for text processing, simple web browser UIs, and cluster computing. This talk should be of interest both to those curious about what goes on in NLP and to those who would simply like to get some of their work done faster by using more powerful tools.
Technologies we'll see at work (all of which are Open Source Software):
Stanford CoreNLP http://nlp.stanford.edu/software/corenlp.shtml
GATE (General Architecture for Text Engineering) http://gate.ac.uk
MALLET (MAchine Learning for LanguagE Toolkit) http://mallet.cs.umass.edu
DELPH-IN PET http://moin.delph-in.net/PetTop
ERG (English Resource Grammar) http://erg.delph-in.net/logon
Jim White is a computational linguist with over 30 years of experience building computer systems. Prior to focusing on Natural Language Processing (NLP), he worked at the software, firmware, hardware, and system architecture levels in development tools, embedded and portable devices, networking, and graphics. He is an Open Source Software advocate and Groovy committer, and he created the open source projects Groovy for OpenOffice and IFCX Wings. He is currently working on a thesis for the Master of Science in Computational Linguistics (CLMS) at the University of Washington and was the instructor for the program's Computational Linguistics Fundamentals course this year.