For our first Meetup, DC NLP is proud to join our friends at Data Science DC to cross-list this free event! To participate, go to the Data Science DC page, join the group if you're not already a member, and RSVP for the event there.
For our April Meetup, we are excited to bring you an event themed around Big Data Week! We have two presenters talking about their work with very different, very large text data sets. First, Ben Bengfort from Unbound Concepts will be talking about how they use Python's NLTK and Hadoop Streaming to make sense of their corpus of children's literature. Then, Tom Rindflesch from the National Library of Medicine will talk about his group's work building a system to help medical researchers keep up with the flood of current and historical articles published on PubMed.
- Please check out the other Big Data Week DC events, and follow the #bdw13dc hashtag on Twitter!
- We're back at GWU for this event.
- 6:30pm -- Networking and Refreshments
- 7:00pm -- Introduction
- 7:15pm -- Presentations and discussion
- 8:30pm -- Post-presentation conversations
- 8:45pm -- Adjourn for Data Drinks (location TBA)
Natural Language Processing of Big Data using NLTK and Hadoop Streaming
Many of the largest and most difficult-to-process data sets that we encounter in big data processing are not well-structured log data or database row values, but rather unstructured bodies of text. In recent years, Natural Language Processing techniques have accelerated our ability to stochastically mine data from unstructured text, and they in fact require large training data sets themselves to produce meaningful results. Simultaneously, the growth of distributed computational architectures and file systems has allowed data scientists to deal with large volumes of data; clearly there is common ground that can allow us to achieve spectacular results. The two most popular open source tools for NLP and distributed computing, the Natural Language Toolkit and Apache Hadoop respectively, are written in different languages: Python and Java. We will discuss a methodology for integrating them using Hadoop’s Streaming interface, which passes data to and from mapper and reducer scripts via the standard file descriptors (standard input and standard output).
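For readers new to the Streaming interface, the idea in the abstract above can be illustrated with a minimal sketch. This is not the presenter's code: the function names, the `map`/`reduce` command-line convention, and the regular-expression tokenizer are all assumptions made for illustration (a real job would more likely call an NLTK tokenizer in the mapper). The mapper reads raw text lines and writes tab-separated key/value pairs; the reducer reads those pairs sorted by key and sums the counts.

```python
import io
import re
import sys
from itertools import groupby

# Simplified stand-in tokenizer (assumption): lowercase alphabetic tokens only.
TOKEN_RE = re.compile(r"[a-z']+")


def mapper(lines, out):
    """Write one 'token<TAB>1' line per token found in the input lines."""
    for line in lines:
        for token in TOKEN_RE.findall(line.lower()):
            out.write(f"{token}\t1\n")


def reducer(lines, out):
    """Sum the counts for each token; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for token, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        out.write(f"{token}\t{total}\n")


def run_locally(text):
    """Simulate map, sort, and reduce in one process for a quick local check."""
    mapped = io.StringIO()
    mapper(io.StringIO(text), mapped)
    shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    reducer(shuffled, reduced)
    return reduced.getvalue()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: the first argument selects the stage.
    stage = mapper if sys.argv[1] == "map" else reducer
    stage(sys.stdin, sys.stdout)
```

Because each stage only reads standard input and writes standard output, the script can be tried locally with `run_locally("some text")` before going near a cluster. On a cluster, the same script would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options, and Hadoop would take care of the sort between the two stages.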
Semantic MEDLINE: An Advanced Information Management Application for Biomedicine
Semantic MEDLINE integrates information retrieval, advanced natural language processing, automatic summarization, and visualization into a single Web portal. The application is intended to help manage the results of PubMed searches by condensing core semantic content in the citations retrieved. Output is presented as a connected graph of semantic relations, with links to the original MEDLINE citations. The ability to connect salient information across documents helps users keep up with the research literature and discover connections which might otherwise go unnoticed. Semantic MEDLINE can make an impact on biomedicine by supporting scientific discovery and the timely translation of insights from basic research into advances in clinical practice and patient care.
Benjamin Bengfort is the CTO of Unbound Concepts, Inc., an EdTech company that creates individualized education outcomes for students learning to read by matching readers to educational content at the correct reading level for them. He makes use of a large corpus of children’s books to determine textual complexity via Machine Learning and Natural Language Processing techniques. He is currently in the middle of his PhD in Computer Science with a focus on NLP at the University of Maryland, Baltimore County, and has an MS in Computer Science from North Dakota State University.
Please follow Ben on Twitter at @bbengfort!
Thomas Rindflesch has a Ph.D. in linguistics from the University of Minnesota and conducts research in natural language processing in the Lister Hill Center for Biomedical Communications at the National Library of Medicine. He leads a research group that focuses on developing semantic interpretation of biomedical text and exploiting results in innovative informatics methodology for clinical practice and basic research. Recent efforts concentrate on supporting literature-based discovery.