For our April Meetup, we are excited to bring you an event themed around Big Data Week! We have two presenters talking about their work with very different, very large text data sets. First, Ben Bengfort from UMBC and Full Stack Data Science will be talking about how to use Python's NLTK and Hadoop Streaming to make sense of large text corpora. Then, Tom Rindflesch from the National Library of Medicine will talk about his group's work building a system to help medical researchers keep up with the flood of current and historical articles published on PubMed.
- Please check out the other events around Big Data Week DC, and follow the #bdw13 hashtag on Twitter!
- We're very happy to have the new DC NLP Meetup cross-listing this event! Welcome to folks coming from DC NLP! Members of DSDC interested in Natural Language Processing should definitely consider joining DC NLP.
- We're back at GWU for this event.
- 6:30pm -- Networking and Refreshments
- 7:00pm -- Introduction
- 7:15pm -- Presentations and discussion
- 8:30pm -- Post presentation conversations
- 8:45pm -- Adjourn for Data Drinks (Tonic, 22nd & G St., space reserved!)
Natural Language Processing of Big Data using NLTK and Hadoop Streaming
Many of the largest and most difficult-to-process data sets we encounter in big data work are not well-structured log data or database rows, but unstructured bodies of text. In recent years, Natural Language Processing techniques have accelerated our ability to stochastically mine data from unstructured text, and those techniques themselves require large training data sets to produce meaningful results. At the same time, the growth of distributed computational architectures and file systems has allowed data scientists to handle large volumes of data; clearly there is common ground where combining the two can achieve spectacular results. The two most popular open source tools for NLP and distributed computing, the Natural Language Toolkit and Apache Hadoop, are written in different languages: Python and Java. We will discuss how to integrate them using Hadoop's Streaming interface, which sends data to and receives data from mapper and reducer scripts via the standard file descriptors.
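To give a flavor of the streaming pattern the talk covers, here is a minimal, hypothetical word-count sketch: the mapper reads raw text from stdin and emits tab-separated (token, count) pairs, and the reducer sums the counts for each token, relying on Hadoop's sort phase to group equal keys on consecutive lines. The script name, flags, and the naive `split()` tokenizer are illustrative only; a real job would likely swap in `nltk.word_tokenize` and NLTK's normalization tools.

```python
import sys

def mapper(lines):
    """Emit one 'token<TAB>1' pair per token.
    Hadoop Streaming pipes raw input text to stdin and collects
    these pairs from stdout. A real NLP job would replace the
    naive split() with nltk.word_tokenize()."""
    for line in lines:
        for token in line.strip().lower().split():
            yield "%s\t%d" % (token, 1)

def reducer(pairs):
    """Sum counts per token. Hadoop sorts mapper output by key
    before the reduce phase, so equal tokens arrive consecutively."""
    current, total = None, 0
    for pair in pairs:
        token, count = pair.rstrip("\n").rsplit("\t", 1)
        if token == current:
            total += int(count)
        else:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = token, int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__":
    # Hypothetical invocation (paths and jar name are placeholders):
    #   hadoop jar hadoop-streaming.jar \
    #     -mapper "python wordcount.py map" \
    #     -reducer "python wordcount.py reduce" \
    #     -input corpus/ -output counts/
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    stream = mapper(sys.stdin) if stage == "map" else reducer(sys.stdin)
    for out in stream:
        print(out)
```

Because both stages communicate only through stdin/stdout, the same script can be tested locally with a shell pipeline (`cat corpus.txt | python wordcount.py map | sort | python wordcount.py reduce`) before it is ever submitted to a cluster.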
Semantic MEDLINE: An Advanced Information Management Application for Biomedicine
Semantic MEDLINE integrates information retrieval, advanced natural language processing, automatic summarization, and visualization into a single Web portal. The application is intended to help manage the results of PubMed searches by condensing core semantic content in the citations retrieved. Output is presented as a connected graph of semantic relations, with links to the original MEDLINE citations. The ability to connect salient information across documents helps users keep up with the research literature and discover connections which might otherwise go unnoticed. Semantic MEDLINE can make an impact on biomedicine by supporting scientific discovery and the timely translation of insights from basic research into advances in clinical practice and patient care.
Benjamin Bengfort is a Data Science consultant at Full Stack Data Science, and has used Machine Learning and Natural Language Processing techniques to determine textual complexity in large literary corpora. He is a PhD candidate in Computer Science, with a focus on NLP, at the University of Maryland, Baltimore County, and holds an MS in Computer Science from North Dakota State University.
Please follow Ben on Twitter at @bbengfort!
Thomas Rindflesch has a Ph.D. in linguistics from the University of Minnesota and conducts research in natural language processing at the Lister Hill National Center for Biomedical Communications at the National Library of Medicine. He leads a research group that focuses on developing semantic interpretation of biomedical text and exploiting the results in innovative informatics methodology for clinical practice and basic research. Recent efforts concentrate on supporting literature-based discovery.