Natural Language Processing with R


Details
Agenda:
6:00pm - 6:30pm: Pizza, drinks, and socializing.
6:30pm - 7:45pm: Presentation, questions & answers.
7:45pm - 8:00pm: Discussion.
Speaker Bio:
Dr. Mark Jack (https://www.linkedin.com/in/mark-jack-9445908/) is an experienced data scientist and Associate Professor of Physics at Florida A&M University with several years of experience in computational modeling in particle physics, neuroscience, nanoscience, and high-performance computing. He is a certified trainer in machine learning and statistical programming in R. He has spoken at several data science conferences, including the Global Big Data Conferences in Tampa, FL and Atlanta, GA.
Abstract:
The creation of a corpus of documents from three text data files relies mostly on the 'quanteda' library in R. It allows one to quickly tokenize the corpus of documents, removing text features such as punctuation, numbers, and extra whitespace and converting words to lowercase. The processing time for the complete text data is considerable, so a corpus is created for only a sample of the documents. Unigrams, bigrams, trigrams, and quadgrams are generated via quanteda's document-feature matrix (dfm) format. A dfm allows for quick and easy analysis of the most frequently occurring n-grams. Additional Kneser-Ney smoothing enhances prediction accuracy by adjusting n-gram probabilities to account for missing (unseen) n-grams in the corpus.
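As an illustration of this pipeline, the sketch below shows how a sampled corpus can be tokenized and turned into n-gram counts with 'quanteda'. The file names and the 1% sampling rate are assumptions for illustration only, not the speaker's exact code.

```r
# Minimal sketch: build a sampled corpus and n-gram counts with quanteda.
library(quanteda)

# Read the three text files (hypothetical paths) and keep a 1% sample of lines.
lines <- c(readLines("en_US.twitter.txt", skipNul = TRUE),
           readLines("en_US.blogs.txt",   skipNul = TRUE),
           readLines("en_US.news.txt",    skipNul = TRUE))
set.seed(42)
sample_lines <- sample(lines, length(lines) * 0.01)

corp <- corpus(sample_lines)

# Tokenize: strip punctuation and numbers, then lowercase the tokens.
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Document-feature matrices for unigrams up to 4-grams,
# reduced to sorted n-gram frequency vectors.
ngram_counts <- lapply(1:4, function(n) {
  dfm_n <- dfm(tokens_ngrams(toks, n = n, concatenator = " "))
  sort(colSums(dfm_n), decreasing = TRUE)
})
```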
I will describe a word prediction algorithm that was deployed as a web application as a capstone project in natural language processing. A table of sorted continuation probabilities had to be computed from n-grams (unigrams, bigrams, trigrams, and 4-grams) derived from a corpus of three different document sources (online Twitter, blog, and news data). Estimates had to be obtained from a smaller sample (1% of the total corpus) to remain manageable within the available computer memory. Libraries such as 'tm' and 'quanteda' were used to create dfm objects and thus compute the n-gram statistics. The most challenging part was to properly estimate continuation probabilities for each n-gram, applying the smoothing techniques needed to account for n-grams not present in the corpus. The challenge was to find a strategy for manipulating strings while reducing the compute time needed to estimate probabilities by scanning for string occurrences across the corpus. The problem was solved by creating a lookup table of n-grams (unigrams up to 4-grams) with their probabilities and their unigram, bigram, and/or trigram substructures.
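A minimal sketch of that lookup-table idea follows, assuming the named count vectors from the previous snippet; the function and column names (make_lookup, prefix, nextword) are hypothetical, and the probabilities shown are unsmoothed maximum-likelihood estimates before any Kneser-Ney adjustment.

```r
# Split each n-gram into its (n-1)-gram prefix and its last word, then store
# raw counts and prefix counts so probabilities come from a join rather than
# rescanning the corpus for string occurrences.
library(data.table)

make_lookup <- function(counts) {            # counts: named vector from a dfm
  dt <- data.table(ngram = names(counts), count = as.numeric(counts))
  dt[, prefix   := sub(" \\S+$", "", ngram)] # everything before the last word
  dt[, nextword := sub("^.* ",   "", ngram)] # the last word
  dt[, prefix_count := sum(count), by = prefix]
  dt[, prob := count / prefix_count]         # unsmoothed MLE; smoothing comes later
  setorder(dt, prefix, -prob)
  dt
}

# One lookup table per n-gram order (bigrams, trigrams, 4-grams).
lookups <- lapply(ngram_counts[2:4], make_lookup)
```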
To enhance prediction accuracy, the probabilities were adjusted using a technique called Kneser-Ney smoothing. The tables had to be stored as data frames totaling less than 1 GB so that the app on shinyapps.io could predict the three most likely words continuing an n-gram and respond quickly and seamlessly to a phrase submitted on the website. The presentation will demonstrate the individual tools in R used to arrive at the final web app, including the R code and R Markdown scripts developed for the data analysis and prediction algorithm, a short executive summary as an RPres presentation on the public repository rpubs.org, and finally a Shiny app deployed on the web at shinyapps.io.
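The prediction step itself can be sketched as a simple back-off over the lookup tables from the previous snippet, returning the three most probable continuations. The function below is an illustrative outline only; the deployed app additionally applies the Kneser-Ney adjustments described above rather than the raw probabilities.

```r
# Sketch of back-off prediction over the lookup tables built above:
# lookups[[1]] = bigram table, lookups[[2]] = trigram table, lookups[[3]] = 4-gram table.
library(data.table)

predict_next <- function(phrase, lookups) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  for (k in rev(seq_along(lookups))) {        # try the longest prefix (3 words) first
    if (length(words) >= k) {
      pre  <- paste(tail(words, k), collapse = " ")
      hits <- lookups[[k]][prefix == pre]     # tables are sorted by -prob within prefix
      if (nrow(hits) > 0) return(head(hits$nextword, 3))
    }
  }
  character(0)                                # no matching prefix in any table
}

# Example: the three most likely continuations of a submitted phrase.
# predict_next("thanks for the", lookups)
```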
