Rephil: Extracting Concepts from Text


Details
This talk will describe Rephil, a system used throughout Google to identify the concepts or topics that underlie a given piece of text. Rephil determines, for example, that "apple pie" falls under some of the same topics as "chocolate cake", but has little in common with "apple ipod". The concepts used by Rephil are not pre-specified; instead, they are derived by an unsupervised learning algorithm running on massive amounts of text. The result of this learning process is a Rephil model -- a giant Bayesian network with concepts as nodes. I will discuss the structure of Rephil models, the distributed machine learning algorithm that we use to build these models from terabytes of data, and the Bayesian network inference algorithm that we use to identify concepts in new texts under tight time constraints. I will also discuss how Rephil relates to ongoing academic research on probabilistic topic models.
