Supervised text classification is hampered by the need to acquire expensive labeled training sets. Algorithms similar to Word2Vec can create vector representations of documents on which a model can be trained successfully. In this talk, we review prior art on continuous word embedding representations and discuss recent advances in applying these techniques in a semi-supervised context, which drastically improve results and reduce the need for active learning intervention in sparse contexts. We conclude with a discussion of how these results are affected by recent advances in using deep learning to train recursive neural tensor networks for sentence-level text embedding in continuous vector spaces.
With over a decade of experience in Data Science and Education, Mike serves as Chief Science Officer and Chief Learning Officer for Galvanize, overseeing all Data Science consulting services as well as curriculum product development and instruction. Mike helped found and launch the accredited GalvanizeU-UNH Master's program, which focuses on developing the skills required of high-performing Data Scientists in industry. He has led several teams of Data Scientists in the Bay Area as Chief Data Scientist for InterTrust and as Director of Data Sciences for the Sears Holding Company and MetaScale. Mike began his career in academia, serving as a mathematics teaching fellow at Columbia University before teaching at the University of Pittsburgh. His early research focused on developing the epsilon-anchor methodology to address two problems: an inconsistency he highlighted in the dynamics of Einstein’s general relativity theory, and the convergence of “large N” Monte Carlo simulations in Statistical Mechanics’ universality models of critical phenomena.