(Easy) High performance text processing in Machine Learning

275 people went


This month we have Daniel Krasner presenting "(Easy) High performance text processing in Machine Learning".

Note: Pivotal will be hosting. However, drinks and snacks will not be provided, as our group is too large to sponsor them. Please respect the space and don't touch the fridges.


This talk covers rapid development of high-performance, scalable text processing solutions for tasks such as classification, semantic analysis, topic modeling, and general machine learning. We demonstrate how Python modules, and in particular the Rosetta Python library, can be used to process, clean, and tokenize large volumes of text data, extract features from it, and finally build statistical models. The Rosetta library focuses on creating small and simple modules (each with a command line interface) that use very little memory and are parallelized with the multiprocessing package. We will touch on LDA topic modeling and different implementations of it (Vowpal Wabbit and Gensim). The talk will be part presentation and part “real life” example tutorial.
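
To give a flavor of the approach described above, here is a minimal sketch of low-memory, parallel token counting that uses only the Python standard library. It is not Rosetta's API: the function names (tokenize, count_tokens, count_tokens_parallel), the token pattern, and the default pool settings are all assumptions made for illustration. Each file is read one line at a time, and the files are spread across worker processes with the multiprocessing package.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Assumed token definition for this sketch: runs of lowercase letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def count_tokens(path):
    """Count tokens in one file, reading line by line to keep memory use low."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            counts.update(tokenize(line))
    return counts


def count_tokens_parallel(paths, processes=2, chunksize=16):
    """Merge per-file token counts computed by a pool of worker processes."""
    total = Counter()
    with Pool(processes=processes) as pool:
        # imap_unordered yields results as workers finish, so the parent
        # merges each Counter as it arrives instead of building a list first.
        for counts in pool.imap_unordered(count_tokens, paths, chunksize=chunksize):
            total.update(counts)
    return total
```

In a script, count_tokens_parallel would typically be called with a list of file paths from inside an `if __name__ == "__main__":` guard, which the multiprocessing package needs on platforms that start workers by spawning a fresh interpreter. The resulting counts could then feed a feature extraction or topic modeling step.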


Daniel Krasner is a research scholar with the “Declassification Project” at Columbia University and the co-founder of KFit Solutions, a data science consulting firm. His current interests and work focus on high-performance statistical solutions in text and natural language processing. He is the co-creator of “Rosetta,” an open source Python text processing library. In addition, Daniel works with a number of hedge funds in the city, building financial modeling and decision support systems. Previously, Daniel was the chief data scientist at Sailthru, an email and behavioral analytics platform; a senior researcher at Johnson Research Labs; and a professor teaching Applied Data Science in the Columbia University statistics department. Prior to entering the world of data science, Daniel was a researcher at the Mathematical Sciences Research Institute in Berkeley and an assistant professor of mathematics at UCLA. He holds a PhD in mathematics from Columbia University.