(Easy) High performance text processing in Machine Learning

This month we have Daniel Krasner presenting "(Easy) High performance text processing in Machine Learning".

Note: Pivotal will be hosting. However, drinks and snacks are not provided as our group is just too large to sponsor that. So please respect the space and don't touch the fridges.


This talk covers rapid development of high performance scalable text processing solutions for tasks such as classification, semantic analysis, topic modeling and general machine learning. We demonstrate how Python modules, and in particular the Rosetta Python library, can be used to process, clean, tokenize, extract features, and finally build statistical models with large volumes of text data. The Rosetta library focuses on creating small and simple modules (each with command line interfaces) that use very little memory and are parallelized with the multiprocessing package. We will touch on LDA topic modeling and different implementations thereof (Vowpal Wabbit and Gensim). The talk will be part presentation and part “real life” example tutorial.
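As a rough illustration of the style described above (small, simple modules that use little memory and are parallelized with the multiprocessing package), here is a minimal sketch using only the Python standard library. It does not use Rosetta's actual API: the names `tokenize`, `count_document`, `stream_documents`, and `count_tokens` are hypothetical and chosen for this example, and the regex tokenizer is an assumed stand-in for real cleaning and tokenization.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Assumed tokenization rule for this sketch: runs of lowercase letters/digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def count_document(text):
    """Return token counts for a single document (runs inside a worker)."""
    return Counter(tokenize(text))


def stream_documents(lines):
    """Lazily yield one stripped document per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def count_tokens(documents, processes=1, chunksize=100):
    """Aggregate token counts over an iterable of documents.

    With processes=1 the work is done serially; otherwise documents are
    handed to a multiprocessing Pool in chunks.
    """
    total = Counter()
    if processes == 1:
        for counts in map(count_document, documents):
            total.update(counts)
        return total
    with Pool(processes=processes) as pool:
        for counts in pool.imap_unordered(count_document, documents, chunksize):
            total.update(counts)
    return total


if __name__ == "__main__":
    sample = ["Topic models find topics.", "", "Text models need clean text."]
    print(count_tokens(stream_documents(sample), processes=2).most_common(3))
```

Because `stream_documents` is a generator, the same functions can be pointed at an open file handle instead of a list, so the raw text is never read into memory all at once. The resulting counts are the kind of bag-of-words features that a topic model such as LDA takes as input.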


Daniel Krasner is a research scholar with the “Declassification Project” at Columbia University and the co-founder of KFit Solutions, a data science consulting firm. His current interests and work focus on high performance statistical solutions in text and natural language processing. He is the co-creator of “Rosetta,” an open source Python text processing library. In addition, Daniel continually works with a number of hedge funds in the city, building financial modeling and decision support systems. Previously, Daniel was the chief data scientist at Sailthru, an email and behavioral analytics platform, a senior researcher at Johnson Research Labs, and a professor teaching Applied Data Science in the Columbia University statistics department. Prior to entering the world of data science, Daniel Krasner was a researcher at the Mathematical Sciences Research Institute in Berkeley and an assistant professor of mathematics at UCLA. He holds a PhD in mathematics from Columbia University.


  • Jonathan Y

    Video from this meetup is here: http://www.hakkalabs.co/article...

    March 6

  • Nitin k.

    Excellent presentation in simple language. Can we have the slides and a link to the presentation, Paul....
    Thanks in advance.

    February 23

  • John Peter S.

    Good talk. Really good. I also will contact you for the slides. Thanks to Paul and Pivotal.

    1 · February 21

  • Salil N.

    Great talk! Does anybody have the contact info for the guy who announced the Shutterstock job opening?

    February 21

    • Jerry G.

      His name is Eliot Brenner. You should find his profile among the group here or on LinkedIn.

      February 21

    • Eliot

      Sorry I had to run after the talk! Feel free to send me a Meetup message if you have questions about the Shutterstock data scientist and data algorithm engineer full-time positions and/or internships!

      February 21

  • A former member

    Great talk, it was very clear. Would it be possible to download the slides?

    3 · February 21

  • Gary G.

    Very nice and understandable presentation with practical Python engineering advice and advanced thoughts on the use of -- and interfacing to -- a suitable LDA library.

    February 20

  • Niels B.

    Can't seem to find the Rosetta library... anyone have the link?

    February 20

  • Iordan S.

    Can't make it unfortunately. Releasing the seat.

    February 20

  • Paul D.

    Also, the nice people at http://confreaks.com/ are sponsoring captioning for accessibility for tonight. Go here for that: http://www.streamtext.net/playe...

    February 20

  • Paul D.

    To all of those who don't make it through the waiting list, Pivotal will be live streaming the event here: http://www.livestream.com/pivot...

    Load that page and a link should show up when it starts. I'll be kicking things off around 7:05 PM

    February 20

    • Daren

      Will it be saved for later viewing?

      February 20

    • Paul D.

      yes, it will be recorded

      February 20

  • H L.

    I can't make it tonight. I am giving up my seat.

    February 20

  • Aaron H.

    Remember to release your RSVPs if you can't make it!

    February 20

  • Farhan A.

    A work obligation came up and I won't be able to make it any more :(

    February 20

  • A former member

    Any updates on the wait list? I really want to attend the event

    February 20

  • Chakri

    Hi, is this session hands-on? What software is needed apart from Python? Is there anything on GitHub to get started?

    February 19

    • Daniel K.

      Hi Chakri, if time permits I'll run an IPython notebook demo, which you can grab later. It would be helpful to look at the Python Rosetta library beforehand.

      February 19

  • BigData L.

    Same here. Last time someone was recording in the Pivotal Labs I think. Hoping you can do the same and then post.

    February 19

  • shilpa

    Could I get a video of the event? I cannot attend because of the distance.

    February 19

  • James Q

    Will a video or slides be made available? Turns out I will not be able to make this one :(

    1 · February 17

    • Paul D.

      Yes, for this one we should have video

      5 · February 18

  • A former member

    I would love to attend this event since I am starting on using sentiment detection from news feeds

    February 16

  • Gene E.

    I do some text mining myself.

    February 8
