Data Workflows for Machine Learning

Main Talk: Data Workflows for Machine Learning 

Speaker: Paco Nathan

Abstract: 

We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks... leading up to a "scorecard" to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.

Speaker bio: 

Paco Nathan is a “player/coach” who has led innovative Data teams building large-scale apps for 10+ years and has worked as an OSS evangelist for the past 2+ years. He is an expert in distributed systems, machine learning, cloud computing, and functional programming, with a focus on Enterprise data workflows. Paco is an O'Reilly author and an advisor for several firms, including The Data Guild and Zettacap. Paco received his BS Math Sci and MS Comp Sci degrees from Stanford University, and has 30+ years of technology industry experience ranging from Bell Labs to early-stage start-ups.


Tentative Schedule: 

6:30-7:00 - socializing

7:00-8:00 - main talk

8:00-8:30 - socializing 


Special thanks: 

Climate Corporation for hosting!


  • Ryan S.

    Lots of great content, great venue. Was happy to attend. Thanks to organizers and to Paco!

    2 · April 15, 2014

  • David A.

    5 · April 12, 2014

  • Usman Q.

    Nice overview. However, the acoustics were bad and the speakers were turned off :-(

    April 10, 2014

    • Paco N.

      sorry about the audio, i'll try to bring a personal lavalier setup for next talks

      April 10, 2014

  • Michael D.

    Thanks Paco, Tony and David! This meetup exceeded my expectations, covering both open source ML workflow tools and a framework for evaluating them. Along the way, Paco also presented a well-sourced history of ML tools. Exceptional.

    2 · April 10, 2014

    • Paco N.

      Many thanks! I really enjoyed all the Q&A, great discussions at the meetup

      April 10, 2014

  • Frank C.

    Good talk, Paco. I was not aware of so many new tools. Great survey, learned a lot. Looks like MBrace is the winner if you are working with Spark and MLbase. Thanks!

    1 · April 9, 2014

    • Paco N.

      Many thanks Frank. MBrace is based on .NET though, so it won't be interoperating with those others, e.g., Spark. There is work afoot to allow for F# and Cascalog to share libraries... a bit complex, but lots to leverage.

      April 9, 2014

    • Frank C.

      Sorry, I have gotten the information wrong about MBrace. Thanks for the correction. Fortunately, we standardize on the .NET platform for deployment in our Company. We'll give it a try. Will review your scorecard again. Tkx!

      April 10, 2014

  • A former member

    Hope someone turns on the microphone!

    April 9, 2014

    • Paco N.

      Apologies if the audio was low, I wish I'd brought my lavalier mic

      April 9, 2014

  • Mike M.

    Great overview of technologies and workflows

    1 · April 9, 2014

  • e

    Will this be recorded? Will any slides or a repo be posted?

    April 9, 2014

    • Tony T.

      Hi Ed, this talk was recorded. We'll post it up as soon as we get the video. Also, Paco posted up slides.

      2 · April 9, 2014

  • Paco N.

    Many thanks for the opportunity to present. Slides are available at https://www.slideshare.net/pacoid/data-workflows-for-machine-learning-33341183

    2 · April 9, 2014

    • Paco N.

      should be public now, had to double check that

      2 · April 9, 2014

  • Frank C.

    Uploaded file has been marked private by the author. Please make it public for members? Thanks.

    1 · April 9, 2014

    • Paco N.

      Thanks for noting that -- I'd pushed it public after a private upload, but it seems to have reverted. It's public now, just tested from a different account.

      April 9, 2014

  • Frank C.

    Nice host. Great venue and view of SF Bay Area.

    April 9, 2014

  • Amir Y.

    Is any recording planned? Thx

    1 · April 8, 2014

    • Paco N.

      Hakka Labs will record

      3 · April 8, 2014

    • Tony T.

      it looks like Hakka Labs had some scheduling issues. We'll try to get this meetup recorded though.

      1 · April 9, 2014

  • Adam J. B.

    Any plans to compare Mahout against the other Machine Learning technologies mentioned in the synopsis? I'd especially be interested to learn how it stacks up against the Spark-centric machine learning package(s).

    1 · April 6, 2014

    • Adam J. B.

      Ah, didn't know about Mahout's new direction, that is exciting!

      In thinking about machine learning workflows, I'd also be skeptical about how Cascading/Pattern fits in. Sure, Pattern can implement a scoring algorithm if you give it a model already defined in PMML, but Cascading (at least to my knowledge) doesn't have the capacity to actually build the machine learning model by training over a massive amount of historical data. Isn't one of the first steps in a machine learning workflow to actually train the model?

      I'm not familiar with all the technologies mentioned, so am looking forward to this talk. I hope to learn about a machine learning tool that can both train the model and implement the scoring model in a distributed cloud environment. My experience thus far has usually been one or the other.

      April 8, 2014

    • Paco N.

      Definitely, please come to the talk. I have a hunch we'll have much to discuss :)

      1 · April 8, 2014

  • Dan B.

    I have had good results from scikit-learn and MADlib. I have them displayed here: www.spy611 dot com

    March 30, 2014

  • Tony T.

    All, please note that the event will now be held at the Climate Corporation.

    1 · March 28, 2014

  • Ted F.

    Hopefully the code used to produce the benchmarks will be provided.

    1 · March 27, 2014

    • Ted F.

      Ok. I would still appreciate seeing some code that was used to contrast the different frameworks. It would be beneficial to see how these frameworks are applied by a professional.

      1 · March 27, 2014

    • Paco N.

      Comparisons of sample code will attempt to address that, yes.

      March 28, 2014

  • Patrick S.

    Same day as "Apache Spark - Making Sense of Big Data Faster and Easier" in Palo Alto ....
    I was not given the gift of ubiquity.

    March 27, 2014

  • Phillip A.

    Will the "scorecard" be provided afterwards? I won't likely be able to make it.

    1 · March 27, 2014

    • Paco N.

      Certainly, I will post the slides right afterwards

      1 · March 27, 2014

  • Ramkumar R.

    Yes with one guest

    March 27, 2014

Our Sponsors

  • O'Reilly Strata

    20% off Strata registration with code "SFBAML20" http://oreil.ly/UGSJ15
