This month, we're going to try something different for our inaugural meeting at the permanent Seattle Twitter office: a full-length talk by Paco Nathan:
Title: Data Workflows for Machine Learning
A variety of tools and frameworks for large-scale data workflows have emerged, and they are having a substantial impact on machine learning practices in industry. On the one hand, ML work can be integrated more readily into a wide range of other frameworks, and migrated across environments. An example case is to train a model in SAS on a data sample, then export the model as PMML to be run at scale on a Hadoop cluster (sans license fees) using Cascading/Pattern. Other great examples include: KNIME (R, Weka, Eclipse, Hadoop, Actian, etc.); ADAPA from Zementis in Amazon AWS workflows; and, in the Python stack, an ecosystem of Augustus, scikit-learn, Pandas, IPython, etc. In the emerging category there is Spark/MLbase, and also Julia with a variety of integrations. Spark and Scala integrations become quite interesting in the broader context of Summingbird and Algebird -- indicating how some notions of workflow could be generalized. This talk compares and contrasts these different workflow approaches, along with some perspectives on use cases and indications, plus where they appear to be heading.
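To make the "train on a sample locally, score at scale elsewhere" pattern concrete, here is a minimal sketch using the Python-stack tools the abstract names (Pandas and scikit-learn). The data and column names are synthetic stand-ins, not from the talk; the export step is only noted in a comment, since the exact PMML tooling (e.g., Augustus) varies.

```python
# Hedged sketch: fit a model on a small Pandas data sample with scikit-learn.
# In the workflow described above, the trained model would then be exported
# (e.g., as PMML) so a separate system such as Cascading/Pattern on a Hadoop
# cluster could score the full dataset without the training environment.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic sample standing in for a data extract; columns are hypothetical.
df = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.05],
    "feature_b": [1.0, 0.2, 0.6, 0.1, 0.3, 0.9, 0.4, 0.8],
    "label":     [0,   0,   0,   1,   1,   0,   1,   0],
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"],
    test_size=0.25, random_state=42,
)

# Train on the sample; this is the part that stays in the analyst's environment.
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("holdout accuracy:", accuracy)
```

The key design point the abstract highlights is that the model artifact, not the training code, is what moves between environments.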
About the speaker:
Paco Nathan is an O'Reilly author ("Enterprise Data Workflows with Cascading") and an advisor for The Data Guild in Palo Alto, CA. He was formerly a lead dev on the "Pattern" open source project for PMML scoring in Cascading, and teaches "Intro to Machine Learning" and "Intro to Data Science" courses based on R, Python, Scala, etc.