For our October Meetup, we're thrilled to have Paco Nathan talking about his experiences working with and deploying enterprise-scale predictive systems. Cascading (http://www.cascading.org/), the open-source application framework that Paco specializes in, is a wrapper around Hadoop, so we're also excited to be partnering with Hadoop DC (http://www.meetup.com/Hadoop-DC/) this month!
Notes: We're back at newBrandAnalytics for this event! We'll also be continuing our experiment with informal, themed pre-event networking. Please come early to meet and chat with people interested in, or perhaps immersed in, startup businesses with an analytics or data science focus, and continue the conversation afterwards at Data Drinks.
6:30pm -- Networking and Refreshments (Discussion theme: Startups)
7:00pm -- Introduction
7:15pm -- Paco's presentation and Q&A
8:30pm -- Post-presentation conversations
8:45pm -- Adjourn for Data Drinks (Happy Hour prices and our own floor at Science Club DC, 19th between L and M)

Abstract:
Cascading is an open source project which provides an abstraction layer on top of Hadoop and other compute frameworks for Big Data apps. The API provides workflow orchestration for defining complex apps, and is particularly well-suited for Enterprise IT. Large deployments run at Twitter, Etsy, Climate Corp, Trulia, AirBnB, and many other firms, based on the Java API or alternatively using DSLs in Scala (Scalding) and Clojure (Cascalog), as well as other JVM-based languages.
This talk will review some of the speaker's experiences leading Data teams for large-scale deployments of predictive analytics, and how the lessons learned have shaped the trade-offs and best practices used in Cascading. We will discuss use cases and architectural patterns for large MapReduce workflows where robustness and predictability are high priorities. We will also review a sample recommender application (on GitHub) based on government Open Data.
Paco Nathan is a Data Scientist at Concurrent (http://www.concurrentinc.com/) in SF and a committer on the Cascading.org (http://www.cascading.org/) open source project. He has expertise in Hadoop, R, AWS, machine learning, predictive analytics, and NLP -- with 25+ years in the tech industry overall, in a range of Enterprise and Consumer Internet firms. For the past 10 years Paco has led innovative Data teams, deploying Big Data apps based on Cascading, Hadoop, HBase, Hive, Lucene, Redis, and related technologies.