** PLEASE RSVP FOR THIS EVENT AT DATA SCIENCE DC (http://www.meetup.com/Data-Science-DC/events/83813992/) **
We're thrilled to be partnering with Data Science DC (http://www.meetup.com/Data-Science-DC) for this meetup!
Paco Nathan will be talking about his experiences working with and deploying enterprise-scale predictive systems. Cascading (http://www.cascading.org/), the open-source application framework that Paco specializes in, is a wrapper around Hadoop.
Notes: We'll meet at newBrandAnalytics for this event, and we'll be trying a new themed networking topic. Please come early to meet and chat with people interested in, or perhaps immersed in, startup businesses with an analytics or data science focus, and continue the conversation afterwards at Data Drinks.
6:30pm -- Networking and Refreshments (Discussion theme: Startups)
7:00pm -- Introduction
7:15pm -- Paco's presentation and Q&A
8:30pm -- Post-presentation conversations
8:45pm -- Adjourn for Data Drinks (Happy Hour Prices & Our own floor @ Science Club DC, 19th btwn L&M)
Abstract:
Cascading is an open source project which provides an abstraction layer on top of Hadoop and other compute frameworks for Big Data apps. The API provides workflow orchestration for defining complex apps, and is particularly well-suited for Enterprise IT. Large deployments run at Twitter, Etsy, Climate Corp, Trulia, AirBnB, and many other firms, based on the Java API or alternatively using DSLs in Scala (Scalding) and Clojure (Cascalog), as well as other JVM-based languages.
This talk will review some of the speaker's experiences leading Data teams for large-scale deployments of predictive analytics, and how those lessons have led to the trade-offs and best practices which we use in Cascading. We will discuss use cases and architectural patterns for large MapReduce workflows where robustness and predictability are high priorities. We will also review a sample recommender application (on GitHub) based on government Open Data.
Paco Nathan is a Data Scientist at Concurrent (http://www.concurrentinc.com/) in SF and a committer on the Cascading.org (http://www.cascading.org/) open source project. He has expertise in Hadoop, R, AWS, machine learning, predictive analytics, and NLP -- with 25+ years in the tech industry overall, in a range of Enterprise and Consumer Internet firms. For the past 10 years Paco has led innovative Data teams, deploying Big Data apps based on Cascading, Hadoop, HBase, Hive, Lucene, Redis, and related technologies.