Past Meetup

Enterprise Data Workflows with Cascading and Windows Azure HDInsight


54 people went

Microsoft San Francisco (in Westfield Mall where Powell meets Market Street)

835 Market Street Golden Gate Rooms - 7th Floor · San Francisco, CA

How to find us

We are in the same building as Westfield Mall; ask the security guard to let you up to the 7th floor.



Cascading is an open source API for building Enterprise data workflows at scale, integrating Apache Hadoop with other data frameworks. Its pattern language for data pipelines provides a foundation for popular DSLs based in functional programming languages, such as Cascalog and Scalding. The HDInsight team has been working with Cascading and Scalding apps on Azure, and the recent release of Hortonworks HDP for Windows provides Hadoop on Windows Made Easy. Concurrent, the team behind Cascading, has partnered with both Microsoft and Hortonworks to help bring a powerful abstraction layer for Enterprise data workflows to the convenience and versatility of HDInsight and HDP.

Recent work has also added ANSI SQL and PMML as additional languages atop Cascading. Now people who have backgrounds working with SQL data warehouses or analytics frameworks such as R, Weka, SAS, SPSS, etc., can build large-scale apps to run on Hadoop just as well as developers working in Java, Clojure, Scala, etc. While a typical Enterprise workflow crosses through multiple departments and frameworks -- perhaps SQL for ETL, perhaps J2EE for business logic and data prep, perhaps SAS for predictive models -- Cascading allows multiple departments to integrate their workflow components into one app, one JAR file. This talk will show (1) using R and SQL on a laptop to define a complex app, then (2) using Cascading to integrate those components into a single JAR file which runs on a Hadoop cluster in parallel at scale.

About Paco Nathan

Paco Nathan is the Director of Data Science at Concurrent in SF and a committer on the Cascading open source project. He has expertise in Hadoop, R, AWS, machine learning, and predictive analytics, with 25+ years in the tech industry overall. For the past 10+ years Paco has led innovative Data Science teams, building large-scale apps. He is the author of the O'Reilly book "Enterprise Data Workflows with Cascading".