Title: Developing and Managing Enterprise Big Data Application Workflows Using Cascading
Cascading is an open source API for writing Enterprise-scale apps on top of Apache Hadoop and other big data frameworks. It provides a high-level abstraction for writing "workflows" in Java, Scala, Clojure, and other JVM languages, where an entire app gets compiled into a single JAR file. Cascading leverages the scalability of Hadoop but abstracts standard data processing operations away from the underlying map and reduce tasks. Data "taps" are available for integrating with JDBC, HBase, Memcached, and Cassandra, plus serialization in Apache Thrift, Avro, Kryo, etc.
Cascading has been in production use for nearly 5 years, with apps running in Finance, Health Care, Transportation, and other verticals. Case studies have been published about large use cases at Twitter, Etsy, Trulia, TeleNav, Climate Corporation, Airbnb, and Williams-Sonoma.
This talk will provide an introduction to Cascading and examine where its use is indicated, in contrast to other high-level abstractions such as Pig and Hive. We will give a brief tutorial on developing Enterprise data workflows using Cascading and importing machine learning models from statistical packages like SAS or R. We will also demonstrate how administrators can visualize and analyze the behavior of Cascading workflows.
Paco Nathan is a Data Scientist at Concurrent in SF and a committer on Cascading. He has expertise in Hadoop, R, AWS, machine learning, predictive analytics, and NLP, with 25+ years of experience in the tech industry. For the past 10 years Paco has led innovative Data teams.