Broadcast URL: http://www.ustream.tv/channel/denverspark
University of Colorado Denver - Tuesday April 23, 2013 @ 6:00pm MST
Large auditorium (170 person capacity) with 20' screen.
Location: CU Denver - North Classroom #1539 - 1200 Larimer Street
Denver, CO[masked] - Map: http://bit.ly/Tyznzg
6:00 - 6:15 Schmooze - Old Chicago Pizza will be served.
6:15 - 8:30 Demonstrate the Spark - Shark Data Analytics Stack on a Hadoop Cluster
8:30 - 9:30 Network at Old Chicago at 14th and Market.
NOTE: BRING LAPTOP / SMART DEVICE - WIRELESS PROVIDED
Data scientists need to access and analyze data quickly and easily. The difference between good data science and high-value data science increasingly comes down to the ability to analyze larger amounts of data at faster speeds. Speed wins in data science: delivering valuable, actionable insights to the client in a timely fashion can mean the difference between a competitive advantage and little or no added value.
One flaw of Hadoop MapReduce is high latency. Given the growing volume, variety and velocity of data, organizations and data scientists require faster analytical platforms. Put simply, speed wins, and Spark gains its speed by caching data in memory and optimizing communication between the master and the nodes.
The Berkeley Data Analytics Stack (BDAS) is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos.
Spark is an open source cluster computing system that makes data analytics fast. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce.
Spark is a high-speed cluster computing system compatible with Hadoop that can outperform Hadoop MapReduce by up to 100 times thanks to its ability to perform computations in memory. It is a computation engine that runs on top of the Hadoop Distributed File System (HDFS) and efficiently supports iterative processing (e.g., ML algorithms) and interactive queries.
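The load-once, query-repeatedly idea described above can be illustrated with a small sketch. This is plain Python, not the Spark API, and every name in it (write_sample_log, DiskDataset, CachedDataset, the events.log file) is hypothetical. It only contrasts a dataset that goes back to disk for every query with one that keeps its lines in memory after the first read; it says nothing about how Spark is implemented.

```python
import os
import tempfile


def write_sample_log(path, n_lines=1000):
    """Write a small made-up log file; every 10th line is an ERROR."""
    with open(path, "w") as f:
        for i in range(n_lines):
            level = "ERROR" if i % 10 == 0 else "INFO"
            f.write(f"{level} event {i}\n")


class DiskDataset:
    """Re-reads the file on every query, as a disk-based job would."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0  # how many times the file was actually opened

    def lines(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return f.read().splitlines()

    def count(self, keyword):
        return sum(1 for line in self.lines() if keyword in line)


class CachedDataset(DiskDataset):
    """Reads the file once, then serves every later query from memory."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def lines(self):
        if self._cache is None:
            self._cache = super().lines()  # the only trip to disk
        return self._cache


def demo():
    """Run the same three queries against both datasets and compare."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.log")
        write_sample_log(path)
        keywords = ["ERROR", "INFO", "event 42"]
        disk, cached = DiskDataset(path), CachedDataset(path)
        disk_counts = [disk.count(k) for k in keywords]
        cached_counts = [cached.count(k) for k in keywords]
        return disk_counts, cached_counts, disk.disk_reads, cached.disk_reads


if __name__ == "__main__":
    disk_counts, cached_counts, disk_reads, cached_reads = demo()
    print("same answers:", disk_counts == cached_counts)
    print("disk reads without cache:", disk_reads)
    print("disk reads with cache:", cached_reads)
```

Both datasets return the same counts, but the cached one touches the disk once no matter how many queries follow. Iterative algorithms and interactive queries repeat exactly this access pattern, which is why keeping the working set in memory pays off.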
Shark is a large-scale data warehouse system that runs on top of Spark. It is a port of Apache Hive onto Spark and is backward-compatible with Hive, so users can run unmodified HiveQL queries on existing Hive warehouses. Without any modification to the data or queries, Shark can answer Hive queries up to 100 times faster when the data fits in memory and up to 5-10 times faster when the data is stored on disk. Shark is also open source as part of BDAS.
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.
This presentation covers the nuts and bolts of the Spark, Shark and Mesos Data Analytics Stack on a Hadoop Cluster. We will demonstrate capabilities with a data science use-case.
Michael Malak is a Data Analytics Senior Engineer at Time Warner Cable. He has been pushing computers to their limit since the 1970s. Mr. Malak earned his M.S. in Mathematics from George Mason University. He blogs at http://www.technicaltidbit.com.
Chris Deptula is a Senior System Integration Consultant with OpenBI and is responsible for data integration and implementation of Big Data systems. With over five years of experience in data integration, business intelligence, and big data platforms, Chris has helped deploy multiple production Hadoop clusters. Prior to OpenBI, Chris was a consultant with FICO implementing marketing intelligence and fraud identification systems. Chris holds a degree in Computer and Information Technology from Purdue University. Follow Chris on Twitter @chrisdeptula.
Michael Walker is a managing partner at Rose Business Technologies, a professional technology services and systems integration firm. He leads the Data Science Professional Practice at Rose. Mr. Walker received his undergraduate degree from the University of Colorado and earned a doctorate from Syracuse University. He speaks and writes frequently about data science and is writing a book on Data Science Strategy for Business. Learn more about the Rose Data Science Professional Practice at http://bit.ly/10TgVHG. Follow Mike on Twitter @Ironwalker76.