
Spark - Shark Data Analytics Stack on a Hadoop Cluster Tues. April 23 @6pm

Broadcast URL:


University of Colorado Denver - Tuesday April 23, 2013 @ 6:00pm MST

Large auditorium (170 person capacity) with 20' screen.

Location: CU Denver - North Classroom #1539 - 1200 Larimer Street
Denver, CO[masked] - Map:


6:00 - 6:15 Schmooze - Old Chicago Pizza will be served.

6:15 - 8:30 Demonstrate the Spark - Shark Data Analytics Stack on a Hadoop Cluster

8:30 - 9:30 Network at Old Chicago at 14th and Market.






Data scientists need to access and analyze data quickly and easily. Increasingly, what separates high-value data science from merely good data science is the ability to analyze larger amounts of data at faster speeds. Speed kills in data science: delivering valuable, actionable insights to the client in a timely fashion can mean the difference between competitive advantage and little or no added value.

One flaw of Hadoop MapReduce is high latency. Given the growing volume, variety and velocity of data, organizations and data scientists require faster analytical platforms. Put simply, speed kills, and Spark gains speed through caching and by optimizing master/node communications.

The Berkeley Data Analytics Stack (BDAS) is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos.

Spark is an open source cluster computing system that makes data analytics fast. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce.
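The load-once, query-repeatedly idea can be illustrated with a minimal sketch in plain Python. This is not Spark's API: the `CachedDataset` class, the `run_queries` helper and the sample data are hypothetical names invented for illustration, and the sketch runs on a single machine rather than a cluster. It only shows why keeping data in memory avoids rescanning the disk for every query.

```python
import os
import tempfile


class CachedDataset:
    """Toy stand-in for a dataset that can be pinned in memory (not Spark's API)."""

    def __init__(self, path, cache=False):
        self.path = path
        self.cache = cache
        self.disk_reads = 0   # number of times the file was scanned
        self._rows = None     # in-memory copy, filled on first use when cache=True

    def rows(self):
        # Serve from memory if a cached copy exists; otherwise scan the file.
        if self.cache and self._rows is not None:
            return self._rows
        self.disk_reads += 1
        with open(self.path) as f:
            loaded = [int(line) for line in f]
        if self.cache:
            self._rows = loaded
        return loaded

    def query(self, predicate):
        """Count the rows that match the predicate."""
        return sum(1 for r in self.rows() if predicate(r))


def run_queries(dataset):
    """Run three ad hoc queries against the same dataset."""
    return [
        dataset.query(lambda r: r % 2 == 0),
        dataset.query(lambda r: r > 500),
        dataset.query(lambda r: r % 7 == 0),
    ]


# Write a small sample file (hypothetical data: the integers 0..999).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("\n".join(str(i) for i in range(1000)))

disk_based = CachedDataset(sample_path, cache=False)
in_memory = CachedDataset(sample_path, cache=True)

disk_results = run_queries(disk_based)
memory_results = run_queries(in_memory)
os.remove(sample_path)

print(disk_results == memory_results)               # both approaches give the same answers
print(disk_based.disk_reads, in_memory.disk_reads)  # file scans needed by each approach
```

Both datasets return identical answers; the difference is that the uncached one rescans the file for every query, while the cached one scans it once and serves later queries from memory.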

Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100 times thanks to its ability to perform computations in memory. It is a computation engine that runs on top of the Hadoop Distributed File System (HDFS) and efficiently supports iterative processing (e.g., ML algorithms) and interactive queries.

Shark is a large-scale data warehouse system that runs on top of Spark. It is a port of Apache Hive onto Spark and is backward-compatible with Hive, so users can run unmodified HiveQL queries on existing Hive warehouses. Shark can answer Hive queries up to 100 times faster than Hive when the data fits in memory and up to 5-10 times faster when the data is stored on disk, without modification to the data or queries. Shark is also open source as part of BDAS.

Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.
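As a rough illustration of several frameworks sharing one pool of cluster resources, here is a toy sketch in plain Python. It is not Mesos code and does not model Mesos's actual scheduling policy: the `Framework` class, the `share_cluster` function, the node names and the CPU counts are all hypothetical, and the allocation rule (each framework may launch one task per round from whatever CPUs are still free) is a simplification chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    """A hypothetical framework scheduler (e.g. for Spark or Hadoop jobs)."""
    name: str
    cpus_per_task: int
    pending_tasks: int
    running: list = field(default_factory=list)  # nodes where tasks were launched

    def consider(self, node, free_cpus):
        """Launch one task if the offered CPUs suffice; return the CPUs used."""
        if self.pending_tasks > 0 and free_cpus >= self.cpus_per_task:
            self.pending_tasks -= 1
            self.running.append(node)
            return self.cpus_per_task
        return 0


def share_cluster(nodes, frameworks):
    """Offer free CPUs to each framework in turn until no more tasks fit."""
    free = dict(nodes)
    progress = True
    while progress:
        progress = False
        for fw in frameworks:
            for node in free:
                used = fw.consider(node, free[node])
                if used:
                    free[node] -= used
                    progress = True
                    break  # at most one task per framework per round
    return free


# Hypothetical cluster: two nodes with 4 CPUs each, shared by two frameworks.
spark = Framework("spark", cpus_per_task=2, pending_tasks=3)
hadoop = Framework("hadoop", cpus_per_task=1, pending_tasks=2)
free = share_cluster({"node-1": 4, "node-2": 4}, [spark, hadoop])

print(spark.running)   # nodes hosting the Spark tasks
print(hadoop.running)  # nodes hosting the Hadoop tasks
print(free)            # CPUs left unallocated on each node
```

In this run both frameworks end up with tasks on `node-1`, which is the point of the sketch: the nodes are not statically partitioned between the two frameworks.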

This presentation covers the nuts and bolts of the Spark, Shark and Mesos Data Analytics Stack on a Hadoop Cluster. We will demonstrate its capabilities with a data science use case.


Michael Malak is a Data Analytics Senior Engineer at Time Warner Cable. He has been pushing computers to their limit since the 1970s. Mr. Malak earned his M.S. in Mathematics from George Mason University. He blogs at

Chris Deptula is a Senior System Integration Consultant with OpenBI and is responsible for data integration and implementation of Big Data systems. With over 5 years of experience in data integration, business intelligence, and big data platforms, Chris has helped deploy multiple production Hadoop clusters. Prior to OpenBI, Chris was a consultant with FICO implementing marketing intelligence and fraud identification systems. Chris holds a degree in Computer and Information Technology from Purdue University. Follow Chris on Twitter @chrisdeptula.

Michael Walker is a managing partner at Rose Business Technologies, a professional technology services and systems integration firm. He leads the Data Science Professional Practice at Rose. Mr. Walker received his undergraduate degree from the University of Colorado and earned a doctorate from Syracuse University. He speaks and writes frequently about data science and is writing a book on Data Science Strategy for Business. Learn more about the Rose Data Science Professional Practice at Follow Mike on Twitter @Ironwalker76.


  • A former member

    Lot of great content. Had to watch online, but it was still great. Nice job.

    April 24, 2013

    • A former member

      Agreed, except for the commercials. Oy.

      April 24, 2013

    • A former member

      I think the Windex commercial must have had relevance to Michael Walker's comment about Veracity being the 5th V. What else could give you a clearer view into clean data than Windex? Must have been some solid machine learning driving the targeting of those ads.

      1 · April 24, 2013

  • Irena P.

    This was my first meetup with this group - very impressive. Interesting presentations, learned some new things. Sorry I couldn't stay to socialize afterwards - hopefully next time!

    April 24, 2013

  • Eric C.

    Great Meeting, introducing Shark, Pentaho and the Use Case. Thank you Presenters!

    1 · April 24, 2013

  • Sanity S.

    Excellent presentations. Michael Malak gave a terrific overview of the technology (Spark and Shark and the next generation tools for streaming analytics); Michael Walker provided both a high-level overview of Big Data and the role of Data Scientists, along with a very interesting real-world example of big data: using NFL statistics from the last 10 years to test the theory that running more plays (hurry-up and no-huddle offense) contributes to winning more games. Chris Deptula gave a riveting demo of the power of visualization software like Pentaho to help data scientists discern key trends as well as powerfully present the findings. Well done by all!

    1 · April 23, 2013

  • John D.

    Excellent presentation/material. Glad I could remote in, sorry I couldn't get the pizza.

    April 23, 2013

    • gretchen g.

      awesome presentation, thank you!

      April 23, 2013

  • Glen B A.

    I watched on the stream. It was very good. I learned that I need more learning and need to make face-to-face contact with people who might help me with my problem: forecasting cost and schedule performance from past performance on large defense programs.

    April 23, 2013

  • A former member

    What classroom?

    April 23, 2013

  • A former member

    It is room 1536

    April 23, 2013

  • Machender J.


    April 22, 2013

  • Shane K.

    planning to try to make this one. Any chance of a webcast?

    April 11, 2013

    • Michael W.

      Yes, we will have our own live webcast in conjunction with a live webcast produced by Big Data Week.

      April 12, 2013

  • Gretchen G.

    sounds great!

    April 4, 2013
