
Schedoscope: Painfree scheduling for agile Hadoop data warehouses

Details

Dear HUG members,

We've organized another great talk for you guys on November 17th!

Dominik will be our special guest of the evening and will talk about Schedoscope.

This talk motivates Schedoscope, compares it to other scheduling systems, and gives a demonstration of the complete scheduling lifecycle with Schedoscope (for more information on the talk, see the abstract below).

I am really looking forward to seeing you all out there! :)

Cheers,

Selma

Abstract:

Scheduling ingestion and transformation jobs within a Hadoop-based data warehouse with standard technologies like Oozie can be tedious: logic that has already been developed has to be integrated into XML workflows and bundles in a manual and error-prone way, testing becomes a complicated and time-consuming task, easily avoidable errors only show up at runtime, and cluster utilization is suboptimal. Things get even worse when schemas or transformation logic have to be adapted, which may require migration scripts and heavy manual involvement, e.g. selecting data to delete or (re)starting individual jobs.

Schedoscope is a scheduling framework developed by the Otto Group. Its primary goal is the agile development, testing, loading and reloading of Hive-based tables within a Hadoop cluster. It provides a Scala DSL for concisely specifying (i) table and partition structures, (ii) the dependencies among them, and (iii) the transformations that load the table data.
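To make these three concerns concrete, here is a minimal, self-contained Scala sketch. It deliberately does not use the actual Schedoscope DSL: the View trait, Field class, dependsOn, transformation and the example tables below are made-up names for this post that only mimic how a table, its dependencies and its loading logic might be declared in one place.

    // Illustrative toy only -- NOT the real Schedoscope API.
    // (i) table/partition structure, (ii) dependencies, (iii) transformation.

    case class Field(name: String, hiveType: String)

    trait View {
      def tableName: String
      def fields: Seq[Field]
      def partitions: Seq[Field] = Seq(Field("year", "string"), Field("month", "string"))
      def dependsOn: Seq[View] = Seq.empty // (ii) upstream views this one is computed from
      def transformation: String           // (iii) loading logic, here just plain HiveQL
    }

    // raw click events as ingested into the warehouse
    object RawClicks extends View {
      val tableName = "raw_clicks"
      val fields = Seq(Field("user_id", "string"), Field("url", "string"), Field("ts", "bigint"))
      val transformation = "-- filled by an ingestion job"
    }

    // daily aggregate derived from the raw events
    object DailyClicksPerUser extends View {
      val tableName = "daily_clicks_per_user"
      val fields = Seq(Field("user_id", "string"), Field("clicks", "bigint"))
      override val dependsOn = Seq(RawClicks)
      val transformation =
        """INSERT OVERWRITE TABLE daily_clicks_per_user PARTITION (year, month)
          |SELECT user_id, count(*), year, month FROM raw_clicks GROUP BY user_id, year, month""".stripMargin
    }

    object Demo extends App {
      // a scheduler would walk dependsOn transitively and (re)compute stale views;
      // this toy only prints the derived order
      println(DailyClicksPerUser.dependsOn.map(_.tableName).mkString(", ") + " -> " + DailyClicksPerUser.tableName)
    }

In the real DSL, the transformation would be one of the supported mechanisms listed below (file operations, MapReduce, Hive, Pig, Oozie) rather than a raw string, and Schedoscope itself decides when each view needs to be (re)computed.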

When using Schedoscope,

  • managing DDLs and migration scripts is superfluous

  • changes in data structures and transformation rules are detected automatically, and all necessary (re)computations are triggered

  • one can choose among a number of supported transformation mechanisms, including file operations, MapReduce, Hive, Pig and Oozie

  • Scala's static type system, together with the IDE's autocompletion, catches simple typos and errors during development rather than at runtime

  • one can easily write compact, lightweight unit tests that run directly in the IDE (a sketch of such a test follows this list)

  • specifying only the relevant target table when loading data is sufficient; all necessary intermediate steps are detected, executed and managed by Schedoscope

  • cluster utilization is better than with Oozie, because no resource-consuming launcher tasks are started
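Schedoscope ships its own test support; the snippet below is not it. It is only a rough illustration, written against the toy views from the earlier sketch and assuming ScalaTest 3.x on the classpath, of the kind of compact, IDE-runnable check the unit-testing bullet refers to.

    import org.scalatest.funsuite.AnyFunSuite

    // Toy test against the illustrative views above; runnable from the IDE
    // with ScalaTest (assumed). Schedoscope's own test framework works
    // differently and is not shown here.
    class DailyClicksPerUserSpec extends AnyFunSuite {

      test("the aggregate view declares raw_clicks as its only dependency") {
        assert(DailyClicksPerUser.dependsOn.map(_.tableName) == Seq("raw_clicks"))
      }

      test("the transformation writes into the aggregate's own table") {
        assert(DailyClicksPerUser.transformation.contains("INSERT OVERWRITE TABLE daily_clicks_per_user"))
      }
    }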

This talk motivates Schedoscope, compares it to other scheduling systems, and gives a demonstration of the complete scheduling lifecycle with Schedoscope.
Schedoscope is available as open source at http://schedoscope.org and was presented as a system demo at Strata + Hadoop World 2015 in San Jose.

Speaker Bio:

Dr. Dominik Benz holds a PhD in Computer Science from the University of Kassel in the field of Knowledge and Data Engineering. Since 2012 he has been working as a Big Data Architect at Inovex GmbH. A focus of his work is the design and setup of Hadoop-based data warehouse infrastructures for major companies in Germany. Besides his role as a committer for the open-source scheduling framework "Schedoscope", he is an active member of the German Hadoop community, with contributions at, e.g., Berlin Buzzwords and various Meetups.

Location: comSysto GmbH, Tumblingerstr. 23, 80337 München