
Schedoscope: Painfree scheduling for agile Hadoop data warehouses

Details

Dear HUG members,

We've organized another great talk for you guys on November 17th!

Dominik will be our special guest of the evening and will talk about Schedoscope.

This talk motivates Schedoscope, compares it to other scheduling systems, and gives a demonstration of the complete scheduling lifecycle with Schedoscope (for more information on the talk, see the abstract below).

I am really looking forward to seeing you all out there! :)

Cheers,

Selma

Abstract:

Scheduling ingestion and transformation jobs within a Hadoop-based data warehouse with standard technologies like Oozie can be tedious: logic that has already been developed has to be integrated into XML workflows and bundles in a manual and error-prone way, testing becomes a complicated and time-consuming task, easily avoidable errors only show up at runtime, and cluster utilization is suboptimal. Things get even worse when schemas or transformation logic have to be adapted, which may require migration scripts and heavy manual involvement, e.g. selecting data to delete or (re)starting individual jobs.

Schedoscope is a scheduling framework developed by the Otto Group. Its primary goal is the agile development, testing, loading and reloading of Hive-based tables within a Hadoop cluster. It provides a Scala DSL for concisely specifying (i) table and partition structures, (ii) the dependencies among them, and (iii) the transformations that load the table data.
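To make these three concerns concrete, here is a minimal, self-contained Scala sketch. It deliberately does not use the actual Schedoscope DSL: the View trait, Field class, dependsOn, transformation and the example tables below are made-up names for this post that only mimic how a table, its dependencies and its loading logic might be declared in one place.

    // Illustrative toy only -- NOT the real Schedoscope API.
    // (i) table/partition structure, (ii) dependencies, (iii) transformation.

    case class Field(name: String, hiveType: String)

    trait View {
      def tableName: String
      def fields: Seq[Field]
      def partitions: Seq[Field] = Seq(Field("year", "string"), Field("month", "string"))
      def dependsOn: Seq[View] = Seq.empty // (ii) upstream views this one is computed from
      def transformation: String           // (iii) loading logic, here just plain HiveQL
    }

    // raw click events as ingested into the warehouse
    object RawClicks extends View {
      val tableName = "raw_clicks"
      val fields = Seq(Field("user_id", "string"), Field("url", "string"), Field("ts", "bigint"))
      val transformation = "-- filled by an ingestion job"
    }

    // daily aggregate derived from the raw events
    object DailyClicksPerUser extends View {
      val tableName = "daily_clicks_per_user"
      val fields = Seq(Field("user_id", "string"), Field("clicks", "bigint"))
      override val dependsOn = Seq(RawClicks)
      val transformation =
        """INSERT OVERWRITE TABLE daily_clicks_per_user PARTITION (year, month)
          |SELECT user_id, count(*), year, month FROM raw_clicks GROUP BY user_id, year, month""".stripMargin
    }

    object Demo extends App {
      // a scheduler would walk dependsOn transitively and (re)compute stale views;
      // this toy only prints the derived order
      println(DailyClicksPerUser.dependsOn.map(_.tableName).mkString(", ") + " -> " + DailyClicksPerUser.tableName)
    }

In the real DSL, the transformation would be one of the supported mechanisms listed below (file operations, MapReduce, Hive, Pig, Oozie) rather than a raw string, and Schedoscope itself decides when each view needs to be (re)computed.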

When using Schedoscope,

  • managing DDLs and migration scripts is superfluous

  • changes in data structures and transformation rules are detected automatically, and all necessary (re)computations are triggered

  • one can choose among a number of supported transformation mechanisms, including file operations, MapReduce, Hive, Pig and Oozie

  • Scala's static type system, together with the IDE's autocompletion, catches simple typos and errors during development rather than at runtime

  • one can easily write compact, lightweight unit tests that run directly in the IDE (a sketch of such a test follows this list)

  • specifying only the relevant target table when loading data is sufficient; all necessary intermediate steps are detected, executed and managed by Schedoscope

  • cluster utilization is better than with Oozie, because no resource-consuming launcher tasks are started
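Schedoscope ships its own test support; the snippet below is not it. It is only a rough illustration, written against the toy views from the earlier sketch and assuming ScalaTest 3.x on the classpath, of the kind of compact, IDE-runnable check the unit-testing bullet refers to.

    import org.scalatest.funsuite.AnyFunSuite

    // Toy test against the illustrative views above; runnable from the IDE
    // with ScalaTest (assumed). Schedoscope's own test framework works
    // differently and is not shown here.
    class DailyClicksPerUserSpec extends AnyFunSuite {

      test("the aggregate view declares raw_clicks as its only dependency") {
        assert(DailyClicksPerUser.dependsOn.map(_.tableName) == Seq("raw_clicks"))
      }

      test("the transformation writes into the aggregate's own table") {
        assert(DailyClicksPerUser.transformation.contains("INSERT OVERWRITE TABLE daily_clicks_per_user"))
      }
    }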

This talk motivates Schedoscope, compares it to other scheduling systems, and gives a demonstration of the complete scheduling lifecycle with Schedoscope.
Schedoscope is available as open source at http://schedoscope.org and was presented as a system demo at Strata + Hadoop World 2015 in San Jose.

Speaker Bio:

Dr. Dominik Benz holds a PhD in Computer Science from the University of Kassel in the field of Knowledge and Data Engineering. Since 2012 he has been working as a Big Data Architect at Inovex GmbH. A focus of his work is the design and setup of Hadoop-based data warehouse infrastructures for major companies in Germany. Besides his role as a committer for the open-source scheduling framework "Schedoscope", he is an active member of the German Hadoop community, with contributions at, e.g., Berlin Buzzwords and various Meetups.

Location: comSysto GmbH, Tumblingerstr. 23, 80337 München