San Francisco Hadoop Users Message Board › Mar 2011 - Production workflow tools

Mar 2011 - Production workflow tools

A former member
Post #: 9

In many current production Hadoop environments, there is a
daunting gap in the availability of workflow management tools
that may be considered "production grade". This can be a huge
obstacle to the adoption & maintainability of Hadoop in
"real world" production environments & use cases.

--Requirements for a Production grade workflow system--
To start the discussion, it is useful to ask:
What are the desired requirements for a Production Grade Workflow system?
Below is the short list of desired features:

1) Define & Synchronize components of a workflow within a
Hadoop environment.
(Essentially a Directed Acyclic Graph)...
1a) Workflow execution path should be easily visible.
( It is important to remember that when things go wrong at 3:00 AM,
you are probably not going to get a Java / Hadoop expert to be
the first to respond ). A workflow solution should provide
some simple visibility into the progress of the workflow path,
which does NOT involve scanning long exception / stack traces
@ various locations.
1b) The DAG should extend to components which are not
confined to the scope of the Hadoop environment.
- The availability &/or completeness of raw input data.
2) Error / Exception Handling & alerting.
3) Support for a range of methodologies & languages.
( M/R , Pig , Java , shell process, etc...)
4) Retry on Error / Exception.
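
To make requirements 1), 2) & 4) a bit more concrete, here is a minimal Python sketch (not taken from any of the tools discussed below) of a runner that executes steps in DAG order, retries a failing step, sends an alert on each failure & skips anything downstream of a failed step. The names run_workflow, max_retries & alert are made up for illustration. It assumes Python 3.9+ (for graphlib) & that every dependency named in deps is itself a key in tasks. An external check such as "is the raw input complete?" (1b) can be modelled as just another step that raises when the data is missing.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps, max_retries=2, alert=print):
    """Run callables in dependency order, retrying failures and alerting.

    tasks: {name: zero-argument callable}   (hypothetical step definitions)
    deps:  {name: set of step names that must succeed first}
    Returns {name: "ok" | "failed" | "skipped"}.
    """
    status = {}
    graph = {name: deps.get(name, set()) for name in tasks}
    # static_order() yields a step only after all of its prerequisites,
    # and raises CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        blocked = [d for d in deps.get(name, ()) if status.get(d) != "ok"]
        if blocked:
            # An upstream step failed or was skipped, so do not run this one.
            status[name] = "skipped"
            continue
        # One initial attempt plus max_retries retries.
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                status[name] = "ok"
                break
            except Exception as exc:
                alert(f"{name}: attempt {attempt} failed: {exc}")
        else:
            # Loop finished without a successful attempt.
            status[name] = "failed"
    return status
```

The returned status map is the "simple visibility" asked for in 1a): whoever is on call at 3:00 AM can see which step failed & which steps were skipped because of it, without reading a stack trace.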

--Survey of current solutions in use--
A quick survey around the group reveals 5 workflow options
currently in use.
Listed in order : (Most commonly used [rough guess]).

1) Cron on the jump host : This seems to be the most common
case (Alas!).
- Rigid/Static Time based scheduling.
- No Dependencies between components.
- No exception/error handling or reporting.
- No Retry on Error.

Notes : This approach is OK for a development environment.
However, it is extremely unwieldy & awkward in a production environment.
The Cron solution is lacking in most of the basic requirements
(synchronization, exception handling, etc.) for a workflow system.

2) Do It Yourself : Many have cobbled together a collection of
scripts / tools (perl, python, groovy) to get some basic
workflow management capabilities. The feature range here
varies with implementation. But most DIY solutions provide
just a bare minimum of features on top of Cron, such as:
- Basic error reporting.
- Minimal synchronization.

Notes: Everyone who has gone down this road (myself included) says:
- "Don't go there! It is painful" &
- "We only did it this way because there was nothing available
at the time".
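
For anyone who has not seen one, the typical DIY wrapper looks roughly like the Python sketch below: each step is a shell command launched from Cron, "synchronization" is a marker file left behind by the previous step & "error reporting" is whatever the command wrote to stderr. This is an illustration only, not anyone's actual script; run_step, the .done / .error file names & the marker directory are all hypothetical.

```python
import subprocess
from pathlib import Path


def run_step(name, command, marker_dir, needs=()):
    """Run one shell command, gated on marker files left by earlier steps.

    Returns "waiting" if a prerequisite marker is missing, "failed" if the
    command exits non-zero, and "ok" otherwise.
    """
    marker_dir = Path(marker_dir)
    # Minimal synchronization: every prerequisite must have left <name>.done.
    missing = [n for n in needs if not (marker_dir / f"{n}.done").exists()]
    if missing:
        return "waiting"
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        # Basic error reporting: save stderr where ops staff can find it.
        (marker_dir / f"{name}.error").write_text(result.stderr)
        return "failed"
    (marker_dir / f"{name}.done").touch()
    return "ok"
```

Note what is still missing: no retry, no alerting & nothing cleans up stale markers from yesterday's run. That is exactly where these home-grown solutions start to hurt.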

3) Azkaban ( http://sna-projects.c... ) : An open source, simple batch
scheduler developed at LinkedIn.
See ( http://www.slideshare... ) for a good intro presentation.
Azkaban meets some of the basic requirements of a
modern production workflow environment:
- Time based scheduling.
- Dependencies (defined in a way similar to Make/Ant)
- Workflow path is fixed at start time.
- Error Handling
- Email based Error alerting (sort of). See:
-- ( http://snaprojects.ji... )
-- ( http://snaprojects.ji... )
- Retry on error.
- Support for resource locks.
- Some capabilities to handle management of components outside
of the Hadoop environment (with a little DIY)
- The biggest virtue of Azkaban is its simplicity. You can get it
up & running very quickly with just a few lines in an Azkaban
job definition file.

- Azkaban runs as a Tomcat webapp with a decent GUI
which would allow Ops staff (who are not necessarily Hadoop
experts) to handle error conditions.

- Early adopters (using Vers. 0.03) got to experience a bit of pain,
but it is now at Vers. 0.10 & is much improved.

A former member
Post #: 10
Individual posts have a maximum length, but there were lots more notes. Here are notes part 2.

4) Cascading ( http://www.cascading.... ) : An open source,
thin Java library & API, Process Planner & Scheduler for creating
workflows on a Hadoop cluster
(may be driven from Jython, JRuby, Groovy, etc.).
Cascading is an excellent choice for those running in advanced
production environments.
- Time, input-data, & direct program control based scheduling.
- Dependencies (DAG style).
- Workflow path is extremely flexible (may be under the app's control).
- Strong error handling & reporting.
- Retry on error.
- Does not handle management of components outside the cluster.

- If you can work within the constraints of having your M/R workflow
packaged up in one big jar, you have a very powerful workflow tool.
- PH : I considered Cascading a little > 1 year ago, but we run 75%
Pig in our production environment & there were some barriers
to implementation at that time. However, now there is Riffle
( http://www.cascading.... ).
So, I may take a 2nd look at that ...
- There were some comments which suggest that the learning curve for
implementing & using Cascading may be a bit steeper than
other solutions mentioned here.

5) Oozie ( http://yahoo.github.c... ) : An open source
service (developed at Yahoo) for production Hadoop environments. Oozie is feature
rich & shows good potential as a workflow solution. For a good intro,
see ( http://developer.yaho... ).

- Time & input-data based scheduling.
- Dependencies (DAG & PDL style).
- Workflow path is flexible (via decision nodes).
- Error handling & alerting (strong).
- Retry on error.
- No resource locks
- Does not handle management of components outside of the
Hadoop environment.

- Oozie 2 looks very promising & is packaged with CDH3.
- Although the primary author, Alejandro Abdelnur, now
works for Cloudera, Oozie remains an independent
open source project.
- PH : I tried to implement Oozie in my environment, but
I ran into some dependency issues (we are running an
older version of Hadoop in our production environment),
so I was a little blocked on this. Will look into
it again when we upgrade.
-- It is important to note that if you've got an active
production environment, with lots of diverse jobs
from diverse authors, upgrades can be a very very
big deal...
- There is a perception that Oozie 2 has a much steeper learning
curve than other solutions mentioned...
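
As a rough illustration of the "input-data based scheduling" listed for Cascading & Oozie (& of requirement 1b, completeness of raw input data), the Python sketch below checks whether every hourly partition for a day has been marked complete before a workflow is allowed to fire. It is not how either tool implements it. It assumes a hypothetical root/YYYY/MM/DD/HH layout, a producer that drops an empty done-flag file in each finished partition & a local filesystem standing in for HDFS.

```python
from datetime import datetime, timedelta
from pathlib import Path


def missing_partitions(root, day, done_flag="_SUCCESS"):
    """Return the hourly partitions of `day` that are not yet complete.

    A partition counts as complete once <root>/YYYY/MM/DD/HH/<done_flag>
    exists (assumed layout and flag name, for illustration only).
    """
    start = datetime(day.year, day.month, day.day)
    hours = [start + timedelta(hours=h) for h in range(24)]
    return [
        h.strftime("%Y/%m/%d/%H")
        for h in hours
        if not (Path(root) / h.strftime("%Y/%m/%d/%H") / done_flag).exists()
    ]


def ready_to_run(root, day):
    """True only when all 24 hourly partitions carry the done-flag."""
    return not missing_partitions(root, day)
```

A scheduler would poll ready_to_run() & report the output of missing_partitions() in its alert, so the on-call person knows which hour of data never arrived.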

--General Discussion--

The general discussion floated around a bit, between
comparing systems & sharing experiences.
However, one common theme in our discussions was the inability
of any of these solutions to manage workflow components outside
of the Hadoop cluster environment. Here is a good Quora thread
which discusses this: ( ).

Sure, you can throw something together with a bit of creative
DIY scripting (I've done this a bit with Azkaban) & it will provide
some management capabilities for components outside of the
Hadoop cluster environment. However, it will probably not be what
I would call a robust production grade solution.

However, in the discussion, Chris Wensel came up with what I
think is a very, very good idea.
Talend ( ) is an open source data & workflow framework from ( ).
I've used Talend for other (non-Hadoop) projects & have seen its
capabilities & usefulness in a production environment. The idea
is to integrate Talend with some of the workflow options
(Azkaban, Cascading, Oozie).
I think this is definitely worth looking into...

Another point brought up in the discussion is the level of
expertise in our various production shops. The range spans a wide
spectrum...
A former member
Post #: 11
Part 3/3 of these notes, courtesy of Phil Hontalas

- A user community of analysts (no Java expertise) submitting Pig
scripts & 24/7 ops personnel who are comfortable with basic unix shell
(but not much beyond that).
- A very sophisticated (Java & Hadoop savvy) community writing M/R in
Java & ops personnel who are very comfortable with Hadoop & large clusters.

From this, the obvious question is ...
What is the level of expertise we should implement for?

There are a number of opinions on that... I'll just throw in a
PH Editorial:
I think if Hadoop is going to be widely accepted in production,
current tools have a long way to go. I would love it if every
application going into my production systems were a thoughtfully
crafted design written in Java. Also, I would be in ecstasy (& get a
lot more sleep) if all ops staff were expert members of Usenix & the
ACM who are well versed in distributed computing...
However, I'm not holding my breath for that to happen.

I believe that if Hadoop is going to get broad general use in
production environments, any workflow system will have to be
implemented to address the lower end of the expertise range.

We ran out of time before we got to any firm conclusions. The
range of systems & opinions is wide enough that expecting a
conclusion in this time frame is not very realistic...

All in all, a very good start towards getting some discussion on
production workflow needs.