First off, what is Oozie?
From the official website at http://oozie.apache.org/ we get:
Apache Oozie Workflow Scheduler for Hadoop
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
Oozie is a scalable, reliable and extensible system.
How does that make life easier, and you more attractive to potential mates? Well, I’ll tell you how.
- It makes chaining together more complex Hadoop jobs really easy
- It allows your code to be way more maintainable, since you can simply chain together a bunch of much simpler jars to create more complex workflows
- It allows you to fork jobs, so that you can get more than one thing done at once.
- It runs your actions as jobs inside the Hadoop cluster itself, so it makes the best use of your infrastructure
- Defining actions is dead easy. You can use a web-based Oozie editor where you drag and drop your tasks together, or you can simply hack up an easy-to-use XML document to define your workflow(s).
- It has additional actions built in to do cool things like execute shell scripts, run Java-based actions, use the filesystem (create and delete dirs, files etc), email you when done, and many more. All done, all free (no extra work)
- It allows you to schedule jobs. This is way cooler than it sounds.
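To make the forking point a bit more concrete, here is a rough Python sketch that writes out the `<fork>` and `<join>` nodes of a workflow XML document. The node names (`split`, `merge`, `load-hbase` and so on) and the helper `add_fork_join` are made up for illustration, and the schema version in the namespace is an assumption, so treat this as a sketch rather than a ready-to-run workflow.

```python
import xml.etree.ElementTree as ET


def add_fork_join(parent, fork_name, branches, join_name, next_node):
    """Append a <fork> that starts every branch and a <join> that waits for them."""
    fork = ET.SubElement(parent, "fork", {"name": fork_name})
    for branch in branches:
        ET.SubElement(fork, "path", {"start": branch})
    # The join only moves on to `next_node` once all forked paths have arrived.
    ET.SubElement(parent, "join", {"name": join_name, "to": next_node})
    return fork


if __name__ == "__main__":
    # Hypothetical workflow name and node names; schema version is an assumption.
    app = ET.Element(
        "workflow-app", {"name": "fork-demo", "xmlns": "uri:oozie:workflow:0.5"}
    )
    add_fork_join(app, "split", ["load-hbase", "run-mapreduce"], "merge", "send-email")
    print(ET.tostring(app, encoding="unicode"))
```

In a full workflow, each forked action would point its `ok` transition at the join node (`merge` here), so that the workflow only carries on once every branch has finished.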
An example of an Oozie workflow may be something like:
- Copy data file from FTP server
- Parse data file
  - Load relevant data to HBase
- Run MapReduce on data
- Process data further and pass to another Jar to populate website data
- Email you to tell you the job is done
If that doesn’t seem cool, then I don’t know what is!
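For the curious, the example workflow above can be sketched as the control-flow skeleton of a workflow XML document. The Python below just generates that skeleton: the action names in `STEPS`, the workflow name `example-wf` and the helper functions are all invented for illustration, the schema version in the namespace is an assumption, and the actions are left without bodies, so this shows the chaining only and is not something you could submit as-is.

```python
import xml.etree.ElementTree as ET

# Hypothetical action names for the example steps; not from any real project.
STEPS = [
    "copy-from-ftp",
    "parse-data-file",
    "load-hbase",
    "run-mapreduce",
    "populate-website",
    "send-email",
]


def build_workflow(name, steps):
    """Return a <workflow-app> element that chains `steps` one after another."""
    app = ET.Element("workflow-app", {"name": name, "xmlns": "uri:oozie:workflow:0.5"})
    ET.SubElement(app, "start", {"to": steps[0]})
    # Each action hands over to the next one on success; the last goes to "end".
    for step, next_node in zip(steps, steps[1:] + ["end"]):
        action = ET.SubElement(app, "action", {"name": step})
        # A real action needs a body here (for example <shell>, <map-reduce> or <email>).
        ET.SubElement(action, "ok", {"to": next_node})
        ET.SubElement(action, "error", {"to": "fail"})
    # Any failing action jumps to this kill node, which stops the workflow.
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow failed"
    ET.SubElement(app, "end", {"name": "end"})
    return app


def transitions(app):
    """Map each action name to the node it moves to on success."""
    return {a.get("name"): a.find("ok").get("to") for a in app.findall("action")}


if __name__ == "__main__":
    workflow = build_workflow("example-wf", STEPS)
    ET.indent(workflow)  # pretty-print; needs Python 3.9+
    print(ET.tostring(workflow, encoding="unicode"))
```

Every action gets an `ok` transition to the next step and an `error` transition to a shared `kill` node, which is the basic pattern a hand-written workflow would follow too.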