NAME

Net::Hadoop::Oozie - Interface to various Oozie REST endpoints and utility methods.

VERSION

version 0.114

DESCRIPTION

This module is a Perl interface to the Oozie REST service endpoints. It also includes utility methods for bulk requests and some admin functionality.

SYNOPSIS

    use Net::Hadoop::Oozie;
    my $oozie = Net::Hadoop::Oozie->new( %options );

ACCESSORS

action

api_version

doas

filter

The submission format is filter_key1=filter_value1;filter_key2=...;, but the filters are defined as a hash.

    filter => {
        status => ...,
    }

The valid filters are listed below.

name

The application name from the workflow/coordinator/bundle definition

user

The user that submitted the job

group

The group for the job

status

The status of the job

Keep the following behavior in mind when using filters:

  • The query will do an AND among all the filter names.

  • The query will do an OR among all the filter values for the same name.

  • Multiple values must be specified as different name value pairs.
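The rules above can be sketched as a small serializer. This is only an illustration of the submission format, not the module's own code: the helper name build_filter_string is hypothetical, and accepting an array reference for multiple values is an assumption made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a filter hash into the "name=value;name=value"
# submission format. An array reference (an assumption of this sketch)
# expands into one name=value pair per value.
sub build_filter_string {
    my ($filter) = @_;
    my @pairs;
    for my $name ( sort keys %{ $filter } ) {
        my $value  = $filter->{$name};
        my @values = ref $value eq 'ARRAY' ? @{ $value } : ($value);
        push @pairs, map { "$name=$_" } @values;
    }
    return join ';', @pairs;
}

print build_filter_string({
    user   => 'alice',
    status => [ 'RUNNING', 'SUSPENDED' ],
}), "\n";
```

Following the rules listed above, Oozie would read the resulting string as "user is alice AND (status is RUNNING OR status is SUSPENDED)".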

jobtype

The Oozie documentation lists workflow, coordinator and bundle, but in CDH 4.4 the valid values are '', 'coordinators' and 'bundles'. The "workflows" and "coordinators" methods are helper functions that set these values behind the scenes.

len

Defaults to 50.

offset

Defaults to 1.

order

Either asc or desc; defaults to asc. For instance, when used on a coordinator in a "job" call, desc puts the len most recent actions in the actions key, most recent first; the offset is then applied from the end of the list.
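The interaction of len, offset and order can be modelled locally. The sketch below is not how the module or the Oozie server is implemented; select_actions is a hypothetical function that only mimics the behavior described above on a plain list, using the documented defaults (len 50, offset 1, order asc).

```perl
use strict;
use warnings;

# Hypothetical local model of the paging described above. The real
# selection happens on the Oozie server; this only mimics the result.
sub select_actions {
    my (%opt) = @_;
    my @actions = @{ $opt{actions} };
    my $len     = $opt{len}    // 50;     # documented default
    my $offset  = $opt{offset} // 1;      # documented default (1-based)
    my $order   = $opt{order}  // 'asc';  # documented default

    # With desc, the list is walked from its end, most recent first.
    @actions = reverse @actions if $order eq 'desc';

    my $start = $offset - 1;
    return () if $start > $#actions;
    my $end = $start + $len - 1;
    $end = $#actions if $end > $#actions;
    return @actions[ $start .. $end ];
}

# Ten actions, numbered oldest (1) to most recent (10).
my @latest = select_actions( actions => [ 1 .. 10 ], len => 3, order => 'desc' );
print "@latest\n";
```

With order set to desc and the default offset, the three selected entries are the last three of the list, most recent first.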

show

METHODS

END POINTS

admin

build_version

coord_rerun

coordinators

job

jobs

kill

resume

submit_job

For details about job submission through REST, see https://oozie.apache.org/docs/4.2.0/WebServicesAPI.html#Job_Submission.

Required parameters are listed below.

  • oozie.wf.application.path

    The workflow path, like /oozie_workflows/myworkflow. The workflow must already be deployed there.

  • appName

    The name for this specific instance. It can be anything you want.

Optional parameters are listed below.

Auto variables

If you want a variable interpolated in your script (a date, an integer, or anything else), pass it in the options you call the method with. If you pass foo => 'bar', you will be able to use it inside the workflow as ${foo}.

Configuration properties

Useful parameters for Oozie itself (like the queue name) seem to need an extra level of handling. They can be set dynamically, but this requires a tweak in the top configuration section of the workflow definition itself. For instance, to set mapreduce.job.queuename so that the tasks are assigned to a specific fair scheduler queue, declare it in the global configuration section, like this:

    <property>
        <name>mapreduce.job.queuename</name>
        <value>${queueName}</value>
    </property>

Then call "submit_job" with this added to the options hash:

    queueName => "root.<queue name>"
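For orientation, the Oozie REST job submission endpoint takes its parameters as an XML configuration made of name/value properties. The sketch below shows how an options hash could map onto that payload. It is an illustration only: options_to_config_xml is a hypothetical helper, the assumption that every option becomes one property is made for this sketch, and root.default is a made-up queue name.

```perl
use strict;
use warnings;

my %XML_ESCAPE = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );

# Hypothetical helper: render an options hash as an Oozie-style
# <configuration> document, one <property> per option.
sub options_to_config_xml {
    my ($options) = @_;
    my $xml = "<configuration>\n";
    for my $name ( sort keys %{ $options } ) {
        # Copy the value, then escape XML special characters in the copy.
        ( my $value = $options->{$name} ) =~ s/([&<>])/$XML_ESCAPE{$1}/g;
        $xml .= "  <property>\n"
              . "    <name>$name</name>\n"
              . "    <value>$value</value>\n"
              . "  </property>\n";
    }
    return $xml . "</configuration>\n";
}

print options_to_config_xml({
    appName   => 'job1',
    queueName => 'root.default',   # made-up queue name
    foo       => 'bar',            # usable as ${foo} in the workflow
});
```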

This method returns a job ID which you can pass directly to the "job" method above to query the job status. You can therefore launch a job from a script and have a loop poll the job status at regular intervals (be nice, please) to check when it is done. The following code is untested:

    use strict;
    use warnings;
    use Net::Hadoop::Oozie;

    my $oozie = Net::Hadoop::Oozie->new;
    my $job_params = [
        { appName => 'job1', myParam => 'foo' },
        { appName => 'job2', myParam => 'bar' },
        # ...
    ];

    # submit every job and remember the returned IDs
    my @ids;
    for my $job (@$job_params) {
        my $jobid = $oozie->submit_job({
            myParam                     => $job->{myParam},
            debug                       => 0, # set to 1 to print the job config and response
            appName                     => $job->{appName},
            'oozie.wf.application.path' => "/wf_base_path/<workflow name>/",
        });
        push @ids, $jobid;
    }

    # poll until every job has left the in-progress states
    while (defined(my $jobid = shift @ids)) {
        my $status = $oozie->job($jobid)->{status};
        if ($status =~ /(?:WAITING|READY|SUBMITTED|RUNNING)/) {
            push @ids, $jobid; # still in progress: put back in the queue
            sleep 10; # or more, how about 60?
            next; # do not treat an unfinished job as a failure
        }
        # what do you want to do if not succeeded?
        if ($status !~ /SUCCEEDED/) {
            die "job $jobid died";
        }
    }

workflows

UTILITY METHODS

active_coordinators

active_job_paths

coordinators_on_the_same_path

coordinators_with_the_same_appname_on_the_same_path

Returns a hash consisting of duplicated application names for multiple coordinators. Having coordinators like this is usually a user error when submitting jobs.

    my %offenders = $oozie->coordinators_with_the_same_appname_on_the_same_path;
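The idea behind this check can be illustrated on plain data. The sketch below is not the module's implementation and does not reproduce the exact structure of the returned hash, which is not documented here. The helper find_duplicate_coordinators, the sample records, and the use of the coordJobId, coordJobName and coordJobPath fields are all assumptions of this example.

```perl
use strict;
use warnings;

# Made-up sample records, shaped like coordinator entries (assumed fields).
my @coordinators = (
    { coordJobId => 'C-1', coordJobName => 'daily-import',  coordJobPath => '/apps/import' },
    { coordJobId => 'C-2', coordJobName => 'daily-import',  coordJobPath => '/apps/import' },
    { coordJobId => 'C-3', coordJobName => 'hourly-report', coordJobPath => '/apps/report' },
);

# Hypothetical helper: group coordinators by path and application name and
# keep only the groups that contain more than one coordinator.
sub find_duplicate_coordinators {
    my (@coords) = @_;
    my %seen;
    for my $c (@coords) {
        push @{ $seen{ $c->{coordJobPath} }{ $c->{coordJobName} } }, $c->{coordJobId};
    }
    my %dupes;
    for my $path ( keys %seen ) {
        for my $name ( keys %{ $seen{$path} } ) {
            my $ids = $seen{$path}{$name};
            $dupes{$name}{$path} = $ids if @{ $ids } > 1;
        }
    }
    return %dupes;
}

my %offenders = find_duplicate_coordinators(@coordinators);
for my $name ( sort keys %offenders ) {
    for my $path ( sort keys %{ $offenders{$name} } ) {
        print "$name on $path: @{ $offenders{$name}{$path} }\n";
    }
}
```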

failed_workflows_last_n_hours

failed_workflows_last_n_hours_pretty

job_exists

This is a sugar interface on top of the "job" method. Normally the REST interface just dies with an HTTP 400 message on missing jobs. This method won't die and will return the data set if there is a proper response from the service. It will return false otherwise.

    if ( my $job = $oozie->job_exists( $id ) ) {
        # do something
    }
    else {
        warn "No such job: $id";
    }
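The "return false instead of dying" behavior described above is commonly built with eval. The following is a minimal sketch of that pattern, not the module's actual code: fetch_or_false is a hypothetical name, and the stub stands in for a real service call so that the example runs without an Oozie server.

```perl
use strict;
use warnings;

# Hypothetical wrapper: $fetch is any code reference that returns job data
# or dies, the way the "job" method is described to die on missing jobs.
sub fetch_or_false {
    my ($fetch, $id) = @_;
    my $data;
    # The trailing 1 makes eval return true only if $fetch did not die.
    my $ok = eval { $data = $fetch->($id); 1 };
    return $ok ? $data : 0;
}

# Stand-in for a real service call, used only to make the sketch runnable.
my %known = ( 'job-1' => { status => 'RUNNING' } );
my $stub  = sub {
    my ($id) = @_;
    die "HTTP 400: no such job $id\n" unless $known{$id};
    return $known{$id};
};

for my $id ( 'job-1', 'job-2' ) {
    if ( my $job = fetch_or_false( $stub, $id ) ) {
        print "$id is $job->{status}\n";
    }
    else {
        print "No such job: $id\n";
    }
}
```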

kerberos_enabled

Returns true if Kerberos is enabled.

max_node_name_len

Returns the value of MAX_NODE_NAME_LEN, which is hardcoded in the Oozie Java code, by probing the Oozie server version. This is the maximum length of an Oozie action name that can appear in your workflow definitions. If longer action names are deployed and scheduled, the Oozie server will happily schedule a coordinator, but the individual workflow runs will throw exceptions and no part of the job will get executed. Also note that (if you didn't guess already) the Oozie validation function will validate and pass such names (unless you have a recent Oozie version which pushes the validation to the server side).

The relevant part in the Oozie source:

    core/src/main/java/org/apache/oozie/util/ParamChecker.java
    private static final int MAX_NODE_NAME_LEN = {Integer};

Currently there is no way to probe the value of this constant through the APIs, but it is possible to map a limit to certain Oozie versions.

Oozie version 4.3.0 and later sets the limit to 128 while anything older than that will have the value 50 (for the time being).

This method checks the Oozie server version and returns the relevant limit for that version.

See these Oozie Jira tickets for more information:

Checking the limit is especially important if you deploy Oozie jobs with custom code generators (instead of hand-writing all of the XML). This helper method lets you display meaningful exceptions to users, instead of the obscure Oozie ones in the Oozie console.
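The version-to-limit mapping stated above (128 from Oozie 4.3.0 on, 50 before that) can be sketched as follows. This is not the module's code: compare_versions, node_name_limit_for and too_long_action_names are hypothetical helpers, and the sketch assumes a purely numeric dotted version string such as "4.1.0".

```perl
use strict;
use warnings;

# Compare two dotted version strings numerically, component by component.
# Returns -1, 0 or 1, like the <=> operator.
sub compare_versions {
    my ($left, $right) = @_;
    my @l = split /\./, $left;
    my @r = split /\./, $right;
    while ( @l || @r ) {
        my $cmp = ( shift(@l) // 0 ) <=> ( shift(@r) // 0 );
        return $cmp if $cmp;
    }
    return 0;
}

# Map an Oozie version to the action name limit stated in the text above.
sub node_name_limit_for {
    my ($oozie_version) = @_;
    return compare_versions( $oozie_version, '4.3.0' ) >= 0 ? 128 : 50;
}

# Return the action names that exceed the limit for the given version.
sub too_long_action_names {
    my ($oozie_version, @names) = @_;
    my $limit = node_name_limit_for($oozie_version);
    return grep { length($_) > $limit } @names;
}

my @generated = ( 'load_data', 'x' x 60 );   # made-up action names
for my $name ( too_long_action_names( '4.1.0', @generated ) ) {
    warn sprintf "Action name is %d characters long, limit is %d: %s\n",
        length $name, node_name_limit_for('4.1.0'), $name;
}
```

A code generator could run such a check before deployment and report the offending names itself.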

oozie_version

Just a sugar interface on top of "build_version" that tries to return the actual numerical Oozie version without the build string.

    my $oozie_version = $oozie->oozie_version;
    # Something like "4.1.0"
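Stripping a build string down to its numeric part can be done with a single regular expression. The sketch below only illustrates the idea and is not the module's implementation: numeric_version is a hypothetical helper, and the sample build string format is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the leading dotted-number part of a build
# string and drop whatever follows it. Returns undef if there is none.
sub numeric_version {
    my ($build_version) = @_;
    return $build_version =~ /^(\d+(?:\.\d+)*)/ ? $1 : undef;
}

# The build string below is an assumed example of a vendor build version.
print numeric_version('4.1.0-cdh5.4.0'), "\n";
```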

standalone_active_workflows

Returns an arrayref of standalone workflows (as in jobs not attached to a coordinator):

    my $wfs_without_a_coordinator = $oozie->standalone_active_workflows;
    foreach my $wf ( @{ $wfs_without_a_coordinator } ) {
        # do something
    }

suspended_coordinators

Returns an arrayref of suspended coordinators:

    my $suspended = $oozie->suspended_coordinators;
    foreach my $coord ( @{ $suspended } ) {
        # do something
    }

suspended_workflows

Returns an arrayref of suspended workflows:

    my $suspended = $oozie->suspended_workflows;
    foreach my $wf ( @{ $suspended } ) {
        # do something
    }