ABOUT ETLp

ETLp is a configuration-driven ETL framework. Processing is defined within the context of a pipeline job, and a job may in turn call another job. All processing is logged and can be viewed in the Runtime Audit Browser. Any errors or warnings are automatically emailed, so these basic housekeeping tasks do not need to be performed by the developers.

There are two kinds of ETLp jobs:

  • A serial job performs all of its defined tasks in order just once

  • An iterative job performs all of its tasks once for each file that is processed

For example, a serial job might:

  • FTP a tarball from a remote site.

  • Untar the retrieved file.

  • Call an iterative job to process the files.

The called iterative job might then, for each file:

  • Validate its content.

  • Load the file into a database.

  • Compress the processed file.

DEFINING ETLp JOBS

  • A job is defined in a section within an ''application'' configuration file. Jobs are specified using Apache-style configuration syntax.

  • Each job consists of three phases: pre_process, process and post_process, although all are optional. Each phase consists of one or more task items that perform the work. The phases are conceptual, and they make no physical difference to the way that the pipeline functions - they simply help organise or structure the processing.

  • An item is the unit that defines the actual work. All items have a type which informs ETLp of the plugin that should be used to perform the task.

ETLp comes with a number of plugins that perform common tasks. Additional functionality can be added through custom plugins if required.

Control Files

Control files are used to define the format of data files. These definitions are used to validate files and also by the csv loader to map file fields to database columns.
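
As a sketch only - this layout is an assumption, not the documented control file syntax, and the field and column names are hypothetical - a control file for a customer CSV might map file fields to database columns along these lines:

    # conf/control/customer.ctl (hypothetical syntax)
    # file field        database column
    cust_id             customer_id
    cust_name           customer_name
    signup_date         created_date

Consult the ETLp manual for the actual control file format.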

Configuring the environment

ETLp expects a standard application directory setup. All environment variables are defined in the env.conf file. Further information can be found in the section "Creating Your Application Structure" in ETLp::Manual::Install.

Executing Jobs

A job is started by calling etlp <configuration file name> <section>. The configuration file should have an extension of "conf", but the extension does not need to be specified when invoking etlp, as it is assumed. For example:

    etlp sales load_files

Defining Jobs

All jobs are defined using the Apache configuration syntax.

Iterative Jobs

All iterative jobs must contain some or all of the following settings within their config section:

  • filename_format. A regular expression that will identify the data files to be processed

  • incoming_dir. The directory where the job will find files to process

  • archive_dir. Where successfully processed files will be written to

  • fail_dir. Where files that fail processing will be moved to

  • table_name. The name of the table that the data will be loaded into

  • controlfile_dir. The directory that hosts the control file

  • controlfile. The name of the file that defines the data file format

  • on_error. What to do when an error occurs

Note: The on_error value can be one of the following:

  • die. Raise an error and then exit.

  • skip. Raise an error, stop processing the current file, and move to the next file.

  • ignore. Report the error, but carry on processing.

If no value for on_error is specified, the default is die. In addition, the on_error parameter can be set within an item, and the item-level setting will override the job-level setting for that item.
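
Since the item-level setting takes precedence, a single non-critical task can be allowed to fail without stopping the run. The fragment below is illustrative, using only parameters shown elsewhere in this document:

    <config>
        on_error = die
    </config>

    <process>
        <item>
            name     = compress processed file
            type     = gzip
            on_error = ignore
        </item>
    </process>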

These config parameters may be used by the plugins, but any number of additional configuration settings can be defined and then referred to in the item sections.
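
For instance, a custom setting (the reject_dir name here is purely illustrative) can be declared once in the config section and then interpolated into an item using the %parameter% syntax described under "USING CONFIGURATION PARAMETERS":

    <config>
        reject_dir = data/reject
    </config>

    <process>
        <item>
            name    = move rejected file
            type    = os
            command = mv %basename(filename)% %reject_dir%
        </item>
    </process>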

An example iterative job might be:

<process_customers>
    type = iterative

    <config>
        filename_format    = (customer\d.csv)(?:\.gz)?$
        incoming_dir       = data/incoming
        archive_dir        = data/archive
        fail_dir           = data/fail          
        table_name         = stg_customer
        controlfile_dir    = conf/control
        controlfile        = customer.ctl
        on_error           = die
    </config>

    <pre_process>
        <item>
            name     = decompress customer file
            type     = gunzip
        </item>
        <item>
            name        = validate customer file
            type        = validate
            file_type   = csv
            skip        = 1
            <csv_options>
                allow_whitespace = 1
                sep_char         = |
            </csv_options>
        </item>
    </pre_process>
    <process>
        <item>
            name = load customer file
            type = csv_loader
            skip = 1
            <csv_options>
                allow_whitespace = 1
                sep_char         = |
            </csv_options>
        </item>
    </process>
    <post_process>
        <item>
            name = compress customer file
            type = gzip
        </item>
    </post_process>
</process_customers>

Iterative Plugins

The standard plugins for iterative processing are:

  • csv_loader. Loads delimited files.

  • gunzip. Gunzips files.

  • gzip. Gzips files.

  • sql_loader. Loads files using Oracle SQL*Loader.

  • os. Runs an operating system command.

  • perl. Calls a Perl subroutine.

  • plsql. Invokes an Oracle stored procedure or packaged procedure.

  • steady_state_check. Waits for a file's size to stop changing.

  • validate. Validates a delimited file against the control file definition.

Additional plugins can be added. See Writing Custom Plugins.

Serial Jobs

Typically, serial jobs don't require a config section, but may include one if there is a need to share settings or parameter values across multiple items. Below is an example serial job:

    <get_customer_file>
        type = serial

        <config>
            next = customer load_files
        </config>

        <process>
            <item>
                name    = get customer files
                type    = os
                command = %app_root%/bin/ftp_customer_file.pl
            </item>
            <item>
                name    = Untar customer files
                type    = os
                command = tar zxf %app_root%/data/incoming/customer.tar.gz
            </item>
        </process>
    </get_customer_file>

Note: The on_error value can be set in the config section, and may be one of the following:

  • die. Raise an error and then exit.

  • ignore. Report the error, but carry on processing.

Unlike iterative jobs, serial jobs have no concept of "skip," since it does not make sense within this context. The default is ''die''.

Serial Plugins

The standard plugins for serial processing are:

  • os. Runs an operating system command.

  • perl. Calls a Perl subroutine.

  • plsql. Invokes an Oracle stored procedure or packaged procedure.

  • watch. File watcher. Watches a directory, waiting for the appearance of one or more files that match the specified pattern.

Additional plugins can be added. See Writing Custom Plugins.

USING CONFIGURATION PARAMETERS

The value of any parameter specified in the config section can be referenced by enclosing the parameter name in percentage signs. For example:

    mv %basename(filename)% %archive_dir%

Default parameters

The following parameters are maintained by the ETLp framework itself:

  • app_root. The application root directory.

  • filename. The current file being processed. Applicable to iterative jobs only.

  • basename(filename). The current file being processed, stripped of the path information. Applicable to iterative jobs only.
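
These built-in parameters are interpolated like any other config setting. For example (illustrative only), an os item could copy the file currently being processed to a directory defined in the job's own config:

    <item>
        name    = copy current file
        type    = os
        command = cp %filename% %archive_dir%/%basename(filename)%
    </item>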

Environment Configuration Settings

If ''allow_env_vars'' is set in the environment configuration file (env.conf), then values in env.conf can be referred to by their parameter names. This can be useful if there is sensitive information that should not be placed in the application configuration files - for instance, when the application configuration files are committed to a source control system, but passwords are needed to connect to other systems as part of the ETL processing.

However, allowing applications to access environment parameters means that all settings are accessible to all plugins, including custom ones. This may be considered a security issue, which is why these parameters are not accessible by default.
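
As an illustrative sketch - allow_env_vars is the only setting here taken from this document; the ftp_password parameter and the exact env.conf layout are assumptions - a password could be kept out of the application configuration like this:

    # env.conf
    allow_env_vars = 1
    ftp_password   = s3cret

    # application configuration item
    command = %app_root%/bin/ftp_customer_file.pl --password %ftp_password%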

CHAINING JOBS

Use the ''next'' config parameter to call another job upon completion of the current job:

    <my_current_job>
        <config>
            next = <<config name>> <<section name>>
        </config>
    </my_current_job>

Note that the next job will run even if the current job dies with errors.