ABOUT ETLp

ETLp is a configuration-driven ETL framework. Processing is defined within the context of a pipeline job, and a job may in turn call another job. All processing is logged and can be viewed in the Runtime Audit Browser. Any errors or warnings are automatically emailed, so these basic housekeeping tasks do not need to be performed by the developers.

There are two kinds of ETLp jobs:

  • A serial job performs all of its defined tasks in order just once

  • An iterative job performs all of its tasks once for each file that is processed

For example, a serial job might:

  • FTP a tarball from a remote site.

  • Untar the retrieved file.

  • Call an iterative job to process the files.

The called iterative job might then, for each file:

  • Validate its content.

  • Load the file into a database.

  • Compress the processed file.

DEFINING ETLp JOBS

  • A job is defined in a section within an ''application'' configuration file. Jobs are specified using Apache-style configuration syntax.

  • Each job consists of three phases: pre_process, process and post_process, although all are optional. Each phase consists of one or more task items that perform the work. The phases are conceptual, and they make no physical difference to the way that the pipeline functions - they simply help organise or structure the processing.

  • An item is the unit that defines the actual work. All items have a type which informs ETLp of the plugin that should be used to perform the task.

ETLp comes with a number of plugins that perform common tasks. Additional functionality can be added through custom plugins if required.

Control Files

Control files are used to define the format of data files. These definitions are used to validate files and also by the csv loader to map file fields to database columns.
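
As a sketch only - this layout is an assumption, not the documented control file syntax, and the field and column names are hypothetical - a control file for a customer CSV might map file fields to database columns along these lines:

    # conf/control/customer.ctl (hypothetical syntax)
    # file field        database column
    cust_id             customer_id
    cust_name           customer_name
    signup_date         created_date

Consult the ETLp manual for the actual control file format.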

Configuring the environment

ETLp expects a standard application directory setup. All environment variables are defined in the env.conf file. Further information can be found in the section "Creating Your Application Structure" in ETLp::Manual::Install.

Executing Jobs

A job is started by calling etlp <configuration file name> <section>. The configuration file should have an extension of "conf", but the extension does not need to be specified when invoking etlp, as it is assumed. For example:

    etlp sales load_files

Defining Jobs

All jobs are defined using the Apache configuration syntax.

Iterative Jobs

All iterative jobs must contain some or all of the following settings within their config section:

  • filename_format. A regular expression that will identify the data files to be processed

  • incoming_dir. The directory where the job will find files to process

  • archive_dir. Where successfully processed files will be written to

  • fail_dir. Where files that fail processing will be moved to

  • table_name. The name of the table that the data will be loaded into

  • controlfile_dir. The directory that hosts the control file

  • controlfile. The name of the file that defines the data file format

  • on_error. What to do when an error occurs

Note: The on_error value can be one of the following:

  • die. Raise an error and then exit.

  • skip. Raise an error, stop processing the current file, and move to the next file.

  • ignore. Report the error, but carry on processing.

If no value for on_error is specified, the default is die. In addition, the on_error parameter can be set within an item, and the item-level setting will override the job-level setting for that item.
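
Since the item-level setting takes precedence, a single non-critical task can be allowed to fail without stopping the run. The fragment below is illustrative, using only parameters shown elsewhere in this document:

    <config>
        on_error = die
    </config>

    <process>
        <item>
            name     = compress processed file
            type     = gzip
            on_error = ignore
        </item>
    </process>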

These config parameters may be used by the plugins, but any number of additional configuration settings can be defined and then referred to in the item sections.
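
For instance, a custom setting (the reject_dir name here is purely illustrative) can be declared once in the config section and then interpolated into an item using the %parameter% syntax described under "USING CONFIGURATION PARAMETERS":

    <config>
        reject_dir = data/reject
    </config>

    <process>
        <item>
            name    = move rejected file
            type    = os
            command = mv %basename(filename)% %reject_dir%
        </item>
    </process>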

An example iterative job might be:

<process_customers>
    type = iterative

    <config>
        filename_format    = (customer\d.csv)(?:\.gz)?$
        incoming_dir       = data/incoming
        archive_dir        = data/archive
        fail_dir           = data/fail          
        table_name         = stg_customer
        controlfile_dir    = conf/control
        controlfile        = customer.ctl
        on_error           = die
    </config>

    <pre_process>
        <item>
            name     = decompress customer file
            type     = gunzip
        </item>
        <item>
            name        = validate customer file
            type        = validate
            file_type   = csv
            skip        = 1
            <csv_options>
                allow_whitespace = 1
                sep_char         = |
            </csv_options>
        </item>
    </pre_process>
    <process>
        <item>
            name = load customer file
            type = csv_loader
            skip = 1
            <csv_options>
                allow_whitespace = 1
                sep_char         = |
            </csv_options>
        </item>
    </process>
    <post_process>
        <item>
            name = compress customer file
            type = gzip
        </item>
    </post_process>
</process_customers>

Iterative Plugins

The standard plugins for iterative processing are:

  • csv_loader. Loads delimited files.

  • gunzip. Gunzips files.

  • gzip. Gzips files.

  • sql_loader. Loads files using Oracle SQL*Loader.

  • os. Runs an operating system command.

  • perl. Calls a Perl subroutine.

  • plsql. Invokes an Oracle stored procedure or packaged procedure.

  • steady_state_check. Waits for a file's size to stop changing.

  • validate. Validates a delimited file against the control file definition.

Additional plugins can be added. See Writing Custom Plugins.

Serial Jobs

Typically, serial jobs don't require a config section, but may include one if there is a need to share settings or parameter values across multiple items. Below is an example serial job:

    <get_customer_file>
        type = serial

        <config>
            next = customer load_files
        </config>

        <process>
            <item>
                name    = get customer files
                type    = os
                command = %app_root%/bin/ftp_customer_file.pl
            </item>
            <item>
                name    = Untar customer files
                type    = os
                command = tar zxf %app_root%/data/incoming/customer.tar.gz
            </item>
        </process>
    </get_customer_file>

Note: The on_error value can be set in the config section, and may be one of the following:

  • die. Raise an error and then exit.

  • ignore. Report the error, but carry on processing.

Unlike iterative jobs, serial jobs have no concept of "skip," since it does not make sense within this context. The default is ''die''.

Serial Plugins

The standard plugins for serial processing are:

  • os. Runs an operating system command.

  • perl. Calls a Perl subroutine.

  • plsql. Invokes an Oracle stored procedure or packaged procedure.

  • watch. File watcher. Watches a directory, waiting for the appearance of one or more files that match the specified pattern.

Additional plugins can be added. See Writing Custom Plugins.

USING CONFIGURATION PARAMETERS

The value of any parameter specified in the config section can be referenced by enclosing the parameter name in percentage signs. For example:

    mv %basename(filename)% %archive_dir%

Default parameters

The following parameters are maintained by the ETLp framework itself:

  • app_root. The application root directory.

  • filename. The current file being processed. Applicable to iterative jobs only.

  • basename(filename). The current file being processed, stripped of the path information. Applicable to iterative jobs only.
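
These built-in parameters are interpolated like any other config setting. For example (illustrative only), an os item could copy the file currently being processed to a directory defined in the job's own config:

    <item>
        name    = copy current file
        type    = os
        command = cp %filename% %archive_dir%/%basename(filename)%
    </item>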

Environment Configuration Settings

If ''allow_env_vars'' is set in the environment configuration file (env.conf), then values in env.conf can be referred to by their parameter names. This can be useful if there is sensitive information that should not be placed in the application configuration files - for instance, when the application configuration files are committed to a source control system, but passwords are needed to connect to other systems as part of the ETL processing.

However, allowing applications to access environment parameters means that all settings are accessible to all plugins, including custom ones. This may be considered a security issue, which is why these parameters are not accessible by default.
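
As an illustrative sketch - allow_env_vars is the only setting here taken from this document; the ftp_password parameter and the exact env.conf layout are assumptions - a password could be kept out of the application configuration like this:

    # env.conf
    allow_env_vars = 1
    ftp_password   = s3cret

    # application configuration item
    command = %app_root%/bin/ftp_customer_file.pl --password %ftp_password%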

CHAINING JOBS

Use the ''next'' config parameter to call another job upon completion of the current job:

    <my_current_job>
        <config>
            next = <<config name>> <<section name>>
        </config>
    </my_current_job>

Note that the next job will run even if the current job dies with errors.