NAME

ETL::Pipeline - Extract-Transform-Load pattern for data file conversions

SYNOPSIS

  use ETL::Pipeline;

  # The object oriented interface...
  ETL::Pipeline->new( {
    work_in   => {search => 'C:\Data', iname => qr/Ficticious/},
    input     => ['Excel', iname => qr/\.xlsx?$/              ],
    mapping   => {Name => 'A', Address => 'B', ID => 'C'      },
    constants => {Type => 1, Information => 'Demographic'     },
    output    => ['Memory', key => 'ID'                       ],
  } )->process;

  # Or using method calls...
  my $etl = ETL::Pipeline->new;
  $etl->work_in  ( search => 'C:\Data', iname => qr/Ficticious/ );
  $etl->input    ( 'Excel', iname => qr/\.xlsx?$/i              );
  $etl->mapping  ( Name => 'A', Address => 'B', ID => 'C'       );
  $etl->constants( Type => 1, Information => 'Demographic'      );
  $etl->output   ( 'Memory', key => 'ID'                        );
  $etl->process;

DESCRIPTION

ETL stands for Extract-Transform-Load. ETL isn't just for Data Warehousing. ETL works on almost any type of data conversion. You read the source, translate the data for your target, and store the result.

By dividing a conversion into 3 steps, we isolate the input from the output...

  • Centralizes data formatting and validation.

  • Makes new input formats a breeze.

  • Makes new outputs just as easy.

ETL::Pipeline takes your data files from extract to load. It reads an input source, translates the data, and writes it to an output destination. For example, this pipeline reads an Excel spreadsheet (input) and saves the information in a Perl hash (output).

  use ETL::Pipeline;
  ETL::Pipeline->new( {
    work_in   => {search => 'C:\Data', find => qr/Ficticious/},
    input     => ['Excel', find => qr/\.xlsx?$/],
    mapping   => {Name => 'A', Complaint => 'B', ID => 'C'},
    constants => {Client => 1, Type => 'Complaint'},
    output    => ['Memory', key => 'ID'],
  } )->process;

Or like this, calling the methods instead of through the constructor...

  use ETL::Pipeline;
  my $etl = ETL::Pipeline->new;
  $etl->work_in  ( search => 'C:\Data', find => qr/Ficticious/ );
  $etl->input    ( 'Excel', find => qr/\.xlsx?$/               );
  $etl->mapping  ( Name => 'A', Complaint => 'B', ID => 'C'    );
  $etl->constants( Client => 1, Type => 'Complaint'            );
  $etl->output   ( 'Memory', key => 'ID'                       );
  $etl->process;

These are equivalent. They do exactly the same thing. You can pick whichever best suits your style.

What is a pipeline?

The term pipeline describes a complete ETL process - extract, transform, and load. Or more accurately - input, mapping, output. Raw data enters one end of the pipe (input) and useful information comes out the other (output). An ETL::Pipeline object represents a complete pipeline.

Upgrade Warning

WARNING: The API for input sources has changed in version 3.00. Custom input sources written for an earlier version will not work with version 3.00 and later. You will need to re-write your custom input sources.

See ETL::Pipeline::Input for more details.

METHODS & ATTRIBUTES

Managing the pipeline

new

Create a new ETL pipeline. The constructor accepts a hash reference whose keys are ETL::Pipeline attributes. See the corresponding attribute documentation for details about acceptable values.

aliases
constants
data_in
input
mapping
on_record
output
session
work_in

aliases

aliases defines alternate names for input fields. This is how column headers work, for example. You can define your own shortcuts using this method or by declaring aliases in "new". Aliases can make complex field names more readable.

aliases accepts a list of hash references. Each hash reference has one or more alias-to-field definitions. The hash key is the alias name. The value is any field name recognized by "get".

Aliases are resolved in the order they are added. That way, your pipelines know where each value came from, if that's important. Aliases set by the input source always sort before aliases set by the script. Within a hash, all definitions are considered equal and may sort in any order.

  # Array definitions to force sorting.
  my $etl = ETL::Pipeline->new( {aliases => [{A => '0'}, {B => '1'}], ...} );
  $etl->aliases( {C => '2'}, {D => '3'} );

  # Hash where it can sort either way.
  my $etl = ETL::Pipeline->new( {aliases => {A => '0', B => '1'}, ...} );
  $etl->aliases( {C => '2', D => '3'} );

aliases returns a sorted list of all aliases for fields in this input source.

I recommend using the hash, unless order matters. In that case, use the array form instead.

Special Note: Custom input sources call aliases to add their own shortcuts, such as column headers. These aliases are always evaluated before those set by "new" or calls to this method by the script.
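
Once defined, an alias works anywhere "get" accepts a field name, including "mapping". A short sketch with hypothetical field names...

  # Alias column "A" as "FullName", then use the alias in the mapping...
  $etl->aliases( {FullName => 'A'} );
  $etl->mapping( Name => 'FullName' );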

chain

This method creates a new pipeline using the same "work_in", "data_in", and "session" as the current pipeline. It returns a new instance of ETL::Pipeline.

chain takes the same arguments as "new". It passes those arguments through to the constructor of the new object.

See the section on "Multiple input sources" for examples of chaining.

constants

constants sets output fields to literal values. "mapping" accepts input field names as strings. Instead of obtuse Perl tricks for marking literals, constants explicitly handles them.

Hash keys are output field names. The "output" class defines acceptable field names. The hash values are literals.

  # Set the output field "Name" to the string "John Doe"...
  $etl->constants( Name => 'John Doe' );

  # Get the current list of constants...
  my $transformation = $etl->constants;

Note: constants does not accept code references, array references, or hash references. It only works with literal values. Use "mapping" instead for calculated items.

With no parameters, constants returns the current hash reference. If you pass in a hash reference, constants replaces the current hash with this new one. If you pass in a list of key value pairs, constants adds them to the current hash.
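
For example, the three calling styles look like this...

  # Get the current hash reference...
  my $constants = $etl->constants;

  # Add these key/value pairs to the current hash...
  $etl->constants( Client => 1 );

  # Replace the current hash with this new one...
  $etl->constants( {Type => 'Complaint'} );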

data_in

The working directory ("work_in") usually contains the raw data files. In some cases, though, the actual data sits in a subdirectory underneath "work_in". data_in tells the pipeline where to find the input file.

data_in accepts a search pattern - name, glob, or regular expression. It searches "work_in" for the first matching directory. The search is case insensitive.

If you pass an empty string to data_in, the pipeline resets data_in to the "work_in" directory. This is useful when chaining pipelines. If one changes the data directory, the next in line can change back.
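
A minimal sketch, assuming the pattern is passed as the only argument and the "Raw" directory name is hypothetical...

  # Find the first subdirectory whose name matches "Raw"...
  $etl->data_in( qr/Raw/ );

  # Reset back to the "work_in" directory...
  $etl->data_in( '' );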

input

input sets and returns the ETL::Pipeline::Input object. This object reads the data. With no parameters, input returns the current ETL::Pipeline::Input object.

  my $source = $etl->input();

Set the input source by calling input with parameters...

  $etl->input( 'Excel', find => qr/\.xlsx/i );

The first parameter is a class name. input looks for a Perl module matching this name in the ETL::Pipeline::Input namespace. In this example, the actual class name becomes ETL::Pipeline::Input::Excel.

The rest of the parameters are passed directly to the new method of that class.

Technical Note: Want to use a custom class from Local instead of ETL::Pipeline::Input? Put a + (plus sign) in front of the class name. For example, this command uses the input class Local::CustomExtract.

  $etl->input( '+Local::CustomExtract' );

mapping

mapping ties the input fields with the output fields. Hash keys are output field names. The "output" class defines acceptable field names. The hash values can be anything accepted by the "get" method. See "get" for more information.

  # Add the output field "Name" with data from input column "A"...
  $etl->mapping( Name => 'A' );

  # Change "Name" to get data from "Full Name" or "FullName"...
  $etl->mapping( Name => qr/Full\s*Name/i );

  # "Name" gets the lower case of input column "A"...
  $etl->mapping( Name => sub {
    my ($etl, $record) = @_;
    return lc $record->{A};
  } );

If "get" returns an ARRAY reference (aka multiple values), they will be concatenated in the output with a semi-colon between values - ; . You can override the seperator by setting the value to an ARRAY reference. The first element is a regular field name for "get". The second element is a new seperator string.

  # Slashes between multiple names.
  $etl->mapping( Name => [qr/Name/i, ' / '] );
  
  # These will do the same thing - semi-colon between multiple names.
  $etl->mapping( Name => [qr/Name/i, '; '] );
  $etl->mapping( Name => qr/Name/i );

With no parameters, mapping returns the current hash reference. If you pass in a hash reference, mapping replaces the current hash with this new one. If you pass in a list of key value pairs, mapping adds them to the current hash.

  # Get the current mapping...
  my $transformation = $etl->mapping;

  # Add the output field "Name" with data from input column "A"...
  $etl->mapping( Name => 'A' );

  # Replace the entire mapping so only "Name" is output...
  $etl->mapping( {Name => 'C'} );

Want to save a literal value? Use "constants" instead.

Complex data structures

mapping only sets scalar values. If the matching fields contain sub-records, "record" throws an error message and sets the output field to undef.

Fully customized mapping

mapping accepts a CODE reference in place of the hash. In this case, "record" executes the code and uses the return value as the record to send "output". The CODE should return a hash reference for success or undef if there is an error.

  # Execute code instead of defining the output fields.
  $etl->mapping( sub { ... } );

  # These are the same.
  $etl->mapping( {Name => 'A'} );
  $etl->mapping( sub {
    my $etl = shift; 
    return {Name => $etl->get( 'A' )};
  } );

Returning undef saves an empty record. To print an error message, have your code call "status" with a type of ERROR.

  # Return an empty record.
  $etl->mapping( sub { undef; } );
  
  # Print an error message.
  $etl->mapping( sub {
    ...
    $etl->status( 'ERROR', 'There is no data!' );
    return undef;
  });

The results of "constants" are folded into the resulting hash reference. Fields set by mapping override constants.

  # Output record has two fields - "Extra" and "Name".
  $etl->constants( Extra => 'ABC' );
  $etl->mapping( sub { {Name => shift->get( 'A' )} } );
  
  # Output record has only one field, with the value from the input record.
  $etl->constants( Name => 'ABC' );
  $etl->mapping( sub { {Name => shift->get( 'A' )} } );  

"record" passes two parameters into the CODE reference - the ETL::Pipeline object and the raw data record.

  $etl->mapping( sub {
    my ($etl, $record) = @_;
    ...
  } );

WARNING: This is considered an advanced feature and should be used sparingly. You will find the name => field format easier to maintain.

on_record

Executes a customized subroutine on every record before any mapping. The code can modify the record and your changes will feed into the mapping. You can use on_record for filtering, debugging, or just about anything.

on_record accepts a code reference. "record" executes this code for every input record.

The code reference receives two parameters - the ETL::Pipeline object and the input record. The record is passed as a hash reference. If on_record returns a false value, "record" will never send this record to the output destination. It's as if this record never existed.

  ETL::Pipeline->new( {
    ...
    on_record => sub {
      my ($etl, $record) = @_;
      foreach my $field (keys %$record) {
        my $value = $record->{$field};
        $record->{$field} = ($value eq 'NA' ? '' : $value);
      }
      return 1;   # Return true so the record is not filtered out.
    },
    ...
  } )->process;

  # -- OR --
  $etl->on_record( sub {
    my ($etl, $record) = @_;
    foreach my $field (keys %$record) {
      my $value = $record->{$field};
      $record->{$field} = ($value eq 'NA' ? '' : $value);
    }
    return 1;   # Return true so the record is not filtered out.
  } );
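
Since a false return value drops the record, on_record also works as a filter. A minimal sketch, assuming the input records carry an ID field...

  # Skip any record without an ID...
  $etl->on_record( sub {
    my ($etl, $record) = @_;
    return (defined( $record->{ID} ) && $record->{ID} ne '') ? 1 : 0;
  } );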

Note: "record" automatically removes leading and trailing whitespace. You do not need on_record for that.

output

output sets and returns the ETL::Pipeline::Output object. This object writes records to their final destination. With no parameters, output returns the current ETL::Pipeline::Output object.

Set the output destination by calling output with parameters...

  $etl->output( 'SQL', table => 'NewData' );

The first parameter is a class name. output looks for a Perl module matching this name in the ETL::Pipeline::Output namespace. In this example, the actual class name becomes ETL::Pipeline::Output::SQL.

The rest of the parameters are passed directly to the new method of that class.

Technical Note: Want to use a custom class from Local instead of ETL::Pipeline::Output? Put a + (plus sign) in front of the class name. For example, this command uses the output class Local::CustomLoad.

  $etl->output( '+Local::CustomLoad' );

process

process kicks off the entire data conversion process. It takes no parameters. All of the setup is done by the other methods.

process returns the ETL::Pipeline object so you can do things like this...

  ETL::Pipeline->new( {...} )->process->chain( ... )->process;

session

ETL::Pipeline supports sessions. A session allows input and output objects to share information along a chain. For example, imagine 3 Excel files being loaded into an Access database. All 3 files go into the same Access database. The first pipeline creates the database and saves its path in the session. That pipeline chains with a second pipeline. The second pipeline retrieves the Access filename from the session.

The session method provides access to session level variables. As you write your own ETL::Pipeline::Output classes, they can use session variables for sharing information.

The first parameter is the variable name. If you pass only the variable name, session returns the value.

  my $database = $etl->session( 'access_file' );
  my $identifier = $etl->session( 'session_identifier' );

A second parameter is the value.

  $etl->session( access_file => 'C:\ExcelData.accdb' );

You can set multiple variables in one call.

  $etl->session( access_file => 'C:\ExcelData.accdb', name => 'Abe' );

If you pass in a hash reference, it completely replaces the current session with the new values.
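
For example, this call discards any existing session variables and keeps only these two...

  # Replace the entire session...
  $etl->session( {access_file => 'C:\ExcelData.accdb', name => 'Abe'} );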

When retrieving an array or hash reference, session automatically dereferences it if called in a list context. In a scalar context, session returns the reference.

  # Returns the list of names as a list.
  foreach my $name ($etl->session( 'name_list' )) { ... }

  # Returns a list reference instead of a list.
  my $reference = $etl->session( 'name_list' );

session_has

session_has checks for a specific session variable. It returns true if the variable exists and false if it doesn't.

session_has only checks existence. It does not tell you if the value is defined.

  if ($etl->session_has( 'access_file' )) { ... }

work_in

The working directory sets the default place for finding files. All searches start here and only descend subdirectories. Temporary or output files go into this directory as well.

work_in has two forms: work_in( 'C:\Data' ); or work_in( root => 'C:\Data', iname => 'Ficticious' );.

The first form specifies the exact directory path. In our example, the working directory is C:\Data.

The second form searches the file system for a matching directory. Take this example...

  $etl->work_in( root => 'C:\Data', iname => 'Ficticious' );

It scans the C:\Data directory for a subdirectory named Ficticious, like this: C:\Data\Ficticious. The search is not recursive. It only matches directories directly inside the root folder.

work_in accepts any of the tests provided by Path::Iterator::Rule. The values of these arguments are passed directly into the test. For boolean tests (e.g. readable, exists, etc.), pass an undef value.

work_in automatically applies the directory filter. Do not set it yourself.

iname is the most common one that I use. It matches the file name, supports wildcards and regular expressions, and is case insensitive.

  # Search using a regular expression...
  $etl->work_in( iname => qr/\.xlsx$/, root => 'C:\Data' );

  # Search using a file glob...
  $etl->work_in( iname => '*.xlsx', root => 'C:\Data' );

The code throws an error if no directory matches the criteria. Only the first match is used.

work_in automatically resets "data_in".

Used in mapping

These methods may be used by code references in the "mapping" attribute. They will return information about/from the current record.

count

This attribute tells you how many records have been read from the input source. This value is incremented before any filtering. So it even counts records that are bypassed by "on_record".

The first record is always number 1.
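
For example, a mapping code reference can save the record number in the output (a sketch; the RowNumber output field is hypothetical)...

  # Store the input record number in the output field "RowNumber"...
  $etl->mapping( RowNumber => sub { shift->count } );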

get

Retrieve the value from the record for the given field name. This method accepts two parameters - a field name and the current record. It returns the exact value found at the matching node.

Field names

The field name can be...

A string containing a hash key
An array index (all digits)
A string containing a Data::DPath path (starts with /)
A regular expression reference
A code reference

Hash keys and regular expressions match both field names and aliases. These are the only types that match aliases. Hash keys cannot be all digits and cannot begin with the / character. Otherwise get mis-identifies them.

get interprets strings of all digits as array indexes (numbers). Excel files, for example, return an array instead of a hash. And this is an easy way to reference columns in order.

get treats a string beginning with a / (slash) as a Data::DPath path. This lets you very specifically traverse a complex data structure, such as those from XML or JSON.

For a regular expression, get matches hash keys at the top level of the data structure plus aliases.

With a code reference, get executes the subroutine. The return value becomes the field value. The code reference is called in a scalar context. If you need to return multiple values, then return an ARRAY or HASH reference.

A code reference receives two parameters - the ETL::Pipeline object and the current record.
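
Putting those together, here is a sketch of each field name type. The actual field names and paths are hypothetical...

  # Hash key...
  $etl->get( 'Name' );

  # Array index - for inputs such as Excel that return an array...
  $etl->get( '0' );

  # Data::DPath path - starts with a slash...
  $etl->get( '/File/People/*/Name' );

  # Regular expression - matches field names and aliases...
  $etl->get( qr/full\s*name/i );

  # Code reference - the return value becomes the field value...
  $etl->get( sub { my ($etl, $record) = @_; return $record->{Name}; } );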

Current record

The current record is optional. get will use "this" if you do not pass in a record. By accepting a record, you can use get on sub-records. So by default, get returns a value from the top record. Use the second parameter to retrieve values from a sub-record.

get only applies aliases when using "this". Aliases do not apply to sub-records.

Return value

get always returns a scalar value, but not always a string. The return value might be a string, ARRAY reference, or HASH reference.

get does not flatten out the nodes that it finds. It merely returns a reference to whatever is in the data structure at the named point. The calling code must account for the possibility of finding an array or hash or string.

this

The current record. The "record" method sets this before it does anything else. "get" will use this if you don't pass in a record. It makes a convenient shortcut so you don't have to pass the record into every call.

this can be any valid Perl data. Usually a hash reference or array reference. The input source controls the type.
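
For example, a mapping code reference can read straight off the current record (a sketch that assumes a hash based input record)...

  # Read column "A" directly from the current record...
  $etl->mapping( Name => sub { shift->this->{A} } );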

Used by input sources

aliases (see above)

Your input source can use the "aliases" method documented above to set column headers as field names. Excel files, for example, would call "aliases" to assign letters to column numbers, like a real spreadsheet.

record

The input source calls this method for each data record. This is where ETL::Pipeline applies the mapping and constants, and sends the results on to the output destination.

record takes one parameter - the current record. The record can be any Perl data structure - hash, array, or scalar. record uses Data::DPath to traverse the structure.

record calls "get" on each field in "mapping". /get traverses the data structure retrieving the correct values. record concatenates multiple matches into a single, scalar value for the output.

status

This method displays a status message. ETL::Pipeline calls this method to report on the progress of the pipeline. It takes one or two parameters - the message type (required) and the message itself (optional).

The type can be anything. These are the ones that ETL::Pipeline uses...

DEBUG

Messages used for debugging problems. You should only use these temporarily to look for specific issues. Otherwise they clog up the display for the end user.

END

The pipeline has finished. The input source is closed. The output destination is still open. It will be closed immediately after. There is no message text.

ERROR

Report an error message to the user. These are not necessarily fatal errors.

INFO

An informational message to the user.

START

The pipeline is just starting. The output destination is open. But the input source is not. There is no message text.

STATUS

Progress update. This is sent after every input record.

See "Custom logging" for information about adding your own log method.

Utility Functions

These methods can be used inside "mapping" code references. Unless otherwise noted, these all work on the current record.

  my $etl = ETL::Pipeline->new( {
    ...
    mapping => {A => sub { shift->function( ... ) }},
    ...
  } );

coalesce

Emulates the SQL Server COALESCE command. It takes a list of field names for "get" and returns the value of the first non-blank field.

  # First non-blank field
  $etl->coalesce( 'Patient', 'Complainant', 'From' );

  # Actual value if no non-blank fields
  $etl->coalesce( 'Date', \$today );

In the first example, coalesce looks at the Patient field first. If it's blank, then coalesce looks at the Complainant field. Same thing - if it's blank, coalesce returns the From field.

Blank means undef, empty string, or all whitespace. This is different than the SQL version.

The second example shows an actual value passed as a scalar reference. Because it's a reference, coalesce recognizes that it is not a field name for "get". coalesce uses the value in $today if the Date field is blank.

coalesce returns an empty string if all of the fields are blank.

foreach

Executes a CODE reference against repeating sub-records. XML files, for example, have repeating nodes. foreach allows you to format multiple fields from the same record. It looks like this...

  # Capture the resulting strings.
  my @results = $etl->foreach( sub { ... }, '/File/People' );

  # Combine the resulting strings.
  join( '; ', $etl->foreach( sub { ... }, '/File/People' ) );

foreach calls "get" to retrieve a list of sub-records. It replaces "this" with each sub-record in turn and executes the code reference. You can use any of the standard unitlity functions inside the code reference. They will operate only on the current sub-record.

foreach returns a single string per sub-record. Blank strings are discarded. Blank means undef, empty strings, or all whitespace. You can filter sub-records by returning undef from the code reference.

For example, you might do something like this to format names from XML...

  # Format names "last, first" and put a semi-colon between multiple names.
  $etl->format( '; ', $etl->foreach(
    sub { $etl->format( ', ', '/Last', '/First' ) },
    '/File/People'
  ) );

  # Same thing, but using parameters.
  $etl->format( '; ', $etl->foreach(
    sub {
      my ($object, $record) = @_;
      $object->format( ', ', '/Last', '/First' )
    },
    '/File/People'
  ) );

foreach passes two parameters to the code reference...

The current ETL::Pipeline object.
The current sub-record. This will be the same value as "this".

The code reference should return a string. If it returns an ARRAY reference, foreach flattens it, discarding any blank elements. So if you have to return multiple values, foreach tries to do something intelligent.

foreach sets "this" before executing the CODE reference. The code can call any of the other utility functions with field names relative to the sub-record. Please note, the code cannot access fields outside of the sub-record. Instead, cache these in a local variable before called foreach.

  my $x = $etl->get( '/File/MainPerson' );
  join( '; ', $etl->foreach( sub {
    my $y = $etl->format( ', ', '/Last', '/First' );
    "$y is with $x";
  }, '/File/People' ) );

Calling foreach

foreach accepts the code reference as the first parameter. All remaining parameters are field names. foreach passes them through "get" one at a time. Each field should resolve to a repeating node.

foreach returns a list. The list may be empty or have one element. But it is always a list. You can use Perl functions such as join to convert the list into a single value.

format

Builds a string from a list of fields, discarding blank fields. That's the main purpose of the function - to keep entirely blank fields out of the result. This prevents things like orphaned commas from showing up in your data.

format can either concatenate (join) fields or format them (sprintf). A SCALAR reference signifies a format. A regular string indicates concatenation.

  # Concatenate fields (aka join)
  $etl->format( "\n\n", 'D', 'E', 'F' );

  # Format fields (aka sprintf)
  $etl->format( \'%s, %s (%s)', 'D', 'E', 'F' );

You can nest constructs with an ARRAY reference. The separator or format string is the first element. The remaining elements are more fields (or other nested ARRAY references). Basically, format recursively calls itself passing the array as parameters.

  # Blank lines between. Third line is two fields separated by a space.
  $etl->format( "\n\n", 'D', 'E', [' ', 'F', 'G'] );

  # Blank lines between. Third line is formatted.
  $etl->format( "\n\n", 'D', 'E', [\'-- from %s %s', 'F', 'G'] );

Blank means undef, empty string, or all whitespace. format returns an empty string if all of the fields are blank.

Format until

format optionally accepts a CODE reference to stop processing early. format passes each value into the code reference. If the code returns true, then format stops processing fields and returns. The code reference comes before the separator/format.

  # Concatenate fields until one of them is the word "END".
  $etl->format( sub { $_ eq 'END' }, "\n\n", '/*[idx > 8]' );

format sets $_ to the field value. It also passes the value as the first and only parameter. Your code can use either $_ or shift to access the value.

You can include code references inside an ARRAY reference too. The code only stops processing inside that substring. It continues processing the outer set of fields after the ARRAY.

  # The last line concatenates fields until one of them is the word "END".
  $etl->format( "\n\n", 'A', 'B', [sub { $_ eq 'END' }, ' ', '/*[idx > 8]'] );

  # Do the conditional concatenate in the middle. Results in 3 lines.
  $etl->format( "\n\n", 'A', [sub { $_ eq 'END' }, ' ', '/*[idx > 8]'], 'B' );

What happens if you have a CODE reference and an ARRAY reference, like this?

  $etl->format( sub { $_ eq 'END' }, "\n\n", 'A', [' ', 'B', 'C'], 'D' );

format retrieves the ARRAY reference as a single string. It then sends that entire string through the CODE reference. If the code returns true, processing stops. In other words, format treats the results of an ARRAY reference just like any other field.

from

Return data from a hash, like the one from ETL::Pipeline::Output::Memory. The first parameter is the hash reference. The remaining parameters are field names whose values become the hash keys. It's a convenient shorthand for accessing a hash, with all of the error checking built in.

  $etl->from( $etl->output->hash, qr/myID/i, qr/Site/i );

To pass a string literal, use a scalar reference.

  $etl->from( \%hash, qr/myID/i, \'Date' );

This is equivalent to...

  $hash{$etl->get( qr/myID/i )}->{'Date'}

from returns undef if any one key does not exist.

from automatically dereferences arrays. So if you store multiple values, the function returns them as a list instead of an array reference. Scalar values and hash references are returned as-is.

name

Format fields as a person's name. Names are common data elements. This function provides a common format. Yet it is flexible enough to handle customization.

  # Simple name formatted as "last, first".
  $etl->name( 'Last', 'First' );

  # Simple name formatted "first last". The format is the first element.
  $etl->name( \'%s %s', 'First', 'Last' );

  # Add a role or description in parenthesis, if it's there.
  $etl->name( 'Last', 'First', ['Role'] );

  # Add two fields with a custom format if at least one exists.
  $etl->name( 'Last', 'First', [\'(%s; %s)', 'Role', 'Type'] );

  # Same thing, but only adds the semi-colon if both values are there.
  $etl->name( 'Last', 'First', [['; ', 'Role', 'Type']] );

  # Long hand way of writing the above.
  $etl->name( 'Last', 'First', [\'(%s)', ['; ', 'Role', 'Type']] );

If name doesn't do what you want, try "build". "build" is more flexible. As a matter of fact, name calls "build" internally.

piece

Split a string and extract one or more of the individual pieces. This can come in handy with file names, for example. A file split on the period has two pieces - the name and the extension, piece 1 and piece 2 respectively. Here are some examples...

  # File name: Example.JPG
  # Returns: Example
  $etl->piece( 'Filename', qr|\.|, 1 );

  # Returns: JPG
  $etl->piece( 'Filename', qr|\.|, 2 );

piece takes a minimum of 3 parameters...

1. Any field name valid for "get"
2. Regular expression for splitting the field
3. Piece number to extract (the first piece is 1, not 0)

piece accepts any field name valid with "get". Multiple values are concatenated with a single space. You can specify a different separator using the same syntax as "mapping" - an array reference. In that array, the first element is the field name and the second is the separator string.

The second parameter for piece is a regular expression. piece passes this to split and breaks apart the field value.

The third parameter returns one or more pieces from the split string. In the simplest form, this is a single number. And piece returns that piece from the split string. Note that pieces start at number 1, not 0 like array indexes.

A negative piece number starts from the end of the string. For example, -2 returns the second to last piece. You can also include a length - number of pieces to return starting at the given position. The default length is 1.

  # Filename: abc_def_ghi_jkl_mno_pqr
  # Returns: abc def
  $etl->piece( 'Filename', qr/_/, '1,2' );

  # Returns: ghi jkl mno
  $etl->piece( 'Filename', qr/_/, '3,3' );

  # Returns: mno pqr
  $etl->piece( 'Filename', qr/_/, '-2,2' );

Notice that the multiple pieces are re-joined using a space. You can specify the separator string after the length. Do not put spaces after the commas. piece will mistakenly use them as part of the separator.

  # Filename: abc_def_ghi_jkl_mno_pqr
  # Returns: abc+def
  $etl->piece( 'Filename', qr/_/, '1,2,+' );

  # Returns: ghi,jkl,mno
  $etl->piece( 'Filename', qr/_/, '3,3,,' );

  # Returns: ghi -jkl -mno
  $etl->piece( 'Filename', qr/_/, '3,3, -' );

A blank length returns all pieces from the start position to the end, just like the Perl splice function.

  # Filename: abc_def_ghi_jkl_mno_pqr
  # Returns: ghi jkl mno pqr
  $etl->piece( 'Filename', qr/_/, '3,' );

  # Returns: ghi+jkl+mno+pqr
  $etl->piece( 'Filename', qr/_/, '3,,+' );

Recursive pieces

Imagine a name like Public, John Q., MD. How would you parse out the middle initial by hand? First, you split the string on the comma. Next, you split the second piece of that on a space. piece lets you do the same thing.

  # Name: Public, John Q., MD
  # Returns: Q.
  $etl->piece( 'Name', qr/,/, 2, qr/ /, 2 );

  # Returns: John
  $etl->piece( 'Name', qr/,/, 2, qr/ /, 1 );

piece takes the result of the first split and uses it as the input to the second split. It will continue to do this for as many pairs of expressions and piece numbers as you send.

replace

Substitute one string for another. This function uses the s/// operator and returns the modified string. replace accepts a field name for "get". A little more convenient than calling "get" and applying s/// yourself.

replace takes three parameters...

The field to change
The regular expression to match against
The string to replace the match with

All instances of the matching pattern are replaced. For the patterns, you can use strings or regular expression references.
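
For example (a sketch; the Phone field name is hypothetical)...

  # Replace every dash in the "Phone" field with a space...
  $etl->replace( 'Phone', qr/-/, ' ' );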

Other

is_valid

This method returns true or false. True means that the pipeline is ready to go. False, of course, means that there's a problem. In a list context, is_valid returns the false value and an error message. On success, the error message is undef.
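
For example...

  # Scalar context - just the true/false flag...
  die 'The pipeline is not ready' unless $etl->is_valid;

  # List context - the flag plus an error message (undef on success)...
  my ($success, $error) = $etl->is_valid;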

ADVANCED TOPICS

Multiple input sources

It is not uncommon to receive your data spread across more than one file. How do you guarantee that each pipeline pulls files from the same working directory ("work_in")? You "chain" the pipelines together.

The "chain" method works like this...

  ETL::Pipeline->new( {
    work_in   => {search => 'C:\Data', find => qr/Ficticious/},
    input     => ['Excel', iname => 'main.xlsx'              ],
    mapping   => {Name => 'A', Address => 'B', ID => 'C'     },
    constants => {Type => 1, Information => 'Demographic'    },
    output    => ['SQL', table => 'NewData'                  ],
  } )->process->chain( {
    input     => ['Excel', iname => 'notes.xlsx'        ],
    mapping   => {User => 'A', Text => 'B', Date => 'C' },
    constants => {Type => 2, Information => 'Note'      },
    output    => ['SQL', table => 'OtherData'           ],
  } )->process;

When the first pipeline finishes, it creates a new object with the same "work_in". The code then calls "process" on the new object. The second pipeline copies "work_in" from the first pipeline.

Writing an input source

ETL::Pipeline provides some basic, generic input sources. Inevitably, you will come across data that doesn't fit one of these. No problem. ETL::Pipeline lets you create your own input sources.

An input source is a Moose class that implements the ETL::Pipeline::Input role. The role requires that you define the run method (see "run" in ETL::Pipeline::Input). ETL::Pipeline calls that method. Name your class ETL::Pipeline::Input::* and the "input" method can find it automatically.
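
A minimal skeleton might look like this. It is only a sketch - the class name and sample records are hypothetical, and the exact requirements of the role are documented in ETL::Pipeline::Input...

  package ETL::Pipeline::Input::MyFormat;
  use Moose;

  with 'ETL::Pipeline::Input';

  # ETL::Pipeline calls "run" once. Read your data here and send each
  # record back through the pipeline.
  sub run {
    my ($self, $etl) = @_;

    foreach my $raw ( {Name => 'John Doe'}, {Name => 'Jane Doe'} ) {
      $etl->record( $raw );
    }
  }

  __PACKAGE__->meta->make_immutable;

  1;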

See ETL::Pipeline::Input for more details.

Writing an output destination

ETL::Pipeline does not have any default output destinations. Output destinations are customized. You have something you want done with the data. And that something intimately ties into your specific business. You will have to write at least one output destination to do anything useful.

An output destination is a Moose class that implements the ETL::Pipeline::Output role. The role defines required methods. ETL::Pipeline calls those methods. Name your class ETL::Pipeline::Output::* and the "output" method can find it automatically.

See ETL::Pipeline::Output for more details.

Why are the inputs and outputs separate?

Wouldn't it make sense to have an input source for Excel and an output destination for Excel?

Input sources are generic. It takes the same code to read from one Excel file as another. Output destinations, on the other hand, are customized for your business - with data validation and business logic.

ETL::Pipeline assumes that you have multiple input sources. Different feeds use different formats. But output destinations will be much fewer. You're writing data into a centralized place.

For these reasons, it makes sense to keep the input sources and output destinations separate. You can easily add more inputs without affecting the outputs.

Custom logging

The default status method sends updates to STDOUT. If you want to add log files or integrate with a GUI, then subclass ETL::Pipeline and override the "status" method.
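
A minimal sketch of such a subclass, assuming you want to append messages to a log file (the class and log file names are hypothetical)...

  package Local::LoggingPipeline;
  use Moose;

  extends 'ETL::Pipeline';

  # Write each status message to a log file, then fall back on the
  # default behaviour.
  override status => sub {
    my ($self, $type, $message) = @_;

    open my $log, '>>', 'pipeline.log' or die "Cannot open the log: $!";
    print {$log} "$type: ", ($message // ''), "\n";
    close $log;

    super();
  };

  __PACKAGE__->meta->make_immutable;

  1;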

SEE ALSO

ETL::Pipeline::Input, ETL::Pipeline::Output

Input Source Formats

ETL::Pipeline::Input::Excel, ETL::Pipeline::Input::DelimitedText, ETL::Pipeline::Input::JsonFiles, ETL::Pipeline::Input::Xml, ETL::Pipeline::Input::XmlFiles

REPOSITORY

https://github.com/rbwohlfarth/ETL-Pipeline

AUTHOR

Robert Wohlfarth <robert.j.wohlfarth@vumc.org>

COPYRIGHT AND LICENSE

Copyright (c) 2021 Robert Wohlfarth

This module is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0. For more details, see the full text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.