ETL::Pipeline::Input - Role for ETL::Pipeline input sources
use Moose;
with 'ETL::Pipeline::Input';

sub run {
    # Add code to read your data here.
    ...
}
An input source feeds the extract part of ETL. This is where data comes from. These are your data sources.
A data source may be anything - a file, a database, or maybe a socket. Each format is an ETL::Pipeline input source. For example, Excel files represent one input source. Perl reads every Excel file the same way. With a few judicious attributes, we can re-use the same input source for just about any type of Excel file.
ETL::Pipeline defines an input source as a Moose object with at least one method - run. This role basically defines the requirement for the run method. It should be consumed by all input source classes. ETL::Pipeline relies on the input source having this role.
To write your own input source, create a Moose class in the ETL::Pipeline::Input namespace, consume this role, and define a run method:

use Moose;
with 'ETL::Pipeline::Input';
The new source is ready to use, like this...
$etl->input( 'YourNewSource' );
You can leave off the leading ETL::Pipeline::Input::.
When ETL::Pipeline calls "run", it passes the ETL::Pipeline object as the only parameter.
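That calling convention can be sketched without Moose. Everything here - My::Source and the blessed stand-in pipeline - is hypothetical scaffolding so the example runs on its own; a real source consumes the ETL::Pipeline::Input role.

```perl
use strict;
use warnings;

# "My::Source" is a hypothetical input source, skipping Moose so the
# example stays self-contained.
package My::Source;
sub new { bless {}, shift }
sub run {
    my ($self, $etl) = @_;    # the ETL::Pipeline object is the only argument
    return ref $etl;
}

package main;
my $fake_pipeline = bless {}, 'ETL::Pipeline';   # stand-in, not a real pipeline
print My::Source->new->run($fake_pipeline), "\n";   # prints "ETL::Pipeline"
```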
Input sources mostly follow the basic algorithm of open, read, process, and close. I originally had the role define methods for each of these steps. That was a lot of work, and kind of confusing. This way, the input source only needs one code block that does all of these steps - in one place. So it's easier to troubleshoot and write new sources.
In the work that I do, we have one output destination that rarely changes. It's far more common to write new input sources - especially customized sources. Making new sources easier saves time. Making it simpler means that more developers can pick up those tasks.
ETL::Pipeline::Input is not limited to files. It works for any source of data, such as SQL queries, CSV files, or network sockets. Tailor the run method for whatever suits your needs.
Because files are most common, ETL::Pipeline comes with a helpful role - ETL::Pipeline::Input::File. Consume ETL::Pipeline::Input::File in your input source to access some standardized attributes.
ETL::Pipeline version 3 is not compatible with input sources from older versions. You will need to rewrite your custom input sources.
The per-step methods from older versions - setup, finish, and next_record - no longer exist. That logic now lives inside your run method.
If you define this, the standard logging will include it. The attribute is named for file inputs. But it can return any value that is meaningful to your users.
If you define this, the standard logging includes it with error or informational messages. It can be any value that helps users locate the correct place to troubleshoot.
You define this method in the consuming class. It should open the file, read each record, call "record" in ETL::Pipeline after each record, and close the file. This method is the workhorse. It defines the main ETL loop. "record" in ETL::Pipeline acts as a callback.
I say file. It really means input source - whatever that might be.
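Putting those steps together, here is a minimal sketch of a run method for a hypothetical line-oriented source. FakePipeline stands in for the real ETL::Pipeline object and only collects what "record" receives.

```perl
use strict;
use warnings;

# Stand-in so the sketch runs on its own. A real pipeline does far more
# than collect rows; this one only records whatever "record" receives.
package FakePipeline;
sub new    { bless { records => [] }, shift }
sub record { my ($self, $row) = @_; push @{ $self->{records} }, $row }

package main;

# The main ETL loop for a hypothetical line-oriented source:
# open, read each record, hand it to the pipeline, then close.
sub run {
    my ($self, $etl) = @_;
    open my $fh, '<', $self->{path} or die "Cannot open '$self->{path}': $!";
    while (my $line = <$fh>) {
        chomp $line;
        $etl->record( { raw => $line } );   # callback into ETL::Pipeline
    }
    close $fh;
}
```

A real source consumes the role with Moose and reads its file name from an attribute rather than a plain hash key.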
Some important things to remember about run...
If your code encounters an error, run can call "status" in ETL::Pipeline with the error message. "status" in ETL::Pipeline automatically includes the record count with the error message. You should add any other troubleshooting information, such as file names or key fields.
$etl->status( "ERROR", "Error message here for id $id" );
For fatal errors, I recommend using the croak function from Carp.
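A sketch of both styles of error handling. The check_record helper is hypothetical - it is not part of the role - and a real call would go through a real ETL::Pipeline object.

```perl
use strict;
use warnings;
use Carp;

# Recoverable problem: report it through "status" and skip the record.
# ETL::Pipeline adds the record count to the message automatically.
sub check_record {
    my ($etl, $record) = @_;
    unless (defined $record->{id}) {
        $etl->status( 'ERROR', 'Missing id field' );
        return 0;
    }
    return 1;
}

# Fatal problem: croak so the error is reported from the caller's
# point of view, not from inside this module.
sub open_or_die {
    my ($path) = @_;
    open my $fh, '<', $path or croak "Cannot open input file '$path': $!";
    return $fh;
}
```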
The location in the input source of the current record. For example, for files this would be the file name and character position. The consuming class can set this value in its run method.
Logging uses this when displaying errors or informational messages. The value should be something that helps the user troubleshoot issues. It can be whatever is appropriate for the input source.
NOTE: Don't capitalize the first letter unless it's supposed to be capitalized. Logging will uppercase the first letter when appropriate.
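Updating position inside the read loop might look like this. The read_lines helper and the plain hash standing in for the source object are both hypothetical; a real source gets the attribute from this role.

```perl
use strict;
use warnings;

# Sketch: update "position" before handing each record to the pipeline,
# so later error messages can say where the problem happened. $self is a
# plain hash here in place of the real Moose object.
sub read_lines {
    my ($self, $etl, @lines) = @_;
    my $count = 0;
    for my $line (@lines) {
        $count++;
        $self->{position} = "line $count";  # lower case, per the note above
        $etl->record( { raw => $line } );
    }
}
```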
ETL::Pipeline, ETL::Pipeline::Input::File, ETL::Pipeline::Output
Robert Wohlfarth <robert.j.wohlfarth@vumc.org>
Copyright 2021 (c) Vanderbilt University Medical Center
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install ETL::Pipeline, copy and paste the appropriate command into your terminal.
cpanm
cpanm ETL::Pipeline
CPAN shell
perl -MCPAN -e shell
install ETL::Pipeline
For more information on module installation, please visit the detailed CPAN module installation guide.