ETL::Pipeline::Input::File - Role for file based input sources
    # In the input source...
    use Moose;
    with 'ETL::Pipeline::Input';
    with 'ETL::Pipeline::Input::File';
    ...

    # In the ETL::Pipeline script...
    ETL::Pipeline->new( {
        work_in   => {root => 'C:\Data', iname => qr/Ficticious/},
        input     => ['Excel', iname => qr/\.xlsx?$/],
        mapping   => {Name => 'A', Address => 'B', ID => 'C'},
        constants => {Type => 1, Information => 'Demographic'},
        output    => ['SQL', table => 'NewData'],
    } )->process;

    # Or with a specific file...
    ETL::Pipeline->new( {
        work_in   => {root => 'C:\Data', iname => qr/Ficticious/},
        input     => ['Excel', iname => 'ExportedData.xlsx'],
        mapping   => {Name => 'A', Address => 'B', ID => 'C'},
        constants => {Type => 1, Information => 'Demographic'},
        output    => ['SQL', table => 'NewData'],
    } )->process;
This role adds functionality and attributes common to all file-based input sources. It is a quick and easy way to create new sources that can search directories, which is useful when the file name changes.
ETL::Pipeline::Input::File works with a single source file. To process an entire directory of files, use ETL::Pipeline::Input::File::List instead.
ETL::Pipeline::Input::File accepts any of the tests provided by Path::Iterator::Rule. The value of the argument is passed directly into the test. For boolean tests (e.g. readable, exists, etc.), pass an undef value.
ETL::Pipeline::Input::File automatically applies the file filter. Do not pass file through "input" in ETL::Pipeline.
iname is the most common one that I use. It matches the file name, supports wildcards and regular expressions, and is case insensitive.
    # Search using a regular expression...
    $etl->input( 'Excel', iname => qr/\.xlsx$/ );

    # Search using a file glob...
    $etl->input( 'Excel', iname => '*.xlsx' );
The code throws an error if no files match the criteria. Only the first match is used. If you want to match more than one file, use ETL::Pipeline::Input::File::List instead.
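The search behavior described above (case-insensitive name match, first match wins, error when nothing matches) can be approximated with core modules only. This is a minimal sketch, not the role's actual implementation, which delegates to Path::Iterator::Rule. The helper name first_matching_file is hypothetical, the glob support covers only '*' and '?', and sorting the matches to pick the "first" one is an assumption made for repeatable output.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Find ();
use File::Temp qw(tempdir);
use re qw(regexp_pattern);

# Hypothetical helper, not part of ETL::Pipeline: return the first file under
# $root whose name matches $pattern without regard to case. Dies if no file
# matches.
sub first_matching_file {
    my ($root, $pattern) = @_;

    my $regex;
    if (ref $pattern eq 'Regexp') {
        # Rebuild the compiled regex with the "i" flag added.
        my ($source, $flags) = regexp_pattern($pattern);
        $flags .= 'i' unless $flags =~ /i/;
        my $text = '(?' . $flags . ':' . $source . ')';
        $regex = qr/$text/;
    }
    else {
        # Simplified glob: "*" is any run of characters, "?" is one character.
        my $source = join '', map {
            $_ eq '*' ? '.*' : $_ eq '?' ? '.' : quotemeta($_)
        } split //, $pattern;
        $regex = qr/\A$source\z/i;
    }

    my @matches;
    File::Find::find(
        {
            no_chdir => 1,
            wanted   => sub {
                push @matches, $File::Find::name
                    if -f $_ && basename($_) =~ $regex;
            },
        },
        $root,
    );

    die "No files under $root match $pattern\n" unless @matches;

    # Assumption: sort so that "first" is predictable in this sketch.
    return (sort @matches)[0];
}

# Usage: build a scratch directory and search it both ways.
my $dir = tempdir(CLEANUP => 1);
for my $name ('ExportedData.XLSX', 'notes.txt') {
    open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
    close $fh;
}
print basename(first_matching_file($dir, qr/\.xlsx$/)), "\n";
print basename(first_matching_file($dir, '*.xlsx')), "\n";
```

Both calls find the same file even though its extension is upper case, which is the behavior the text attributes to iname.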
Optional. When path is passed to "input" in ETL::Pipeline, that file becomes the input source. No search or matching is performed. If you specify a relative path, it is relative to "data_in".
Once the object has been created, this attribute holds the file that matched the search criteria. Your input source class should use it as the file name.
    # File inside of "data_in"...
    $etl->input( 'Excel', path => 'Data.xlsx' );

    # Absolute path name...
    $etl->input( 'Excel', path => 'C:\Data.xlsx' );

    # Inside the input source class...
    open my $io, '<', $self->path;
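The relative-versus-absolute rule for path can be illustrated with File::Spec. This is only a sketch of the rule as stated in the text. The helper resolve_path and the directory names are hypothetical, and the role itself may resolve paths differently.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: absolute paths are used as given, relative paths are
# placed under $data_in, which stands in for the pipeline's "data_in".
sub resolve_path {
    my ($path, $data_in) = @_;
    return File::Spec->file_name_is_absolute($path)
        ? $path
        : File::Spec->catfile($data_in, $path);
}

# Usage with made-up directories.
my $data_in = File::Spec->catdir(File::Spec->rootdir, 'srv', 'data');
my $elsewhere = File::Spec->catfile(File::Spec->rootdir, 'tmp', 'Data.xlsx');
print resolve_path('Data.xlsx', $data_in), "\n";
print resolve_path($elsewhere, $data_in), "\n";
```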
Optional. skipping jumps over a certain number of rows/lines at the beginning of the file. Report formats often contain extra headers - even before the column names. skipping ignores those and starts processing at the data.
Note: skipping is applied before reading column names.
skipping accepts either an integer or a code reference. An integer represents the number of rows/records to ignore. For a code reference, the code discards records until the subroutine returns a true value.
    # Bypass the first three rows.
    $etl->input( 'Excel', skipping => 3 );

    # Bypass until we find something in column 'C'.
    $etl->input( 'Excel', skipping => sub { hascontent( $_->get( 'C' ) ) } );
The exact nature of the record depends on the input file. For example, Excel files will send a data row as a hash. But a CSV file would send a single line of plain text with no parsing. See the input source to find out exactly what it sends.
If your input source implements skipping, you can pass whatever parameters you want. For consistency, I recommend passing the raw data. If you are jumping over report headers, they may not be formatted.
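For authors of input sources, the two forms of skipping might be handled along these lines. This is a sketch under stated assumptions, not the module's code: apply_skipping is a hypothetical helper, records are raw text lines, each record is placed in $_ and also passed as the first argument, and the record that makes the callback return true is kept as the first remaining row.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a "skipping" value to a list of raw records and
# return the records that remain.
sub apply_skipping {
    my ($skipping, @records) = @_;

    if (ref $skipping eq 'CODE') {
        # Discard records until the callback returns a true value.
        # Assumption: the record that satisfies the callback is kept.
        while (@records) {
            local $_ = $records[0];
            last if $skipping->($_);
            shift @records;
        }
    }
    elsif ($skipping) {
        # Integer: drop that many records from the front.
        splice @records, 0, $skipping;
    }

    return @records;
}

# Usage: two report header lines precede the column names.
my @report = (
    'Monthly Report',
    'Generated for demonstration',
    'Name,Address,ID',
    'Smith,1 Main St,42',
);
print join('|', apply_skipping(2, @report)), "\n";
print join('|', apply_skipping(sub { /,/ }, @report)), "\n";
```

In both calls the first remaining record is the column-name row, consistent with the note that skipping is applied before column names are read.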
ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::File::List, Path::Iterator::Rule
Robert Wohlfarth <robert.j.wohlfarth@vumc.org>
Copyright 2021 (c) Vanderbilt University Medical Center
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.