NAME

ETL::Pipeline::Input::File - Role for file based input sources

SYNOPSIS

  # In the input source...
  use Moose;
  with 'ETL::Pipeline::Input';
  with 'ETL::Pipeline::Input::File';
  ...

  # In the ETL::Pipeline script...
  ETL::Pipeline->new( {
    work_in   => {root => 'C:\Data', iname => qr/Ficticious/},
    input     => ['Excel', iname => qr/\.xlsx?$/            ],
    mapping   => {Name => 'A', Address => 'B', ID => 'C'    },
    constants => {Type => 1, Information => 'Demographic'   },
    output    => ['SQL', table => 'NewData'                 ],
  } )->process;

  # Or with a specific file...
  ETL::Pipeline->new( {
    work_in   => {root => 'C:\Data', iname => qr/Ficticious/},
    input     => ['Excel', iname => 'ExportedData.xlsx'     ],
    mapping   => {Name => 'A', Address => 'B', ID => 'C'    },
    constants => {Type => 1, Information => 'Demographic'   },
    output    => ['SQL', table => 'NewData'                 ],
  } )->process;

DESCRIPTION

This role adds functionality and attributes common to all file-based input sources. It is a quick and easy way to create new sources that can search directories, which is useful when the file name changes.

ETL::Pipeline::Input::File works with a single source file. To process an entire directory of files, use ETL::Pipeline::Input::File::List instead.

METHODS & ATTRIBUTES

Arguments for "input" in ETL::Pipeline

ETL::Pipeline::Input::File accepts any of the tests provided by Path::Iterator::Rule. The value of the argument is passed directly into the test. For boolean tests (e.g. readable, exists, etc.), pass an undef value.

ETL::Pipeline::Input::File automatically applies the file filter. Do not pass file through "input" in ETL::Pipeline.

iname is the test I use most often. It matches the file name, accepts either a file glob or a regular expression, and is case insensitive.

  # Search using a regular expression...
  $etl->input( 'Excel', iname => qr/\.xlsx$/ );

  # Search using a file glob...
  $etl->input( 'Excel', iname => '*.xlsx' );

The code throws an error if no files match the criteria. Only the first match is used. If you want to match more than one file, use ETL::Pipeline::Input::File::List instead.
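The first-match behavior described above can be sketched with core modules only. This is a minimal illustration, not the role's actual implementation: first_matching_file is a hypothetical helper, it takes the pattern as a plain string, it looks in a single directory without recursing, and it sorts the names so that "first" is predictable. The real search order may differ.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper, not part of ETL::Pipeline. Returns the first file in
# $dir whose name matches $pattern (a regular expression given as a string,
# applied case insensitively). Dies if nothing matches.
sub first_matching_file {
    my ($dir, $pattern) = @_;

    opendir( my $dh, $dir ) or die "Cannot open directory $dir: $!\n";
    # Keep plain files only; sort so the choice of "first" is deterministic.
    my @names = sort grep { -f File::Spec->catfile( $dir, $_ ) } readdir $dh;
    closedir $dh;

    my $regex = qr/$pattern/i;
    for my $name (@names) {
        return File::Spec->catfile( $dir, $name ) if $name =~ $regex;
    }
    die "No files in $dir match $pattern\n";
}
```

Calling first_matching_file( $dir, '\.xlsx?$' ) returns one path even when several spreadsheets are present, and dies when there are none.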

path

Optional. When passed to "input" in ETL::Pipeline, this file becomes the input source. No search or matching is performed. If you specify a relative path, it is relative to "data_in".

Once the object has been created, this attribute holds the file that matched the search criteria. Your input source class should use it as the file name.

  # File inside of "data_in"...
  $etl->input( 'Excel', path => 'Data.xlsx' );

  # Absolute path name...
  $etl->input( 'Excel', path => 'C:\Data.xlsx' );

  # Inside the input source class...
  open my $io, '<', $self->path
    or die 'Unable to open ' . $self->path . ": $!";
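The rule for relative paths can be illustrated with the core File::Spec module. This is only a sketch of the rule as described above: resolve_input_path is a hypothetical helper, and $data_in stands in for whatever directory "data_in" holds.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper, not part of ETL::Pipeline. An absolute $path is
# returned as-is (apart from cleanup of redundant separators); a relative
# $path is joined onto $data_in.
sub resolve_input_path {
    my ($path, $data_in) = @_;
    return File::Spec->rel2abs( $path, $data_in );
}
```

File::Spec decides what counts as absolute using the rules of the current operating system, so a Windows style path such as 'C:\Data.xlsx' is only treated as absolute when the code runs on Windows.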

skipping

Optional. skipping jumps over a certain number of rows/lines at the beginning of the file. Report formats often contain extra headers - even before the column names. skipping ignores those and starts processing at the data.

Note: skipping is applied before reading column names.

skipping accepts either an integer or code reference. An integer represents the number of rows/records to ignore. For a code reference, the code discards records until the subroutine returns a true value.

  # Bypass the first three rows.
  $etl->input( 'Excel', skipping => 3 );

  # Bypass until we find something in column 'C'. hascontent comes from String::Util.
  $etl->input( 'Excel', skipping => sub { hascontent( $_->get( 'C' ) ) } );

The exact nature of the record depends on the input file. For example, an Excel file sends each data row as a hash, while a CSV file sends a single line of plain text with no parsing. See the documentation for the input source to find out exactly what it sends.

If your input source implements skipping, you can pass whatever parameters you want. For consistency, I recommend passing the raw data. If you are jumping over report headers, they may not be formatted.
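For an input source that implements skipping itself, the two forms can be handled along these lines. This is a minimal sketch under stated assumptions, not the code ETL::Pipeline uses: apply_skipping is a hypothetical helper, records are plain text lines, the record is made available both in $_ and as the first argument, and the record that satisfies the code reference is assumed to be kept rather than discarded.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of ETL::Pipeline. Drops leading records
# according to $skipping and returns the rest. $skipping is either a
# non-negative integer or a code reference.
sub apply_skipping {
    my ($skipping, @records) = @_;

    if ( ref $skipping eq 'CODE' ) {
        # Discard records until the callback returns a true value.
        # Assumption: the record that satisfies the callback is kept.
        while (@records) {
            local $_ = $records[0];
            last if $skipping->( $_ );
            shift @records;
        }
    }
    else {
        # Integer form: ignore that many records from the top, or all of
        # them when the input is shorter than the count.
        my $count = $skipping < @records ? $skipping : scalar @records;
        splice @records, 0, $count;
    }
    return @records;
}

# Skip three report header lines.
my @rows = apply_skipping( 3, 'Report', 'Run date', '', 'Name,ID', 'Ann,1' );

# Skip until a line contains a comma.
my @data = apply_skipping( sub { /,/ }, 'Report', '', 'Name,ID', 'Ann,1' );
```

Both calls leave the 'Name,ID' line as the first remaining record, which matches the note above that skipping happens before the column names are read.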

SEE ALSO

ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::File::List, Path::Iterator::Rule

AUTHOR

Robert Wohlfarth <robert.j.wohlfarth@vumc.org>

LICENSE

Copyright 2021 (c) Vanderbilt University Medical Center

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.