
NAME

File::Process - process text files with custom handlers

SYNOPSIS

 use File::Process;

 my ($lines, $info) = process_file(
   $file,
   process => sub {
     my ($fh, $lines, $args, $line) = @_;
     return uc $line;
   }
 );

DESCRIPTION

Many scripts need to process one or more text files. The boiler-plate usually looks something like:

 open my $fh, '<', $file
    or croak "blah blah blah...\n";

 while (<$fh>) {
   # do something...
 }

 close $fh or
    croak "blah blah blah...\n";

The do something... part often involves other common operations like removing newlines, skipping blank lines, etc. It gets tedious when you have to write the same template for processing different files in a script.

This class provides a simple harness for processing files, taking the drudgery out of writing a simple text processor. It is most effective when used on relatively small files (see "CAVEATS").

In its most basic form the class will return all of the lines in a text file. The class exports one method (process_file), which invokes multiple subroutines that you can override or use in conjunction with your custom processors.

See File::Process::Utils for additional recipes.

EXPORTED METHODS

This module exports one method by default (process_file), since presumably you wanted to process a file? You can export all of the default processor methods using the tag ':all'.

 use File::Process qw( pre post );

 use File::Process qw( :all );

METHODS AND SUBROUTINES

process_csv

 process_csv(file, options)

See File::Process::Utils

process_file

 process_file(file, options)

You start the processing of the file by calling process_file with the name of the file or a handle to an open file and a list of options. Note that the processors will pass and receive a reference to this list of options during the processing of the file.

The method returns a list containing a reference to an array that contains each line of the file followed by the list of elements in the hash that was originally passed to it (along with any other data your custom method has inserted into it).

 my ($lines, %options) = process_file("foo.txt", chomp => 1);

file

Path to the file to be processed or a handle to an open file.

options

A list of options. You can send whatever options your custom processor supports. Before the default or your custom process subroutine is called, the filter subroutine is called. This is where you might massage the input in some way. The default filter subroutine supports various options to perform routine tasks. Options are described below.

skip_blank_lines

Skip blank lines. A blank line is considered a line containing only a newline character.

skip_comments

Set skip_comments to a true value to skip lines that begin with '#'.

merge_lines

Merges lines together rather than creating an array of lines. Typically used with the chomp option. When merge_lines is set to a true value, IO::Scalar is used to efficiently create a single scalar from all of the lines in the file. The first element of the return list is then a scalar instead of an array reference.

chomp

Set chomp to a true value to remove a trailing newline.

trim

Set trim to one of 'front', 'back', or 'both' to remove whitespace from the front, back, or both ends of a line. Note that this operation is performed before your custom processor is called and may result in the line being skipped if the skip_blank_lines option is set to a true value.
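
For example, a typical invocation that combines several of these filter options might look like this (a minimal sketch; the file name is a placeholder):

 # drop trailing newlines, trim whitespace, skip blank and comment lines
 my ($lines) = process_file(
   'foo.txt',
   chomp            => 1,
   trim             => 'both',
   skip_blank_lines => 1,
   skip_comments    => 1,
 );

 # ...or merge all of the lines into a single scalar
 my ($content) = process_file('foo.txt', merge_lines => 1);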

Custom Processors

process_file will execute a set of subroutines for each line of the file. You can replace any of these subroutines to inject your own custom behaviors. They are executed in this order:

1. pre
2. next_line
3. filter
4. process
5. post
  • You can terminate processing of the file by returning an undefined value from the next_line() hook (see the sketch after this list)

  • Returning undef from the filter() or process() hooks will prevent a line from being accumulated in the line buffer

  • Any hook you define can terminate the process by throwing an exception.
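
For example, a custom next_line hook can stop reading after a fixed number of lines (a minimal sketch; the limit of 10 is arbitrary):

 my ($lines) = process_file(
   'foo.txt',
   next_line => sub {
     my ($fh, $lines, $args) = @_;

     # returning undef halts further processing
     return undef if @{$lines} >= 10;

     return scalar <$fh>;
   }
 );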

The default processors are described below.

pre(file, options)

The default pre processor opens the file and returns a file handle and a reference to an array that will be used to store the lines. If you provide your own pre processor, it should also return a tuple containing the file handle and a reference to an array that will be used to store each processed line of the file. Note that you don't have to adhere to this contract if your downstream processors don't require the same returns.

file

Path to a file that can be opened for reading or a handle to an open file.

options

A reference to a hash that contains the options passed from process_file. The hash will be passed to the process method, so it can be used to store data as you process each line. The default process method will record counts of lines processed and other potentially useful statistics.
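
For example, a custom pre hook might call the default pre to open the file and then discard a header line before the rest of the file is processed (a minimal sketch; data.txt and its single header line are assumptions):

 use File::Process qw(:all);

 my ($lines) = process_file(
   'data.txt',
   pre => sub {
     my ($file, $args) = @_;

     # let the default pre open the file and create the line buffer
     my ($fh, $all_lines) = pre($file, $args);

     my $header = <$fh>;  # read and discard the header line

     return ($fh, $all_lines);
   }
 );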

next_line(fh, lines, options)

The next_line method is passed the file handle, the buffer of accumulated lines, and a reference to the hash of options passed to process_file. It is expected to return the next line of the file; however, your custom processor can return anything it likes. The returned object will be sent to the process subroutine for possible further processing.

Returning undef will halt further processing.

filter(fh, lines, options, current_line)

The default filter method will perform various tasks (chomp, trim, skip) controlled by the options described above.

If the chomp option is set to true when you call process_file, the line will be chomped. You can also set skip_blank_lines or skip_comments to skip blank lines or lines that begin with '#'.

Filters should return the line or undef to skip the current line. If you really want to add undef to your buffer, do so in your filter:

 push @{$lines}, undef;

If you want to halt processing here, die in your filter. Any exception will halt further processing.
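
For example, a filter that keeps only lines matching a pattern (a minimal sketch; the file name and pattern are arbitrary):

 my ($lines) = process_file(
   'access.log',
   filter => sub {
     my ($fh, $lines, $args, $line) = @_;

     # return undef to skip anything that is not an error line
     return $line =~ /ERROR/ ? $line : undef;
   }
 );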

process(fh, lines, options, current_line)

The process method is passed the file handle, the buffer of accumulated lines, a reference to a hash of options passed to process_file and the next line of the text file. The default processor simply returns the current_line value.

fh

Handle to an open file or an object that supports an IO::Handle-like interface.

lines

A reference to an array that contains the lines read thus far.

options

A reference to a hash of options passed to process_file.

current_line

The next line of data from the file.
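
Because the options hash is returned by process_file, a custom process hook can use it to accumulate statistics of your own (a minimal sketch; word_count is a made-up key):

 my ($lines, %info) = process_file(
   'foo.txt',
   chomp   => 1,
   process => sub {
     my ($fh, $lines, $args, $line) = @_;

     # stash custom data in the options hash
     my @words = split /\s+/, $line;
     $args->{word_count} += @words;

     return $line;
   }
 );

 print "words: $info{word_count}\n";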

post(fh, lines, options)

The post method is passed the same first three arguments that are passed to process. The default post method closes the file and records the end time of processing. The default post method returns an array reference to the buffer of lines and the list of options. Note that a reference to the options hash is passed in but a list is returned. This is also the return value of process_file. Your custom post can return anything it wants.

DEFAULT PROCESSORS

Any of the default processors (pre, next_line, filter, process, post) can be called before or after your custom processors. Pass these methods the same list you receive.

  process_file(
    "foo.txt",
    post  => sub {
      my @retval = post(@_);
      $retval[0] = join '', @{ $_[1] };
      return @retval;
    }
  );

STATISTICS

The default processors record some potentially useful statistics in the options hash returned by process_file:

start_time
end_time
raw_count
skipped
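
A minimal sketch of reading them back, assuming raw_count is the number of lines read, skipped is the number of lines filtered out, and the times are numeric timestamps:

 my ($lines, %info) = process_file('foo.txt', skip_blank_lines => 1);

 printf "read %d lines (%d skipped) in %.2fs\n",
   $info{raw_count}, $info{skipped}, $info{end_time} - $info{start_time};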

EXAMPLES

  • Return all of the lines in a text file

     my ($lines) = process_file('foo.txt');
  • Read JSON file

      print Dumper(
        process_file(
          $fh,
          chomp => 1,
          post  => sub {
            post(@_);
            return decode_json( join '', @{ $_[1] } );
          }
        )
      );

    ...or

      print Dumper(
        decode_json(
          process_file(
            $fh,
            chomp       => 1,
            merge_lines => 1
          )
        )
      );
  • Read CSV file

    Presented here as an example; however, you can use the "process_csv" method for processing CSV files.

     use File::Process qw(pre process_file);
     use Text::CSV_XS;
     use Data::Dumper;

     my $csv = Text::CSV_XS->new;

     my $file = shift;

     my ($csv_lines) = process_file(
       $file,
       csv         => $csv,
       chomp       => 1,
       has_headers => 1,
       pre         => sub {
         my ( $file, $args ) = @_;

         my ( $fh, $all_lines ) = pre( $file, $args );

         if ( $args->{has_headers} ) {
           # getline returns a reference to an array of fields
           my $column_names = $args->{csv}->getline($fh);
           $args->{csv}->column_names( @{$column_names} );
         }

         return ( $fh, $all_lines );
       },
       next_line => sub {
         my ( $fh, $all_lines, $args ) = @_;

         return $args->{csv}->getline_hr($fh);
       }
     );

     print Dumper($csv_lines);

CAVEATS

Processing each line using hooks and callbacks can introduce inefficiencies in file processing. This class is meant to be used on moderately sized files. In its basic form, process_file reads all lines into memory as it iterates over the file. Your processing may not require that lines be accumulated at all. Your custom process() or filter() hook can choose to return an undefined value, which prevents a line from being added to the buffer.
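
For example, to scan a large file without accumulating any lines (a minimal sketch; matches is a made-up option key):

 my ($lines, %info) = process_file(
   'huge.log',
   matches => 0,
   process => sub {
     my ($fh, $lines, $args, $line) = @_;

     $args->{matches}++ if $line =~ /ERROR/;

     return;  # undef: the line is not added to the buffer
   }
 );

 print "found $info{matches} matching line(s)\n";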

Reading lines one at a time may be inefficient as well; a future version may introduce a slurp mode and/or the ability to pass an array representing a list of lines to process.

Some example times:

Timings were done on a Linux system running an 11th Gen Intel(R) Core(TM) i7-1160G7 @ 1.20GHz (8 threads, 4.40GHz).

As a baseline:

  • Reading ~900K rows (pure Perl)

     .22s
  • Slurping ~900K rows (pure Perl)

     .03s

Using File::Process::process_file()

  • Reading ~900K rows (no processing)

     ~1.6s
  • Reading ~900K rows from a CSV file with 5 columns

    • Returning an array of hashes:

       7-8s
    • Returning an array of arrays:

       ~10s

...so there's room for improving the speed of these calls...caveat emptor.

LICENSE

This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.

SEE ALSO

File::Process::Utils

AUTHOR

Rob Lauer - <rlauer6@comcast.net>