NAME

File::ByLine - Line-by-line file access loops

VERSION

version 1.181981

SYNOPSIS

  use File::ByLine;
  use feature 'say';    # the examples below use say()

  #
  # Procedural Interface (Simple!)
  #

  # Execute a routine for each line of a file
  dolines { say "Line: $_" } "file.txt";
  forlines "file.txt", sub { say "Line: $_" };

  # Grep (match) lines of a file
  my (@result) = greplines { m/foo/ } "file.txt";

  # Apply a function to each line and return result
  my (@result) = maplines { lc($_) } "file.txt";

  # Parallelized forlines/dolines routines
  # (Note: Requires Parallel::WorkUnit to be installed)
  parallel_dolines { foo($_) } "file.txt", 10;
  parallel_forlines "file.txt", 10, sub { foo($_); };

  # Parallelized maplines and greplines
  my (@result) = parallel_greplines { m/foo/ } "file.txt", 10;
  my (@result) = parallel_maplines  { lc($_) } "file.txt", 10;

  # Read an entire file, split into lines
  my (@result) = readlines "file.txt";


  #
  # Object Oriented Interface
  #

  # Execute a routine for each line of a file
  my $byline = File::ByLine->new();
  $byline->do( sub { say "Line: $_" }, "file.txt");

  # Grep (match) lines of a file
  my $byline = File::ByLine->new();
  my (@result) = $byline->grep( sub { m/foo/ }, "file.txt");

  # Apply a function to each line and return result
  my $byline = File::ByLine->new();
  my (@result) = $byline->map( sub { lc($_) }, "file.txt");

  # Parallelized routines
  # (Note: Requires Parallel::WorkUnit to be installed)
  my $byline = File::ByLine->new();
  $byline->processes(10);
  $byline->do( sub { foo($_) }, "file.txt");
  my (@grep_result) = $byline->grep( sub { m/foo/ }, "file.txt");
  my (@map_result)  = $byline->map( sub { lc($_) }, "file.txt");

  # Skip the header line
  my $byline = File::ByLine->new();
  $byline->header_skip(1);
  $byline->do( sub { foo($_) }, "file.txt");
  my (@grep_result) = $byline->grep( sub { m/foo/ }, "file.txt");
  my (@map_result)  = $byline->map( sub { lc($_) }, "file.txt");

  # Process the header line
  my $byline = File::ByLine->new();
  $byline->header_handler( sub { say $_; } );
  $byline->do( sub { foo($_) }, "file.txt");
  my (@grep_result) = $byline->grep( sub { m/foo/ }, "file.txt");
  my (@map_result)  = $byline->map( sub { lc($_) }, "file.txt");

  # Read an entire file, split into lines
  my (@result) = readlines "file.txt";

  # Alternative way of specifying filenames
  my $byline = File::ByLine->new();
  $byline->file("file.txt");
  $byline->do( sub { foo($_) } );
  my (@grep_result) = $byline->grep( sub { m/foo/ } );
  my (@map_result)  = $byline->map( sub { lc($_) } );

DESCRIPTION

I found myself writing the same trivial loops to read files, or relying on modules like Perl6::Slurp that didn't quite do what I needed (abstracting the loop). It was clear that something easy, simple, and sufficiently Perl-ish was needed.

FUNCTIONS

dolines

  dolines { say "Line: $_" } "file.txt";
  dolines \&func, "file.txt";

This function calls a coderef once for each line in the file. The file is read line by line; the newline character(s) are removed from each line, and the coderef is then executed.

Each line (without newline) is passed to the coderef as its only parameter. It is also placed into $_.

This function returns the number of lines in the file.

This is similar to forlines(), except for the order of arguments. The author recommends this form for short code blocks, i.e., a coderef that fits on one line. For longer, multi-line code blocks, the author recommends the forlines() syntax.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.
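As a rough illustration of the arrayref form, the sketch below writes two small scratch files (the file names and contents are made up for this example) and walks them with a single dolines call. The first file deliberately lacks a trailing newline, so its last line should still arrive as a separate line.

```perl
use strict;
use warnings;
use File::ByLine;
use File::Temp qw(tempdir);

# Scratch files for the example; names and contents are illustrative only.
my $dir   = tempdir( CLEANUP => 1 );
my @files = ( "$dir/a.txt", "$dir/b.txt" );

open my $fh_a, '>', $files[0] or die "Cannot write $files[0]: $!";
print {$fh_a} "one\ntwo";    # no trailing newline on purpose
close $fh_a;

open my $fh_b, '>', $files[1] or die "Cannot write $files[1]: $!";
print {$fh_b} "three\n";
close $fh_b;

# The files are read in turn, as if they were one file.
my @seen;
my $count = dolines { push @seen, $_ } \@files;

print "Processed $count lines: @seen\n";
```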

forlines

  forlines "file.txt", sub { say "Line: $_" };
  forlines "file.txt", \&func;

This function calls a coderef once for each line in the file. The file is read line by line; the newline character(s) are removed from each line, and the coderef is then executed.

Each line (without newline) is passed to the coderef as its only parameter. It is also placed into $_.

This function returns the number of lines in the file.

This is similar to dolines(), except for order of arguments. The author recommends this when using longer, multi-line code blocks, even though it is not orthogonal with the maplines()/greplines() routines.

parallel_dolines

  my (@result) = parallel_dolines { foo($_) } "file.txt", 10;

Requires Parallel::WorkUnit to be installed.

Three parameters are required: a coderef, a filename, and the number of simultaneous child processes to use.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

This function performs similarly to dolines(), except that it does its operations in parallel using fork() and Parallel::WorkUnit. Because the code in the coderef is executed in a child process, any changes it makes to variables in outer scopes will not be visible outside that single child. In general, it is safest not to modify anything that belongs outside the coderef's scope.
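The isolation described here is a property of fork() itself. The following minimal sketch uses plain fork() rather than File::ByLine or Parallel::WorkUnit, and the variable name is made up for illustration; it shows that an assignment made in a child never reaches the parent.

```perl
use strict;
use warnings;

my $counter = 0;

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child: this changes only the child's private copy of $counter.
    $counter = 42;
    exit 0;
}

# Parent: wait for the child to finish, then inspect our own copy.
waitpid( $pid, 0 );

print "Parent still sees counter = $counter\n";    # prints 0, not 42
```

The same applies to a coderef passed to the parallel_* functions: anything it needs to hand back has to travel through a return value (as with parallel_maplines or parallel_greplines) or through some external store such as a file.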

Note that the file will be read in several chunks, with each chunk being processed in a different child process. This means that the child processes may be operating on very different sections of the file simultaneously, and no specific order of execution of the coderef should be expected!

Because of the mechanism used to split the file into chunks for processing, each child process may process a somewhat different number of lines. This is particularly true if there is a mix of very long and very short lines. The splitting routine splits the file into roughly equal-size chunks by byte count, not line count.
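To make the effect of splitting by byte count concrete, here is one possible way such a split could be done. This is a sketch only: chunk_ranges is a hypothetical helper written for this example, not File::ByLine's actual implementation. It picks evenly spaced byte offsets and moves each one forward to the next line boundary.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of File::ByLine): split a file into up to
# $chunks byte ranges of roughly equal size, each starting on a line boundary.
sub chunk_ranges {
    my ( $path, $chunks ) = @_;
    my $size = ( -s $path ) || 0;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my @starts = (0);
    for my $i ( 1 .. $chunks - 1 ) {
        # Jump to the approximate byte offset, then finish the line we are in.
        seek $fh, int( $size * $i / $chunks ), 0 or die "Cannot seek: $!";
        my $partial = <$fh>;
        my $pos     = tell $fh;
        push @starts, $pos if $pos > $starts[-1] && $pos < $size;
    }
    close $fh;

    # Pair every start offset with the next start (or the end of the file).
    my @ends = ( @starts[ 1 .. $#starts ], $size );
    return map { [ $starts[$_], $ends[$_] ] } 0 .. $#starts;
}

# One long line followed by four short ones (59 bytes in total).
my ( $out, $path ) = tempfile( UNLINK => 1 );
print {$out} 'x' x 50, "\n", "a\n", "b\n", "c\n", "d\n";
close $out;

for my $range ( chunk_ranges( $path, 2 ) ) {
    print "bytes $range->[0] to $range->[1]\n";
}
```

With this input the first range covers only the single long line, while the second covers all four short lines, which is the kind of imbalance in line counts the paragraph above warns about.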

Otherwise, this function is identical to dolines(). See the documentation for dolines() or forlines() for information about how this might differ from parallel_forlines().

parallel_forlines

  my (@result) = parallel_forlines "file.txt", 10, sub { foo($_) };

Requires Parallel::WorkUnit to be installed.

Three parameters are required: a filename, the number of simultaneous child processes to use, and a coderef.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

This function performs similarly to forlines(), except that it does its operations in parallel using fork() and Parallel::WorkUnit. Because the code in the coderef is executed in a child process, any changes it makes to variables in outer scopes will not be visible outside that single child. In general, it is safest not to modify anything that belongs outside the coderef's scope.

Note that the file will be read in several chunks, with each chunk being processed in a different child process. This means that the child processes may be operating on very different sections of the file simultaneously, and no specific order of execution of the coderef should be expected!

Because of the mechanism used to split the file into chunks for processing, each child process may process a somewhat different number of lines. This is particularly true if there is a mix of very long and very short lines. The splitting routine splits the file into roughly equal-size chunks by byte count, not line count.

Otherwise, this function is identical to forlines(). See the documentation for forlines() or dolines() for information about how this might differ from parallel_dolines().

greplines

  my (@result) = greplines { m/foo/ } "file.txt";

This function calls a coderef once for each line in the file and, based on the return value of that coderef, returns only the lines where the coderef evaluates to true. This is similar to the grep built-in function, except that it operates on file input rather than array input.

Each line (without newline) is passed to the coderef as its only parameter. It is also placed into $_.

This function returns the lines for which the coderef evaluates as true.

parallel_greplines

  my (@result) = parallel_greplines { m/foo/ } "file.txt", 10;

Requires Parallel::WorkUnit to be installed.

Three parameters are required: a coderef, a filename, and the number of simultaneous child processes to use.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

This function performs similarly to greplines(), except that it does its operations in parallel using fork() and Parallel::WorkUnit. Because the code in the coderef is executed in a child process, any changes it makes to variables in outer scopes will not be visible outside that single child. In general, it is safest not to modify anything that belongs outside the coderef's scope.

If a large amount of data is returned, the overhead of passing the data from the child processes to the parent may exceed the benefit of parallelization. If there is substantial line-by-line processing, there will likely be a speedup, but trivial loops will not speed up.

Note that the file will be read in several chunks, with each chunk being processed in a different child process. This means that the child processes may be operating on very different sections of the file simultaneously, and no specific order of execution of the coderef should be expected! However, the results will be returned in the same order as greplines() would return them.

Because of the mechanism used to split the file into chunks for processing, each child process may process a somewhat different number of lines. This is particularly true if there is a mix of very long and very short lines. The splitting routine splits the file into roughly equal-size chunks by byte count, not line count.

Otherwise, this function is identical to greplines().

maplines

  my (@result) = maplines { lc($_) } "file.txt";

This function calls a coderef once for each line in the file and returns an array of the return values from those calls. This follows normal Perl rules: if the coderef returns a list, all elements of that list are added as distinct elements to the return value array. If the coderef returns an empty list, no elements are added.

Each line (without newline) is passed to the coderef as its only parameter. It is also placed into $_.

This is meant to be similar to the built-in map function.

This function returns the list of all values returned by the coderef, in the order the lines appear in the file.
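The list-flattening rule can be seen in a small sketch. The scratch file and its contents are made up for this example: one line yields two values, the blank line yields none, and the last line yields one.

```perl
use strict;
use warnings;
use File::ByLine;
use File::Temp qw(tempfile);

# Scratch input for the example; the contents are illustrative only.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "alpha beta\n\ngamma\n";
close $fh;

# Each call may return zero, one, or many values; all are flattened
# into a single result list.
my @words = maplines { split ' ', $_ } $path;

print scalar(@words), " words: @words\n";
```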

parallel_maplines

  my (@result) = parallel_maplines { lc($_) } "file.txt", 10;

Requires Parallel::WorkUnit to be installed.

Three parameters are required: a coderef, a filename, and the number of simultaneous child processes to use.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

This function performs similarly to maplines(), except that it does its operations in parallel using fork() and Parallel::WorkUnit. Because the code in the coderef is executed in a child process, any changes it makes to variables in outer scopes will not be visible outside that single child. In general, it is safest not to modify anything that belongs outside the coderef's scope.

If a large amount of data is returned, the overhead of passing the data from the child processes to the parent may exceed the benefit of parallelization. If there is substantial line-by-line processing, there will likely be a speedup, but trivial loops will not speed up.

Note that the file will be read in several chunks, with each chunk being processed in a different child process. This means that the child processes may be operating on very different sections of the file simultaneously, and no specific order of execution of the coderef should be expected! However, the results will be returned in the same order as maplines() would return them.

Because of the mechanism used to split the file into chunks for processing, each child process may process a somewhat different number of lines. This is particularly true if there is a mix of very long and very short lines. The splitting routine splits the file into roughly equal-size chunks by byte count, not line count.

Otherwise, this function is identical to maplines().

readlines

  my (@result) = readlines "file.txt";

This function simply returns an array of lines (without newlines) read from a file.

OBJECT ORIENTED INTERFACE

The object oriented interface was implemented in version 1.181860.

new

  my $byline = File::ByLine->new();

Constructs a new object, suitable for the object oriented calls below.

ATTRIBUTES

extended_info

  $extended = $byline->extended_info();
  $byline->extended_info(1);

This was added in version 1.181951.

Gets and sets the "extended information" flag. This defaults to false, but if set to a true value, a second parameter is passed to all user-defined code (such as the per-line code in dolines and do, and the header_handler function).

For all code, this information will be passed as the second argument to the user defined code. It will be a hashref with the following keys defined:

filename - The filename currently being processed
object - An object corresponding to either the current explicit or implicit File::ByLine object
process_number - Which child process (first process is zero)

This hashref should not be modified by user code. In addition, no attributes of the explicit or implicit File::ByLine object passed as part of this hashref should be modified within user code.
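A minimal sketch of reading the extended information follows, assuming a scratch file created just for the example. The per-line code takes the hashref as its second argument and only reads from it.

```perl
use strict;
use warnings;
use File::ByLine;
use File::Temp qw(tempfile);

# Scratch input for the example; the contents are illustrative only.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "first\nsecond\n";
close $fh;

my $byline = File::ByLine->new();
$byline->extended_info(1);

# With extended_info set, the second argument is the information hashref.
my @tagged = $byline->map(
    sub {
        my ( $line, $info ) = @_;
        return "$info->{filename}: $line";    # read only, never modify $info
    },
    $path
);

print "$_\n" for @tagged;
```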

file

  my $file = $byline->file();
  $byline->file("abc.txt");
  $byline->file( [ "abc.txt", "def.txt" ] );
  $byline->file( "abc.txt", "def.txt" );

Gets and sets the default filename used by the methods in the object oriented interface. The default value is undef which indicates that no default filename is provided.

Instead of a single filename, a list or arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

header_all_files

  my $all_files = $byline->header_all_files();
  $byline->header_all_files(1);

Gets and sets whether the object oriented methods will call header_handler for every file if multiple files are passed into the file attribute.

The anticipated usage of this would be with extended_info set to true, with the header_handler function examining the filename key of the extended info hashref. Note that, to better accommodate parallel code execution, the headers of all files may be read before any data line in any file is read.

header_handler

  my $handler = $byline->header_handler();
  $byline->header_handler( sub { ... } );

Specifies code that should be executed on the header row of the input file. This defaults to undef, which indicates no header handler is specified. When a header handler is specified, the first row of the file is sent to this handler, and is not sent to the code provided to the various do/grep/map/lines methods in the object oriented interface.

The code is called with one or two parameters, the header line, and, if the extended_info attribute is set, the extended information hashref. The header line is also stored in $_.

When set, this is always executed in the parent process, not in the child processes that are spawned (in the case of processes being greater than one).

You cannot set a header handler while a header_skip value is set.
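One way this might be used is to capture column names from a comma-separated header and apply them to each data row. The sketch below assumes a made-up scratch file and the default single process, so the handler and the per-line code share the same @columns variable.

```perl
use strict;
use warnings;
use File::ByLine;
use File::Temp qw(tempfile);

# Scratch input for the example; the contents are illustrative only.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "name,year\nAda,1815\nGrace,1906\n";
close $fh;

my @columns;
my $byline = File::ByLine->new();

# The header line goes to this handler instead of the per-line code.
$byline->header_handler( sub { @columns = split /,/, $_[0] } );

my @rows = $byline->map(
    sub {
        my ($line) = @_;
        my %row;
        @row{@columns} = split /,/, $line;
        return \%row;
    },
    $path
);

print "$_->{name} was born in $_->{year}\n" for @rows;
```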

processes

  my $procs = $byline->processes();
  $byline->processes(10);

This gets and sets the degree of parallelism most methods will use. The default degree is 1, which indicates all tasks should only use a single process. Specifying 2 or greater will use multiple processes to operate on the file (see documentation for the parallel_* functions described above for more details).

skip_unreadable

  my $unreadable = $byline->skip_unreadable();
  $byline->skip_unreadable(1);

This was added in version 1.181980.

If this attribute is true, unreadable files are treated as empty files during processing. The default is false, in which case an exception is thrown when an access attempt is made to an unreadable file.

Short Name Aliases for Attributes

  $byline->f();     # Alias for file
  $byline->ei();    # Alias for extended_info
  $byline->haf();   # Alias for header_all_files
  $byline->hh();    # Alias for header_handler
  $byline->hs();    # Alias for header_skip
  $byline->p();     # Alias for processes
  $byline->su();    # Alias for skip_unreadable

Short name aliases were added in version 1.181980.

Each attribute listed above has a corresponding short name. This short name can also be used as a constructor argument.

METHODS

do

  $byline->do( sub { ... }, "file.txt" );

This performs the dolines functionality, calling the code provided. If the filename is not provided, the file attribute is used for this. See the dolines and parallel_dolines functions for more information on how this functions.

Each line (without newline) is passed to the coderef as the first parameter to the coderef. It is also placed into $_. If the extended_info attribute is true, the extended information hashref will be passed as the second parameter.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

grep

  my (@output) = $byline->grep( sub { ... }, "file.txt" );

This performs the greplines functionality, calling the code provided. If the filename is not provided, the file attribute is used for this. See the greplines and parallel_greplines functions for more information on how this functions.

Each line (without newline) is passed to the coderef as the first parameter to the coderef. It is also placed into $_. If the extended_info attribute is true, the extended information hashref will be passed as the second parameter.

The output is a list of all input lines where the code reference produces a true result.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

map

  my (@output) = $byline->map( sub { ... }, "file.txt" );

This performs the maplines functionality, calling the code provided. If the filename is not provided, the file attribute is used for this. See the maplines and parallel_maplines functions for more information on how this functions.

Each line (without newline) is passed to the coderef as the first parameter to the coderef. It is also placed into $_. If the extended_info attribute is true, the extended information hashref will be passed as the second parameter.

The output is the list produced by calling the passed-in code repeatedly for each line of input.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

lines

  my (@output) = $byline->lines( "file.txt" );

This performs the readlines functionality. If the filename is not provided, the file attribute is used for this. See the readlines function for more information on how this functions.

The output is a list of all input lines.

Note that this function is unaffected by the value of the processes attribute - it always executes in the parent process.

Instead of a single filename, an arrayref can be passed in, in which case the files are read in turn as if they are all one file. Note that if the file doesn't end in a newline, a newline is inserted before processing the next file.

SUGGESTED DEPENDENCY

The Parallel::WorkUnit module is a recommended dependency. It is required to use the parallel_* functions - all other functionality works fine without it.

Some CPAN clients will automatically try to install recommended dependencies, but others won't (cpan often, but not always, will; cpanm will not by default). In the cases where it is not automatically installed, you need to install Parallel::WorkUnit yourself to get this functionality.

AUTHOR

Joelle Maslak <jmaslak@antelope.net>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018 by Joelle Maslak.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.