NAME

File::ByLine - Line-by-line file access loops

VERSION

version 1.002

SYNOPSIS

  use File::ByLine;

  #
  # Execute a routine for each line of a file
  #
  forlines "file.txt", sub { say "Line: $_" };

  #
  # Grep (match) lines of a file
  #
  my (@result) = greplines { m/foo/ } "file.txt";

  #
  # Apply a function to each line and return result
  #
  my (@result) = maplines { lc($_) } "file.txt";

  #
  # Parallelized forlines routine
  #
  parallel_forlines "file.txt", 10, sub { foo($_); };

  #
  # Parallelized maplines and greplines
  #
  my (@result) = parallel_greplines { m/foo/ } "file.txt", 10;
  my (@result) = parallel_maplines  { lc($_) } "file.txt", 10;

  #
  # Read an entire file, split into lines
  #
  my (@result) = readlines "file.txt";

DESCRIPTION

I found myself writing the same trivial loops to read files, or relying on modules like Perl6::Slurp that didn't quite do what I needed (abstracting the loop). It was clear that something easy, simple, and sufficiently Perl-ish was needed.

FUNCTIONS

forlines

  forlines "file.txt", sub { say "Line: $_" };
  forlines "file.txt", \&func;

This function calls a coderef once for each line in the file. The file is read line by line; the newline character(s) are removed from each line before the coderef is executed.

Each line (without newline) is passed to the coderef as its first and only parameter. It is also placed into $_.

This function returns the number of lines in the file.
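
The behaviour described above can be approximated with a hand-written loop. The following is a minimal sketch in core Perl, not the module's actual code: the name forlines_sketch is hypothetical, and it assumes that "newline character(s)" means a trailing LF or CRLF.

```perl
use strict;
use warnings;

# Hypothetical stand-in for forlines(); not the module's actual code.
sub forlines_sketch {
    my ($file, $code) = @_;

    open my $fh, '<', $file or die "Cannot open $file: $!";

    my $count = 0;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;   # assumption: strip a trailing LF or CRLF
        local $_ = $line;       # expose the line as $_ ...
        $code->($line);         # ... and as the only argument
        $count++;
    }
    close $fh;

    return $count;              # number of lines read
}
```

A call such as forlines_sketch("file.txt", sub { print "Line: $_\n" }) would then print every line and return the line count.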

parallel_forlines

  my (@result) = parallel_forlines "file.txt", 10, sub { foo($_) };

Three parameters are required: a filename, the number of simultaneous child processes to use, and a coderef.

This function performs similarly to forlines(), except that it does its operations in parallel using fork() and Parallel::WorkUnit. Because the code in the coderef is executed in a child process, any changes it makes to variables in higher scopes will not be visible outside that single child. In general, it is safest not to modify anything that belongs outside the coderef's scope.
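
The scoping caveat is a general property of fork(), and can be seen without this module. The sketch below uses a plain fork() rather than Parallel::WorkUnit, and the variable $counter is purely illustrative.

```perl
use strict;
use warnings;
use POSIX ();

my $counter = 0;

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child process: this changes the child's private copy only.
    $counter = 42;
    POSIX::_exit(0);    # leave the child without running END blocks
}

waitpid($pid, 0);       # wait for the child to finish

# The parent's copy is untouched by the child's assignment.
print "counter in parent: $counter\n";
```

The parent still sees its original value, which is why results have to be returned from the coderef rather than stored in outer variables.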

Note that the file will be read in several chunks, with each chunk being processed in a different child process. This means that the children may be operating on very different sections of the file simultaneously, and no specific order of execution of the coderef should be expected!

Because of the mechanism used to split the file into chunks for processing, each child may process a somewhat different number of lines. This is particularly true if there is a mix of very long and very short lines. The splitting routine splits the file into roughly equal-sized chunks by byte count, not line count.
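
To illustrate why byte-based splitting gives uneven line counts, here is one way such a split could be computed. This is a sketch under assumptions, not the module's splitting routine: the helper name chunk_ranges_sketch is hypothetical, and it assumes lines end in LF.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [start, end) byte ranges,
# each of which begins at the start of a line.
sub chunk_ranges_sketch {
    my ($file, $chunks) = @_;

    my $size = -s $file;
    return () unless $size;     # missing or empty file: no ranges

    open my $fh, '<:raw', $file or die "Cannot open $file: $!";

    my @starts = (0);
    for my $i (1 .. $chunks - 1) {
        # Equal-sized split point by byte count.
        my $guess = int($size * $i / $chunks);
        next if $guess <= $starts[-1];

        # Read from the byte before the guess through the end of
        # that line, so the boundary lands on the next line start.
        seek $fh, $guess - 1, 0 or die "seek failed: $!";
        my $rest = <$fh>;
        my $pos  = tell $fh;

        push @starts, $pos if $pos > $starts[-1] && $pos < $size;
    }
    close $fh;

    my @ends = (@starts[1 .. $#starts], $size);
    return map { [ $starts[$_], $ends[$_] ] } 0 .. $#starts;
}
```

With this approach a chunk made of a few long lines and a chunk made of many short lines cover a similar number of bytes, so the number of coderef calls per child can differ widely.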

Otherwise, this function is identical to forlines().

greplines

  my (@result) = greplines { m/foo/ } "file.txt";

This function calls a coderef once for each line in the file and returns only the lines for which the coderef returns a true value. This is similar to the grep built-in function, except that it operates on file input rather than array input.

Each line (without newline) is passed to the coderef as its first and only parameter. It is also placed into $_.

This function returns the lines for which the coderef evaluates as true.

parallel_greplines

  my (@result) = parallel_greplines { m/foo/ } "file.txt", 10;

Three parameters are required: a coderef, a filename, and the number of simultaneous child processes to use.

This function performs similarly to greplines(), except that it does its operations in parallel using fork() and Parallel::WorkUnit. Because the code in the coderef is executed in a child process, any changes it makes to variables in higher scopes will not be visible outside that single child. In general, it is safest not to modify anything that belongs outside the coderef's scope.

If a large amount of data is returned, the overhead of passing the data from the children to the parent may exceed the benefit of parallelization. If there is substantial per-line processing, there will likely be a speedup; trivial loops will not speed up.

Note that the file will be read in several chunks, with each chunk being processed in a different child process. This means that the children may be operating on very different sections of the file simultaneously, and no specific order of execution of the coderef should be expected! However, the results will be returned in the same order as greplines() would return them.

Because of the mechanism used to split the file into chunks for processing, each child may process a somewhat different number of lines. This is particularly true if there is a mix of very long and very short lines. The splitting routine splits the file into roughly equal-sized chunks by byte count, not line count.

Otherwise, this function is identical to greplines().

maplines

  my (@result) = maplines { lc($_) } "file.txt";

This function calls a coderef once for each line in the file and returns a list of the return values from those calls. This follows normal Perl rules: if the coderef returns a list, all elements of that list are added as distinct elements to the returned list. If the coderef returns an empty list, no elements are added.

Each line (without newline) is passed to the coderef as its first and only parameter. It is also placed into $_.

This is meant to be similar to the built-in map function.
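
The list-flattening rule can be seen in a small stand-in built on the map built-in. This is a sketch in core Perl, not the module's actual code: the name maplines_sketch is hypothetical, and it assumes lines end in LF or CRLF.

```perl
use strict;
use warnings;

# Hypothetical stand-in for maplines(); not the module's actual code.
sub maplines_sketch {
    my ($code, $file) = @_;

    open my $fh, '<', $file or die "Cannot open $file: $!";
    my @lines = <$fh>;
    close $fh;

    s/\r?\n\z// for @lines;     # assumption: strip a trailing LF or CRLF

    # map aliases $_ to each line and flattens every call's return
    # list into the result; an empty list contributes nothing.
    return map { $code->($_) } @lines;
}
```

For example, maplines_sketch(sub { split ' ', $_ }, "file.txt") would return every word in the file as a separate element, with blank lines adding no elements at all.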

This function returns the list of values returned by the coderef calls, in file order.

parallel_maplines

  my (@result) = parallel_maplines { lc($_) } "file.txt", 10;

Three parameters are required: a coderef, a filename, and the number of simultaneous child processes to use.

This function performs similarly to maplines(), except that it does its operations in parallel using fork() and Parallel::WorkUnit. Because the code in the coderef is executed in a child process, any changes it makes to variables in higher scopes will not be visible outside that single child. In general, it is safest not to modify anything that belongs outside the coderef's scope.

If a large amount of data is returned, the overhead of passing the data from the children to the parent may exceed the benefit of parallelization. If there is substantial per-line processing, there will likely be a speedup; trivial loops will not speed up.

Note that the file will be read in several chunks, with each chunk being processed in a different child process. This means that the children may be operating on very different sections of the file simultaneously, and no specific order of execution of the coderef should be expected! However, the results will be returned in the same order as maplines() would return them.

Because of the mechanism used to split the file into chunks for processing, each child may process a somewhat different number of lines. This is particularly true if there is a mix of very long and very short lines. The splitting routine splits the file into roughly equal-sized chunks by byte count, not line count.

Otherwise, this function is identical to maplines().

readlines

  my (@result) = readlines "file.txt";

This function simply returns an array of lines (without newlines) read from a file.

AUTHOR

Joelle Maslak <jmaslak@antelope.net>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018 by Joelle Maslak.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.