NAME

File::MergeSort - Mergesort ordered files.

SYNOPSIS

 use File::MergeSort;

 # Create the MergeSort object.
 my $sort = File::MergeSort->new(
                [ $file_1, ..., $file_n ],  # Anonymous array of input files
                \&extract_function,         # Sub to extract merge key
                );


 # Retrieve the next line for processing
 my $line = $sort->next_line;
 print $line, "\n";

 # Dump remaining records in sorted order to a file.
 $sort->dump( $file );    # Omit $file to default to STDOUT

DESCRIPTION

File::MergeSort provides methods to merge and process a number of pre-sorted files into a single sorted output.

Merge keys are extracted from the input lines using a user defined subroutine. Comparisons on the keys are done lexicographically.

If IO::Zlib is installed, both plaintext and compressed (.z or .gz) files are catered for.

File::MergeSort is intended as a straightforward solution for merging data files whose records are already sorted. One example is application server logs from a cluster, where each server records its events chronologically.

POINTS TO NOTE

ASCII order merging

Comparisons on the merge keys are carried out lexicographically. If the raw keys do not already sort correctly as strings (unpadded numbers, for example), the subroutine used to extract the merge keys must format them so that they do.

Note that earlier versions (< 1.06) of File::MergeSort performed numeric, not lexicographical comparisons.
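
As a minimal sketch of what formatting the keys can mean, the snippet below zero-pads a numeric key so that string comparison agrees with numeric order. The input layout (a line beginning with a numeric ID) and the width of 10 digits are assumptions made for illustration only. The snippet uses Perl's built-in sort rather than File::MergeSort itself.

```perl
use strict;
use warnings;

# Hypothetical input: lines that begin with a numeric ID and a space.
# Compared as strings, "9" sorts after "10", so pad to a fixed width.
my $extract = sub {
    my ($id) = $_[0] =~ /^(\d+)/;
    return sprintf '%010d', $id // 0;
};

my @lines = ( '9 nine', '10 ten', '100 hundred' );

# Plain string comparison of the raw lines puts "9 nine" last.
my @naive = sort { $a cmp $b } @lines;

# String comparison of the padded keys gives the numeric order.
my @sorted = sort { $extract->($a) cmp $extract->($b) } @lines;

print "naive:  @naive\n";
print "padded: @sorted\n";
```

A coderef like $extract could then be passed as the second argument to the constructor, provided the input files are already sorted by that ID.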

IO::Zlib is optional

IO::Zlib is no longer a prerequisite. If IO::Zlib is installed, File::MergeSort will use it to handle compressed input files.

If IO::Zlib is not installed and compressed files are specified as input files, File::MergeSort will raise an exception.

If you do not need to process compressed files, there is no longer any need to install IO::Zlib to use File::MergeSort.
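
If you would rather detect a missing IO::Zlib before starting a merge, a check along the following lines may help. This is only a sketch: looks_compressed is a hypothetical helper, not part of File::MergeSort, and its suffix test is an assumption based on the .z and .gz extensions mentioned in this document. The file names are placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of File::MergeSort: true if the file name
# ends in one of the compressed suffixes mentioned above (.z or .gz).
sub looks_compressed {
    my ($file) = @_;
    return $file =~ /\.(?:z|gz)\z/ ? 1 : 0;
}

# Check once whether IO::Zlib can be loaded on this system.
my $have_zlib = eval { require IO::Zlib; 1 } ? 1 : 0;

my @files = qw( app_1.log app_2.log.gz );    # illustrative names only
for my $file (@files) {
    if ( looks_compressed($file) && !$have_zlib ) {
        warn "$file looks compressed but IO::Zlib is not installed\n";
    }
}
```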

DETAILS

The user is expected to supply a list of file pathnames and a function to extract an index value from each record line (the merge key).

By calling the "next_line" or "dump" method, the user can retrieve the records in sorted order.

As arguments, File::MergeSort takes a reference to an anonymous array of file paths/names and a reference to a subroutine that extracts a merge key from a line.

The array lists the files to be merged, and the subroutine determines the order in which their records are merged.

File::MergeSort opens each file using IO::File, or IO::Zlib for compressed files. Mixed compressed and uncompressed input is handled seamlessly by detecting files with .z or .gz extensions.

When passed a line (a scalar, passed as the first and only argument, $_[0]) from one of the files, the user supplied subroutine must return the merge key for the line.

The records are then output in ascending order based on the merge keys returned by the user supplied subroutine. A stack is created based on the merge keys returned by the subroutine.

When the next_line method is called, File::MergeSort returns the line with the lowest merge key/value.

File::MergeSort then replenishes the stack: it reads a new line from the file that supplied the returned line and places it in the proper position, ready for the next call to next_line.

If a simple merge is required, without any user processing of each line read from the input files, the dump method can be used to read and merge the input files into the specified output file, or to STDOUT if no file is specified.
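
The merge procedure described above can be sketched in plain Perl. The code below is not the module's implementation. It is an illustrative version that works on in-memory lists instead of file handles, and the names merge_sorted, @pending and idx are hypothetical.

```perl
use strict;
use warnings;

# Merge several already-sorted lists of lines into one sorted list.
#   $sources : array ref of array refs, each sorted by its merge key
#   $extract : code ref returning the merge key for a line
sub merge_sorted {
    my ( $sources, $extract ) = @_;

    # Start with the first line of every non-empty source, sorted by key.
    my @pending;
    for my $src (@$sources) {
        next unless @$src;
        push @pending,
            { line => $src->[0], key => $extract->( $src->[0] ), src => $src, idx => 1 };
    }
    @pending = sort { $a->{key} cmp $b->{key} } @pending;

    my @merged;
    while ( my $head = shift @pending ) {
        push @merged, $head->{line};

        # Refill from the source that just supplied a line, if it has more.
        my $src = $head->{src};
        next if $head->{idx} >= @$src;
        my $line  = $src->[ $head->{idx} ];
        my $entry = { line => $line, key => $extract->($line), src => $src, idx => $head->{idx} + 1 };

        # Insert the new entry at its sorted position among the pending ones.
        my $i = 0;
        $i++ while $i < @pending && $pending[$i]{key} le $entry->{key};
        splice @pending, $i, 0, $entry;
    }
    return \@merged;
}

my $merged = merge_sorted(
    [ [ 'a 1', 'd 1' ], [ 'b 2', 'c 2', 'e 2' ] ],
    sub { substr $_[0], 0, 1 },    # merge key: first character of the line
);
print "$_\n" for @$merged;
```

Only one line per input is held at a time, which is why the inputs must already be sorted by the same key.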

CONSTRUCTOR

new( ARRAY_REF, CODE_REF );

Create a new File::MergeSort object.

There are two required arguments:

A reference to an array of files to read from. These files can be either plaintext or compressed. Any file with a .gz or .z suffix will be opened using IO::Zlib.

A code reference. When called, the coderef should return the merge key for a line, which is given as the only argument to that subroutine/coderef.
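
As an illustration of the second argument, here is one possible key-extraction coderef. The record layout (tab-separated host, ISO-8601 timestamp and message) is an assumption made for this sketch, as are the file names in the trailing comment.

```perl
use strict;
use warnings;

# Hypothetical record layout: "host<TAB>ISO-8601 timestamp<TAB>message".
# The timestamp is fixed width, so it compares correctly as a string.
my $extract = sub {
    my ($line) = @_;
    my ( undef, $timestamp ) = split /\t/, $line, 3;
    return $timestamp // '';    # empty key if the line has no second field
};

my $key = $extract->("web01\t2006-03-14T09:26:53\tGET /index.html");
print "$key\n";

# Passing it to the constructor, as in the SYNOPSIS (placeholder file names):
#   my $sort = File::MergeSort->new( [ 'web01.log', 'web02.log' ], $extract );
```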

METHODS

next_line( );

Returns the next line from the merged input files.

dump( [ FILENAME ] );

Reads and merges from the input files to FILENAME, or STDOUT if FILENAME is not given, until all files have been exhausted.

Returns the number of lines output.

EXAMPLES

  # This program reads the log files found in logfiles/ and returns
  # their records sorted by the date, which is in mm-dd-yyyy format

  use File::MergeSort;

  my $files = [ qw( logfiles/log_server_1.log
                    logfiles/log_server_2.log
                    logfiles/log_server_3.log
                ) ];

  my $sort = File::MergeSort->new( $files, \&index_sub );

  while (my $line = $sort->next_line) {
     # some operations on $line
  }

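  sub index_sub {

    # Extract a date of the form mm-dd-yyyy from the line.

    my $line = shift;

    # The first mm-dd-yyyy pattern anywhere in the line is used, so make
    # sure that it is the date you want to merge on.
    my ( $mm, $dd, $yyyy ) = $line =~ /(\d{2})-(\d{2})-(\d{4})/
        or die "No date found in line: $line";

    return "$yyyy$mm$dd";  # Key is a fixed-width string, yyyymmdd.
                           # Lexicographically lower keys are read first.
  }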



  # This slightly more compact example performs a simple merge of
  # several input files with fixed width merge keys into a single
  # output file.

  use File::MergeSort;

  my $files   = [ qw( input_1 input_2 input_3 ) ];
  my $extract = sub { substr($_[0], 15, 10 ) };  # To substr merge key out of line

  my $sort = File::MergeSort->new( $files, $extract );

  $sort->dump( "output_file" );

TODO

 + Implement a generic test/comparison function to replace text/numeric comparison.
 + Implement a configurable record separator.
 + Allow for optional deletion of duplicate entries.
 + Ensure input is really in correct sort order. This is currently up to the user.

EXPORTS

Nothing: OO interface. See CONSTRUCTOR and METHODS.

COPYRIGHT AND LICENSE

Copyright (c) 2001-2006 various authors.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHORS

Original Author

Christopher Brown <ctbrown@cpan.org>.

Maintainer

Barrie Bremner http://barriebremner.com/.

Contributors

Laura Cooney.

SEE ALSO

perl, IO::File, IO::Zlib, Compress::Zlib.

File::Sort or Sort::Merge as possible alternatives.