The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Sort::Merge - general merge sort

SYNOPSIS

   use Sort::Merge;
   use IO::File;
   my @output;
   Sort::Merge::sort_coderefs([sub{our $i=0; return ($i++ x 2)},
                               sub{our $j=0; return ($j++*5 x 2)}],
                              sub{print shift->[1});

DESCRIPTION

Merge sorting is a technique for merging together multiple input streams, each of which is sorted, into one larger output stream, still sorted. This module implements that algorithm.

Note the large caveat above: the inputs must already be sorted. If the inputs aren't sorted, you'll need to sort them yourself first, in which case a merge sort may not be the right algorithm for the job.

Sort::Merge was designed to give you, the programmer, control over where your data comes from, and where it goes -- it's simply supposed to sort it, and not fetch it from files, parse out sort keys, etc, etc. There's at least one merge-sorting module already on CPAN, File::MergeSort, but I wanted one that I could use to sort together multiple files that were of different formats, and not line-oriented formats either.

The module I got out doesn't even require that the inputs are files at all, or that the output gets printed. It also streams the output, so you needn't wait until the sort is complete before printing the first piece of output. (The fact that this is possible at all is one of more useful properties of merge sorts vs. other sorts.)

Most (OK, at the moment all) of the available interfaces require you to write your input and output handlers in terms of coderefs. If this is too much work for you, I encourage you to not use this module, or, alternatively, to propose another interface to me. I'm actively looking for more and better interfaces.

Sources

There are two types of coderefs Sort::Merge makes you write, sources and the output.

Sources are sources of input data. They are called with no parameters. They are expected to return a two-element list if they have more data.

The first element should be a sort key, which will be compared lexographicly (using cmp). If you wish to think in terms of numerical sort keys, simply run it through chr, pack, or sprintf to get a representation that will sort properly. (Passing it to chr requires that the number is a nonnegative integer, and may warn for some values, but is quite fast.)

The second element may be any arbitrary scalar. This is your datapoint. It is passed to the output subroutine without being examined. It can be a line of text, an arrayref or hashref, or even an object, or undef for all Sort::Merge cares.

If the source has run out of data, it should return nothing (an empty list, not undef). If it wishes to signal an error and abort processing, it can simply die; the error will be propagated.

Output

Because I'm lazy, the output callback gets passed the source structure. That is, it gets an arrayref of three values. The first is the coderef, the second is the key, the third is the data.

To put it another way, it gets an arrayref of the coderef, followed by what the coderef has returned.

sort_coderefs

sort_coderefs takes an array-ref of source coderefs, and a single output coderef.

The output coderef is called for each element, in sorted order, and the function returns empty-list.

SEE ALSO

File::MergeSort, the source code.

COPYRIGHT AND DISCLAIMERS

Copyright (c) 2003 James M. Mastros. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

AUTHOR

James Mastros <james@mastros.biz>, theorbtwo on perlmonks