The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Sort::External - sort huge lists

SYNOPSIS

    my $sortex = Sort::External->new( -mem_threshold = 2 ** 24 );
    while (<HUGEFILE>) {
        $sortex->feed( $_ );
    }
    $sortex->finish;
    while (defined($_ = $sortex->fetch)) {
        &do_stuff_with( $_ );
    }

DESCRIPTION

Problem: You have a list which is too big to sort in-memory. Solution: Use Sort::External, the closest thing to a drop-in replacement for Perl's sort() function when dealing with unmanageably large lists.

Where's the sortex() function?

In a perfect world, Sort::External would export a sortex() function that you could swap with sort() and be done. That isn't possible, because it would have to return a list which would, in all likelihood, be too large to fit in memory -- otherwise, why use Sort::External in the first place?

Replacing sort() with Sort::External

undefs and refs

When you replace sort() with the "feed, finish, fetch" cycle of a Sort::External object, you have to watch out for "undefs and refs".

Perl's sort() function sorts undef values to the front of a list -- it will complain if you have warnings enabled, but it preserves their undef-ness. sort() also preserves references. In contrast, Sort::External's behavior is unpredictable and almost never desirable when you feed it either undef values or references. If you really care about sorting lists containing undefs or refs, you'll have to symbollically replace and restore them yourself.

Subtle changes to scalars, e.g. utf8 flags get stripped

Sort::External writes items to disk and reads them back. The stringified return scalars will be match input strings exactly, but a scalar input which was previously marked as utf8 will not come back with that flag set. (There are other subtle changes, but they wouldn't matter unless you're doing XS programming, in which case you already know what to expect.)

Memory management

Sort::External functions best when it can accumulate a large input cache before sorting the cache and flushing it to disk. For backwards compatibility reasons, the default trigger for a buffer flush is the accumulation of 10,000 array items, which may be too high if your items are large. However, starting at version .10, Sort::External implements -mem_threshold, an *experimental* feature which makes it possible to flush to disk when the amount of memory consumed by the input cache exceeds a certain number of bytes.

There are two important caveats to keep in mind about -mem_threshold. First, Sort::External uses Devel::Size to assess memory consumption, which tends to underestimate real memory consumption somewhat -- see the docs for details. Second, the amount of memory consumed by the Sort::External process will be no less than double what you set for -mem_threshold, and probably considerably more. Don't get too aggressive, because the only time a very high -mem_threshold provides outsized benefits is when it's big enough that it prevents a disk flush -- in which case, you might just want to use sort().

METHODS

new()

    my $sortscheme = sub { $Sort::External::b <=> $Sort::External::a };
    my $sortex = Sort::External->new(
        -sortsub         => $sortscheme,      # default sort: standard lexical
        -working_dir     => $temp_directory,  # default: see below
        -mem_threshold   => 2 ** 24,          # default: 0 (inactive)
        -cache_size      => 100_000,          # default: 10_000
        );

Construct a Sort::External object.

  • -sortsub -- A sorting subroutine. Be advised that you MUST use $Sort::External::a and $Sort::External::b instead of $a and $b in your sub. Before deploying a sortsub, consider using a packed-default sort instead, as described in the Sort::External::Cookbook. It's probably faster.

  • -working_dir -- The directory where the temporary sortfiles will reside. By default, this directory is created using File::Temp's tempdir() command.

  • -mem_threshold -- EXPERIMENTAL FEATURE. Allow the input cache to grow to -mem_threshold bytes before sorting it and flushing to disk. Suggested value: Start somewhere between 2**20 and 2**24: 1-16 Mb.

  • -cache_size -- The size for Sort::External's input cache, in sortable items. If -mem_threshold is enabled, -cache_size is ignored.

feed()

    $sortex->feed( @items );

Feed one or more sortable items to your Sort::External object. It is normal for occasional pauses to occur as caches are flushed and sortfiles are merged.

finish()

    ### if you intend to call fetch...
    $sortex->finish; 
    
    ### otherwise....
    use Fcntl;
    $sortex->finish( 
        -outfile => 'sorted.txt',
        -flags => (O_CREAT | O_WRONLY),
        );

Prepare to output items in sorted order. If you haven't yet exceeded the cache size, Sort::External never writes to disk -- it just sorts the items in-memory.

If you specify the parameter -outfile, Sort::External will attempt to write your sorted list to that outfile. By default, Sort::External will refuse to overwrite an existing file; if you want to override that behavior, you can pass Fcntl flags to finish() using the optional -flags parameter.

Note that you can either finish() to an -outfile, or finish() then fetch()... but not both.

fetch()

    while (defined ($_ = $sortex->fetch)) {
        &do_stuff_with( $_ );
    }

Fetch the next sorted item.

DISCUSSION

"internal" vs. "external" sorting

In the CS world, "internal sorting" refers to sorting data in RAM, while "external sorting" refers to sorting data which is stored on disk, tape, or any storage medium except RAM. The main goal when implementing an external sorting algorithm is to minimize disk I/O. Sort::External's routine can be summarized like so:

Cache sortable items in memory. Every X items, sort the cache and empty it into a temporary sortfile. As sortfiles accumulate, interleave them periodically into larger sortfiles. Complete the sort by emptying the input cache then interleaving the contents of all existing sortfiles into an output stream.

BUGS

Please report any bugs or feature requests to bug-sort-external@rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Sort-External.

ACKNOWLEDGEMENTS

Bug reports and patches: Ken Clarke, Chris Nandor.

SEE ALSO

The Sort::External::Cookbook.

File::Sort, File::MergeSort, and Sort::Merge as possible alternatives.

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT

Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.