The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Sort::External::Cookbook - sort huge lists efficiently

DESCRIPTION

Sort::External has a special affinity for the packed-default sort, which first came to the author's attention via a paper by Uri Guttman and Larry Rosler, "A Fresh Look at Efficient Perl Sorting", located as of this writing at .

For most applications, the packed-default is the most efficient Perl sorting transform. This document explores how to apply it to solve common coding problems.

Most of the material herein is also applicable to building a packed-default sort with Perl's built-in sort() function.

The Packed-Default Sort

Eliminate the sortsub!

Perl's sortsubs are extremely flexible and powerful, but they are also computationally expensive. If you can eliminate the sortsub entirely from your algo, major optimizations kick in as Perl decides that it can perform the whole sort at the C level without leaving to execute Perl code.

Like Perl's native sort() function, Sort::External really, really likes to use the default ascending lexical sort. If you are chasing efficiency, you should do everything you can to avoid supplying Sort::External with a sortsub.

Encode, sort, decode.

The standard technique for sorting a hash by value in Perl is to use a sortsub:

    my %colors = (
        "Granny Smith"     => "green",
        "Golden Delicious" => "yellow",
        "Pink Lady"        => "pink",
        );
    my @sorted_by_color = sort { $colors{$a} cmp $colors{$b} } keys %colors;
    
    ### Result:
    ### @sorted_by_color = ("Granny Smith", "Pink Lady", "Golden Delicious");

To accomplish the same goal using a packed-default sort, we use a three-step process:

    ### 1. Encode
    my @sortkeys;
    while (my ($variety, $color) = each %colors) {
        push @sortkeys, sprintf("%-6s%-16s", $color, $variety);
    }
    ### @sortkeys = (
    ###     "green Granny Smith    ",
    ###     "yellowGolden Delicious",
    ###     "pink  Pink Lady       ",
    ###     );
    
    ### 2. Sort
    my @sorted_by_color = sort @sortkeys; # no sortsub!
    
    ### 3. Decode
    @sorted_by_color = map { substr($_, 5) } @sorted_by_color;
    

Using Sort::External, steps 2 and 3 are modified:

    ### 1. Encode
    my @sortkeys;
    while (my ($variety, $color) = each %colors) {
        push @sortkeys, sprintf("%-6s%-16s", $color, $variety);
    }
    
    ### 2. Sort
    my $sortex = Sort::External->new( -mem_threshold => 2**24 );
    $sortex->feed( @sortkeys );
    $sortex->finish;
    
    ### 3. Decode
    my @sorted_by_color; 
    while (defined($_ = $sortex->fetch)) {
        push @sorted_by_color, substr($_, 5);
    }

For the sake of brevity, we'll use sort() instead of Sort::External for most of our examples.

Encoding techniques

sprintf

The sprintf technique for encoding sortkeys is the most straightforward and flexible. In addition to preparing multi-field text data for a lexical sort, it can be used to put numerical data into a string form where the lexical sort doubles as a numerical sort.

    my %prices = (
        "Granny Smith"     => 1.69,
        "Pink Lady"        => 1.99,
        "Golden Delicious" =>  .79, # on sale!
        );
    my @sortkeys;
    while (my ($variety, $price) = each %prices) {
        push @sortkeys, sprintf("%1.2f%-16s", $price, $variety);
    }
    ### @sortkeys = (
    ###     "1.69Granny Smith    ",
    ###     "0.79Golden Delicious",
    ###     "1.99Pink Lady       ",
    ###     );

There are two potential problems with the sprintf technique.

First, you must know the maximum width of all fields in advance. If you don't, you'll have to run through your data in a wasteful "characterization pass" solely for the purpose of measuring field widths.

Second, if the fields have a long maximum width but the content is sparse, sprintf-generated sortkeys take up a lot of unnecessary space. With Sort::External, that means extra disk i/o and reduced performance.

null-termination

An alternative to using sprintf to create fixed-width data is to separate fields with nulls. Use the two-byte Unicode null unless you are absolutely sure you don't need it:

    while (my ($variety, $color) = each %colors) {
        push @sortkeys, "$color\0\0$variety";
    }
    my @sorted = sort @sortkeys;
    s/.*?\0\0// for @sorted;
    

Null-termination often saves space over sprintf -- however, if your data may contain null bytes, you have to be very careful about how you use it.

pack

While pack() can be used as a less-flexible version of sprintf for encoding fixed-width strings, it can also encode positive integers into a sortable form. The four integer pack formats which sort correctly are:

    ### c => signed char           maximum value 127
    ### C => unsigned char         maximum value 255
    ### n => 16-bit network short  maximum value 32_767
    ### N => 32-bit network long   maximum value 2_147_483_647
    
    my %first_introduced = (
        "Granny Smith"     => 1868,
        "Pink Lady"        => 1970,
        "Golden Delicious" => 1890,
        );
    while (my ($variety, $year) = each %first_introduced) {
        push @sortkeys, (pack('n', $year) . $variety);
    }
    my $sortex = Sort::External->new( -mem_threshold  => 2**24 );
    $sortex->feed(@sortkeys);
    $sortex->finish;
    my @sorted; 
    while (defined ($_ = $sortex->fetch)) {
        push @sorted, substr($_, 2);
    }
    

Combining encoding techniques

It is often the case that a combination of techniques yields the most efficient sortkey encoding and decoding. Here is a contrived example:

    my %apple_stats = (
        "Granny Smith"     => {
                color => "green", 
                price => 1.69, 
                year  => 1868,
            },
        "Pink Lady"        => {
                color => "pink",
                price => 1.99, 
                year  => 1970,
            },
        "Golden Delicious" => {
                color => "yellow",
                price =>  .79, 
                year  => 1890,
            },
        );
    my @sortkeys;
    while (my $variety, $stats) = each %apple_stats) {
        push @sortkeys, (
                $stats->{color} . "\0\0" .
                sprintf("%1.2f", $stats->{price}) .
                pack('n', $stats->{year}) .
                "\0\0" . $variety
            );
    }
    my @sorted = sort @sortkeys;
    my @sorted_varieties = map { s/.*\0\0//; $_ } @sorted;

sprintf would be a much better choice in this particular case, since it makes it easier to recover the all the fields using unpack:

    while (my ($variety, $stats) = each %apple_stats) {
        push @sortkeys, (
            sprintf("%-6s%1.2f%04d%-16s", 
                @{ $stats }{'color','price','year'}, $variety
            );
    }
    my @sorted = sort @sortkeys;
    for (@sorted) {
        my (%result, $variety); 
        (@result{'color','price','year'}, $variety) 
            = unpack('a6 a4 a4 a16', $_);
        $_ = { $variety => \%result };
    }

Extended sorting techniques

Descending sort order using the "bitwise not" operator

You can reverse the order of a lexical sort by flipping all the bits on your sortkeys.

    while (my ($variety, $color) = each %colors) {
        push @sortkeys, sprintf("%-6s%-16s", $color, $variety);
    }
    @sortkeys = map { ~ $_ } @sortkeys;
    my @sorted = sort @sortkeys;
    @sorted = map { ~ $_ } @sorted;
    @sorted = map { substr($_, 5) } @sorted;

Multi-field sorting with mixed asc/desc order

Multi-field packed-default sorts are usually as straightforward as concatenating the fields using either sprintf or null-termination. However, if you need both ascending and descending sub-sorts within the same sort, special care is required. Null-termination is generally not an option, because bit-flipped fields may contain null bytes.

    ### sort by:
    ###     color descending,
    ###     price ascending,
    ###     year  descending
    while (my $variety, $stats) = each %apple_stats) {
        push @sortkeys, (
                (~ sprintf("%-6s", stats->{color}) ) .
                sprintf("%1.2f", $stats->{price}) .
                (~ sprintf("%04d", stats->{year}) ) .
                sprintf("%-16s", $variety);
            );
    }
    my @sorted = sort @sortkeys;

Case-insensitive sorting

Case-insensitive sorting is something of a weakness for the packed-default sort. The only way to avoid a sortsub is to double-encode:

    while (my ($variety, $color) = each %colors) {
        push @sortkeys, ( lc($variety) . "\0\0" . $variety );
        push @sortkeys, ( lc($color)   . "\0\0" . $color   );
    }
    my @sorted = sort @sortkeys;
    @sorted = map { s/.*?\0\0//; $_ } @sorted;
    ### @sorted = ( 
    ###     "Golden Delicious",
    ###     "Granny Smith",
    ###     "green",
    ###     "pink",
    ###     "Pink Lady",
    ###     "yellow",
    ###     );

A challenge

Ken Clarke has identified a case where there seems to be no elegant packed-default sort solution:

    ### Each row contains multiple fields
    ### AND is impossible to know the field lengths in advance
    ### AND there are both ascending and descending subsorts
    ### AND (the sort must be (case-insensitive AND case-preserving))

There doesn't appear to be a way to perform this sort without a characterization pass.

If you come up with an efficient solution, please write the author.

ACKNOWLEDGEMENTS

This document arose out of discussions with Ken Clarke.

Uri Guttman and Larry Rosler, as mentioned above.

SEE ALSO

Sort::External

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT

Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.