NAME

Sort::Maker - A simple way to make efficient sort subs

SYNOPSIS

use Sort::Maker ;

my $sorter = make_sorter( ... ) ;

DESCRIPTION

This module has two main goals: to make it easy to create correct sort functions, and to make it simple to select the optimum sorting algorithm for the number of items to be sorted. Sort::Maker generates complete sort subroutines in one of four styles, plain, orcish manouver, Schwartzian Transform and the Guttman-Rosler Transform. You can also get the source for a sort sub you create via the sorter_source call.

make_sorter

The sub make_sorter is exported by Sort::Maker. It makes a sort sub and returns a reference to it. You describe how you want it to sort its input by passing general options and key descriptions to make_sorter.

Arguments to make_sorter

There are two types of arguments, boolean and value. Boolean arguments can be set just with the option name and can optionally be followed by '1'. You can easily set multiple boolean general arguments with qw(). Value arguments must have a following value. Arguments can appear in any order but the key descriptions (see below) must appear in their sort order. The code examples below show various ways to set the various arguments.

Arguments fall into four categories: selecting the style of the sort, key descriptions, setting defaults for key description attributes, and setting general flags and values. The following sections will describe the categories and their associated arguments.

Sort Style

The style of the sort to be made is selected by setting one of the following Boolean arguments. Only one must be set otherwise an error is reported (see below for error handling). Also see below for detailed descriptions of the supported sort styles.

plain
orcish
ST
GRT

# Make a plain sorter
my $plain_sorter = make_sorter( qw( plain ) ... ) ;

# Make an orcish manouevre sorter
my $orcish_sorter = make_sorter( orcish => 1 ... ) ;

# Make a Schwartzian Transform sorter
my $st_sorter = make_sorter( 'ST', 1, ... ) ;

# Make a GRT sort
my $GRT = make_sorter( 'GRT', ... ) ;

Key Attribute Defaults

The following arguments set defaults for the all of the keys' attributes. These default values can be overridden in any individual key. Only one of the attributes in each of the groups below can be set as defaults or for any given key. If more than one attribute is each group is set, then make_sorter will return an error. The attribute that is the default for each group is marked. See below for details on key attributes.

ascending	(default)
descending

no_case		(default)
case

signed
unsigned
signed_float	(default)
unsigned_float

fixed
varying

General Options

These arguments set general options that apply to how the generated sorter interacts with the outside world.

name

This is a value option which set the name of the sorter sub. The sort sub will be created and exported to the caller's namespace with this name.

make_sorter( name => 'my_sorter', ... ) ;
@sorted = my_sorter @unsorted ;

ref_in/ref_out

This boolean arguments specifies that the input to and output from the sort sub will be array references. ref_in makes the sorter only take as input a single array reference (which contains the unsorted records). ref_out makes the sorter only return a single array reference (which contains the sorted records). You can set both of these options in a sorter.

Note: This does not affect key extraction code which still gets each record in $_. It only modifies the I/O of the generated sorter.

# input records are in an array reference
my $sorter = make_sorter( qw( ref_in ), ... ) ;
@sorted_array = $sorter->( \@unsorted_input ) ;

# sorted output records are in an array reference
my $sorter = make_sorter( ref_out => 1, ... ) ;
$sorted_array_ref = $sorter->( @unsorted_input ) ;

# input and output records are in array references
my $sorter = make_sorter( qw( ref_in ref_out ), ... ) ;
$sorted_array_ref = $sorter->( \@unsorted_input ) ;

string_data

This boolean argument specifies that the input records will be strings. It is only valid for use with the GRT and it is ignored for the other sort styles. It tells the GRT that it can put the record directly into the string cache. See the section below on the GRT for more detail.

init_code

This value argument is code that will be put into the beginning of the generated sorter subroutine. It is meant to be used to declare lexical variables that the extraction code can use. Normally different extraction code have no way to share common code. By declaring lexicals with the init_code option, then some key extraction code can save data there for use by another key. This is useful if you have two (or more) keys that share a complex piece of code such as accessing a deep value is a record tree.

For example if the input record is an array of arrays of hashes strings and the string has 2 keys that need to be grabbed by a regex. The string is a string key, a ':' and a number key. So the common part of the key extraction is:

$_->[0][0]{a}

And the make_sorter call is:

my $sorter = make_sorter( 
	init_code => 'my( $str, $num ) ;',
	string => 'do{( $str, $num ) =
		$_->[0][0]{a} =~ /^(\w+):(\d+)$/; $str}',
	number => '$num'
) ;

In that code both keys are extracted in the first key extraction code and the number key is saved in $num. The second key extraction code just uses that saved value.

Key Description Arguments

Sorting data requires that records be compared in some way so they can be put into a proper sequence. The parts of the records that actually get compared are called its keys. In the simplest case the entire record is the key, as when you sort a list of numbers or file names. But in many cases the keys are embedded in the full record and they need to be extracted before they can be used in comparisons. Sort::Maker uses key descriptions that denote the from the record, and optional other attributes that will help optimize the sorting operation. This section will explain how to pass key description arguments to the make_sorter subroutine and what the various attributes mean and how to best use them.

The generated sorter will sort the records according to the order of the key arguments. The first key is used to compare a pair records and if are deemed equal, then the next key is examined. This happens until the records are given an ordering or you run out of keys and the records are deemed equal in sort order. Key descriptions can be mixed with the other arguments which can appear in any order and anywhere in the argument list, but the keys themselves must be in the desired order.

A key argument is either 'string' or 'number' followed by optional attributes. The key type sets the way that the key is compared (e.g. using 'cmp' or '<=>'). All key attributes can be set from the default values in the global arguments or set in each individual key description.

There are 4 ways to provide attributes to a key:

No attributes

A key argument which is either at end of the argument list or is followed by a valid keyword token has no explict attributes. This key will use the default attributes. In both of these examples, a default attribute was set and used by the key description which is just a single key argument.

# sort the record as a single number in descending order
my $sorter = make_sorter( qw( plain number descending ) ) ;

# sort the record as a case sensitive string
my $sorter = make_sorter( qw( plain case string ) ) ;

# sort the record as a single number in ascending order
my $sorter = make_sorter( qw( ST number ) ) ;

Only Code as a Value

A key argument which is followed by a scalar value which is not a valid keyword token, will use that scalar value as its key extraction code. See below for more on key extraction code.

# sort by the first (optionally signed) number matched
my $sorter = make_sorter( qw( plain number /([+-]?\d+)/ ) ) ;

# string sort by the 3rd field in the input records (array refs)
my $sorter = make_sorter( 'ST', string => '$_->[2]' ) ;

An Array Reference

A key argument which is followed by an array reference will parse that array for its description attributes. As with the general boolean arguments, any boolean attribute can be optionally followed by a '1'. Value attributes must be followed by their value.

# another way to specify the same sort as above
# sort by the first (optionally signed) number matched

my $sorter = make_sorter(
	qw( plain ),
	number => [
		code => '/(\d+)/',
		'descending',
	],
) ;

# same sort but for the GRT which uses the 'unsigned'
# attribute to optimize the sort.

my $sorter = make_sorter(
	qw( GRT ),
	number => [
		qw( descending unsigned ),
		code => '/(\d+)/',
	],
) ;

A Hash Reference

A key argument which is followed a hash reference will use that hash as its description attributes. Any boolean attribute in the hash must have a value of '1'. Value attributes must be followed by their value.

# another way to specify the same sort as above
# sort by the first (optionally signed) number matched

my $sorter = make_sorter(
	qw( plain ),
	number => {
		code => '/(\d+)/',
		descending => 1,
	],
) ;

# a multi-key sort. the first key is a descending unsigned
# integer and the second is a string padded to 10 characters

my $sorter = make_sorter(
	qw( GRT ),
	number => {
		code => '/(\d+)/',
		descending => 1,
		unsigned => 1,
	},
	string => {
		code => '/FOO<(\w+)>/',
		fixed => 10,
	},
) ;

Key Description Attributes

What follows are the attributes for key descriptions. Most use the default values passes in the arguments to make_sorter.

code

This value attribute is the code that will be used to extract a key from the input record. It must be a string of Perl code that operates on $_ and extracts a value. The code will be wrapped in a do{} block and called in a list context so that regular expressions can just use () to grab a key value. The code defaults to $_ which means the entire record is used for this key. You can't set the default for code (unlike all the other key attributes). See the section on Extraction Code for more.

# make an ST sort of the first number grabbed in descending order

my $sorter = make_sorter(
	qw( ST ),
	number => {
		code	=> '/(\d+)/',
		descending => 1,
	},
) ;

ascending/descending

These two attributes control the sorting order for this key. If a key is marked as ascending (which is the initial default for all keys), then lower keys will sort before higher keys. descending sorts have the higher keys sort before the lower keys. It is illegal to have both set in the defaults or in any key.

# sort by descending order of the first grabbed number
# and then sort in ascending order the first grabbed <word>

my $sorter = make_sorter(
	qw( ST descending ),
	number => {
		code	=> '/(\d+)/',
	},
	number => {
		code	=> '/<(\w+)>/',
		ascending => 1,
	},
) ;

# this will return undef and store an error in $@. 
# you can't have both 'ascending' and 'descending' as defaults

my $sorter = make_sorter(
	qw( ST ascending descending ),
	number => {
		code	=> '/(\d+)/',
		descending => 1,
	},
) ;

# this will return undef and store an error in $@. 
# you can't have both 'ascending' and 'descending' in a key

my $sorter = make_sorter(
	qw( ST )
	number => {
		code	=> '/(\d+)/',
		descending => 1,
		ascending => 1,
	},
) ;

case/no_case

These two attributes control the how 'string' keys handle case sensitivity. If a key is marked as case (which is the initial default for all keys), then keys will treat upper and lower case letters as different. If the key is marked as no_case then they are treated as equal. It is illegal to have both set in the defaults or in any key. Internally this uses the uc() function so you can use locale settings to affect string sorts.

# sort by the first grabbed word with no case
# and then sort the grabbed <word> with case

my $sorter = make_sorter(
	qw( ST no_case ),
	string => {
		code	=> '/(\w+)/',
	},
	string => {
		code	=> '/<(\w+)>/',
		case => 1,
	},
) ;

# this will return undef and store an error in $@. 
# you can't have both 'case' and 'no_case' as defaults

my $sorter = make_sorter(
	qw( ST no_case case ),
	string => {
		code	=> '/(\w+)/',
	},
) ;

my $sorter = make_sorter(
	qw( ST ascending descending ),
	number => {
		code	=> '/(\d+)/',
		descending => 1,
	},
) ;

# this will return undef and store an error in $@. 
# you can't have both 'case' and 'no_case' in a key

my $sorter = make_sorter(
	qw( ST )
	string => {
		code	=> '/(\w+)/',
		no_case	=> 1,
		case	=> 1,
	},
) ;

signed/unsigned/signed_float/unsigned_float (GRT only)

These attributes are only used by the GRT sort style. They are meant to describe the type of a number key so that the GRT can best process and cache the key's value. It is illegal to have more than one of them set in the defaults or in any key. See the section on GRT sorting for more.

The signed and unsigned attributes mark this number key as an integer. The GRT does the least amount of work processing an unsigned integer and only slightly more work for a signed integer. It is worth using these attributes if a sort key is restricted to integers.

The signed_float (which is the normal default for all keys) and unsigned_float attributes mark this number key as a float. The GRT does the less work processing an unsigned float then a signed float. It is worth using the unsigned_float attribute if a sort key is restricted to non-negative values. The signed_float attribute is supported to allow overriding defaults and to make it easier to auto-generate sorts.

fixed/varying (GRT only)

These attributes are only used by the GRT sort style. They are meant to describe the type of a string key so that the GRT can properly process and cache the key's value. It is illegal to have more than one of them set in the defaults or in any key. See the section on GRT sorting for more.

fixed is a Boolean attribute that marks this string key as always being a fixed length. The extracted value will either be padded with null (0x0) bytes or truncated to the specified length. The data in this key can have embedded null bytes (0x0) and also it can be sorted in descending order.

varying is a Boolean attribute marks this string key as being of varying lengths. The GRT sorter will do a scan of all of this key's values to find the maximum string length and then it pads all the extracted values to that length. The data in this key can have embedded null bytes (0x0) and also it can be sorted in descending order.

Key Extraction Code

Each input record must have its sort keys extracted from the data. This is the purpose of the 'code' attribute in key descriptions. The code must be a single string which operates on a record which is in $ and it must return the key value. The code is executed in a list context so you can use grabs in m// to return the key. Note that only the first grab will be used but you shouldn't have more than one anyway. See the examples below.

Code can be either a string or a qr// (Regexp) object. If qr// is used, the actual generated code will be m($qr) which works because qr// will interpolate to its string representation. The advantage of qr// over a string is that the qr// will be syntax checked at compile time while the string only later when the generated sorter is compiled by an eval.

Extraction code for a key can be set in one of three ways.

No explicit code

If you don't pass any extraction code to a key, it will default to $_ which is the entire record. This is useful in certain cases such as in simple sorts where you are sorting the entire record.

# sort numerically and in reverse order
my $sorter = make_sorter( qw( plain number descending ) ;

# sort with case folding
my $sorter = make_sorter( qw( plain no_case string ) ;

# sort by file time stamp and then by name
my $sorter = make_sorter( 'ST', number => -M, 'string' ) ;

Code is the only key attribute

In many cases you don't need to specify any specific key attributes (the normal or globally set defaults are fine) but you need extraction code. If the argument that follows a key type ( 'string' or 'number' ) is not a valid keyword, it will be assumed to be the extraction code for that key.

	# grab the first number string as the key
	my $sorter = make_sorter( qw( plain number /(\d+)/ ) ) ;

	# no_case string sort on the 3rd-5th chars of the 2nd array element
	my $sorter = make_sorter(
                plain	=> 1,
                no_case => 1,
                string	=> 'substr( $_->[1], 2, 3)'
	) ;

Key needs specific attributes

When the key needs to have its own specific attributes other than its code, you need to pass them in an ARRAY or HASH reference. This is mostly needed when there are multiple keys and the defaults are not correct for all the keys.

	# string sort by the first 3 elements of the array record with
        # different case requirements
	
	my $sorter = make_sorter(
                ST	=> 1,
                string	=> {
                        code	=> '$_->[0]',
                        no_case => 1,
                },
                string	=> '$_->[1]',
                string	=> {
                        code	=> '$_->[2]',
                        no_case => 1,
                },
	) ;

	# GRT sort with multiple unsigned integers and padded strings
	# note that some keys use a hash ref and some an array ref
	# the record is marked with key\d: sections
	my $sorter = make_sorter(
                GRT	=> 1,
                descending => 1,
                number	=> {
                        code	=> 'key1:(\d+)',
                        unsigned => 1,
                },
                number	=> [
                        code	=> 'key2:([+-]?\d+)',
                        qw( signed ascending ),
                ],
                string	=> [
                        code	=> 'key3:(\w{10})',
                        fixed => 1,
                        ascending => 1,
                ],
		# pad the extracted keys to 8 chars
                string	=> {
                        code	=> 'key4:([A-Z]+)',
                        pad => 8,
                },
	) ;

Key Caching

A good question to ask is "What speed advantages do you get from this module when all the sorts generated use Perl's internal sort function?" The sort function has a O( N * log N ) growth function which means that the amount of work done increases by that formula as N (the number of input records) increases. In a plain sort this means the the key extraction code is executed N * log N times when you only have N records. That can be a very large waste of cpu time. So the other three sort styles speed up the overall sort by only executing the extraction code N times by caching the extracted keys. How they cache the keys is their primary difference. To compare or study the actual code generated for the different sort styles, you can run make_sorter and just change the style. Then call sorter_source (not exported by default) and pass it the sort code reference returned by make_sorter. It will return the generated sort source.

plain

Plain sorting doesn't do any key caching. It is fine for short input lists (see the Benchmark section) and also as a way to see how much CPU is saved when using one of the other styles.

orcish

The Orcish maneuvre (created by Joseph Hall) caches the extracted keys in a hash. It does this with code like this:

$cache{$a} ||= CODE($a) ;

CODE is the extract code and it operates on a record in $a. If we have never seen this record before then the cache entry will be undef and the ||= operator will assign the extracted key to that hash slot. The next time this record is seen in a comparison, the saved extracted key will be found in the hash and used. The name orcish comes from OR-cache.

ST

The ST (Schwartzian Transform and popularized by Randal Schwartz) uses an anonymous array to store the record and its extracted keys. It first execute a map that creates an anonymous array:

map [ $_, CODE1( $_ ), CODE2( $_ ) ], @input

The CODE's extract the set of keys from the record but only once per record so it is O(N). Now the sort function can just do the comparisons and it returns a list of sorted anonymous arrays.

sort {
	$a->[1] cmp $b->[1]
		||
	$a->[2] cmp $b->[2]
}

Finally, we need to get back the original records which are in the first slot of the anonymous array:

map $_->[0]

This is why the ST is known as a map/sort/map technique.

GRT

The Guttman-Rosler Transform (popularized by Uri Guttman and Larry Rosler) uses a string to cache the extracted keys as well as either the record or its index. It is also a map/sort/map technique but because its cache is a string, it can be sorted without any Perl level callback (the {} block passed to sort). This is a signifigant win since that callback is running O( N log N). But this speedup comes at a cost of complexity. You can't just join the keys into a string and properly sort them. Each key may need to be processed so that it will correctly sort in order and it doesn't interfere with other keys. That is why the GRT has several key attributes to enable it to properly and efficiently pack the sort keys into a single string. The following lists the GRT key attributes, when you need them and what key processing is done for each. Note that you can always enable the GRT specific attributes as they are just ignored by the other sort styles.

GRT

The GRT gains its speed by using a single byte string to cache all of the extracted keys from a given input record. Packing keys into a string such that it will lexically sort the correct way requires some deep mojo and data munging. But that is why this module was written - to hide all that from the coder. Below are descriptions of how the various key types are packed and how to best use the GRT specific key attributes. Note: you can only use one of the GRT number or string attributes for any key. Setting more than one in either the defaults or in any given key is an error (a key's attribute can override a default choice).

unsigned

The 'unsigned' attribute tells the GRT that this number key is a non-negative integer. This allows the GRT to just pack it into 4 bytes using the N format (network order - big endian). An integer packed this way will have its most significant bytes compared before its least signifigant bytes. This involves the least amount of key munging and so it is the most efficient way to sort numbers in the GRT.

If you want this key to sort in descending order, then the key value is negated and normalized (see the 'signed' attribute) so there is no advantage to using 'unsigned'.

signed

The 'signed' attribute tells the GRT that this number key is an integer. This allows the GRT to just pack it into 4 bytes using the N format (network order - big endian). The key value must first be normalized which will convert it to an unsigned integer but with the same ordering as a signed integer. This is simply done by inverting the sign (highest order) bit of the integer. As mentioned above, when sorting this key in descending order, the GRT just negates the key value.

NOTE: In the GRT the signed and unsigned integer attributes only work on perl built with 32 bit integers. This is due to using the N format of pack with is specified to be 32 bits. A future version may support 64 bit integers (anyone want to help?).

unsigned_float

The 'unsigned_float' attribute tells the GRT that this number key is a non-negative floating point number. This allows the GRT to pack it into 8 bytes using the 'd' format. A float packed this way will have its most significant bytes compared before its least signifigant bytes.

signed_float

The signed_float attribute (which is the default for all number keys when using the GRT) tells the GRT that this number key is a floating point number. This allows the GRT to pack it into 8 bytes using the 'd' format. A float packed this way will have its most significant bytes compared before its least signifigant bytes. When processed this key will be normalized to an unsigned float similar to to the signed to unsigned conversion mentioned above.

NOTE: The GRT only works with floats that are in the IEEE format for doubles. This includes most modern architectures including x86, sparc, powerpc, mips, etc. If the cpu doesn't have IEEE floats you can either use the integer attributes or select another sort style (all the others have no restriction on float formats).

simple string.

If a string key is being sorted in ascending order with the GRT and it doesn't have one of the GRT string attributes, it will be packed without any munging and a null (0x0) byte will be appended to it. This byte enables a shorter string to sort before longer ones that starts with the shorter string.

NOTE: You cannot sort strings in descending order in the GRT unless the key has either the 'fixed' or 'varying' attributes set. Also if a string is being sorted in ascending order but has any null (0x0) bytes in it, the key must have one of those attributes set.

fixed

This boolean attribute tells the GRT to pack this key value as a fixed length string. The extracted value will either be padded with null (0x0) bytes or truncated to the specified length. This means it can be packed into the cache string with no padding and no trailing null byte is needed. The key can contain any data including null (0x0) bytes. The only data munging is if the key's sort order is descending, then the key value is xor'ed with a same length string of 0xff bytes. This toggles each bit which allows for a lexical comparison but in the reverse order. This same bit inversion is used for descending varying strings.

varying

This Boolean attribute tells the GRT that this key value is a varying length string and has no predetermined padding length. A prescan is done to determine the maximum string length for this key and that is used as the padding length. The rest is the same as with the 'fixed' attribute.

sorter_source

This sub (which can be exported) returns the source of a generated sort sub or the source of the last one that had an error. To get the source of an existing sort sub, pass it a reference to that sub (i.e. the reference returned from make_sorter). To get the source for a failed call to make_sorter, don't pass in any arguments.

my $sorter = make_sorter( ... ) ;
print sorter_source( $sorter ) ;

make_sorter( name => 'my_sorter', ... ) ;
print sorter_source( \&my_sorter ) ;

my $sorter = make_sorter( ... ) 
	or die "make_sorter error: $@\n", sorter_source();

If all you want is the generated source you can just do:

print sorter_source make_sorter( ... ) ;

Error Handling

When make_sorter detects an error (either bad arguments or when the generated sorter won't compile), it returns undef and set $@ to an error message. The error message will include the generated source and compiler and warning errors if the sorted didn't compile correctly. The test t/errors.t covers all the possible error messages. You can also retrieve the generated source after a compiling error by calling sorter_source.

TESTS

Sort::Maker uses a table of test configurations that can both run tests and benchmarks. Each test script is mostly a table that generate multiple versions of the sorters, generate sample data and compare the sorter results with a sort that is known to be good. If you run the scripts directly and with a -bench argument, then they generate the same sorter subs and benchmark them. This design ensures that benchmarks are running on correctly generated code and it makes it very easy to add more test and benchmark variations. The code that does all the work is in t/common.pl. Here is a typical test table entry:

{
	skip	=> 0,
	source	=> 0,
	name	=> 'init_code',
	gen	=> sub { rand_choice( @string_keys ) . ':' .
			 rand_choice( @number_keys ) },
	gold	=> sub {
		 ($a =~ /^(\w+)/)[0] cmp ($b =~ /^(\w+)/)[0]
		 		||
		 ($a =~ /(\d+$)/)[0] <=> ($b =~ /(\d+$)/)[0] 
	},
	args	=> [
		init_code => 'my( $str, $num ) ;',
		string => 'do{( $str, $num ) = /^(\w+):(\d+)$/; $str}',
		number => '$num',
	],
},

skip is a boolean that causes this test/benchmark to be skipped. Setting source causes the sorter's source to be printed out. gen is a sub that generated a single record. There are random data subs in t/common.pl. Some tests have a data field which is fixed data for a test (instead of the generated data). The <gold> field is a comparision subroutine usable by the sort function. It is used to sort the test data into a golden result which is used to compare against all the generated sorters. args is an anonmyous of arguments for a sorter or a hash ref with multiple named/args pairs. See t/io.t for an example of that.

BENCHMARKS

EXPORT

This module always exports the make_sorter sub. It can also optionally export sorter_source.

AUTHOR

Uri Guttman, <uri@stemsystems.com>

ACKNOWLEDGEMENTS

I would like to thank the inimitable Damian Conway for his help in the API design, the POD, and for being a good Perl friend.

And thanks to Boston.pm for the idea of allowing qr// for key extraction code.