Hustle::Table - Percomiled pattern dispatch to code with optional optimisation
use Hustle::Table; #1. Create a new object my $table=Hustle::Table->new; #2. Add entries which contain: # matcher: required. The matcher (ie regex, string, number) to test with # sub: required. The sub to call when matcher is 'true' when testing input # label: optional. For identification # count: optional. Used as priority when adding (larger number=> higher priority) # Can either be a array ref or hash ref $table->add( #entry as a hash ref { matcher => qr/regex (match)/, sub => sub{ #can access regex capture }}, #entry as array ref, with label and priority values set [qr/another/, sub{}, "my label", 10], #undef matcher is default match all. Must be array ref format [undef,sub {},"default",undef] ); #3. Prepare a dispatcher: # A dispatcher is sub reference, which is called directly with input to match my $dispatch = $table->prepare_dispatcher( type => "online", #either "online" (default="online") or "offline" cache => {}, #the hash to use as a cache reorder => 1, #reorder (default=1) table before building dispatcher reset => 0 #Reset (default=undef) entry counts ); #4. Dispatch input $dispatch->("thing to match","optional", "arguments"); # any number of arguments passsed to sub #5. Offline optimisation is also possible
This module provides a small class to construct a dispatch table and build a dispatcher for it. All interactions are done via the object/class methods so no exports are defined.
It supports straightforward optimisation of the dispatch table and also utilises a optional cache to squeeze even more performance out.
Notable features include:
Non exact (regexp) and exact (string) matching in the same table
Uses built-in Perl features (regex, given/when, smart-match)
Captures in regex are available in dispatched subs
Arguments supplied to dispatcher are available to executed vector
Cached pre-matching (optional)
Basic hit count and optimising (optional)
Fall through/catch all matching
The dispatch table is essentially a list of at least one entry which maps a matcher to a subroutine to call on a successfully match of the input.
Conceptually the list is looped over, applying the matchers sequentially to the input until a match is found. In practice it isn't a loop, but generated code reference with optimised ordering of the entries.
In the case of no successful match, a default catch all dispatch vector is called.
Matching performance is optionally boosted by using a hash as a cache. Hash lookup is much quicker than repeated conditional testing. Controlling which inputs are removed from the cached is dictated by the return value of the dispatch vector/sub
Simply calling the class constructor returns a new Table. There are no arguments to the constructor.
my $table=Hustle::Table->new;
A default catch all entry (an empty sub) is added automatically
The table is a blessed array reference and can be manipulated as such. It's not recommended to access the table directly other than dumping the contents to storage.
An entry contains the following fields
matcher can be anything that smart-matching can handle. However the focus on this module is on strings and regular expressions
matcher
"match exaclty this stirng" qr|match and capture (this|that)|
When matcher is a regex, any capturing is accessible in the target sub.
sub
The sub has access to any capture groups used in the matcher (if applicable). It is also passed all arguments, including the input, unmodified, used to call the dispatcher.
$dispatcher->("my input","optional", "arguments", "to", "dispatched", "sub");
The return value of the sub indicates if the input just matched is to be removed from the cache.
label is a user defined item to allow identification of the entry. This is useful for saving/loading from configuration files etc.
label
count This is a dual purpose attributes. When adding entries to the list, it is used as a priority. Higher numeric values are a higher priority. The list is sorted in descending order of priority, meaning the highest priority is the first element in the list.
count
During running of the dispatch, this is a tally recording how many times,the entry has been matched. This information can be then used as a priority later to reorder the list.
Entries are added in anonymous hash, anonymous array or flattened format, using the add method.
add
Array entries must contain four elements, in the order of:
$table->add([$matcher,$sub,$label,$count]);
Hashes ref format only need to specify the matcher and sub pairs
$table->add({matcher=>$matcher, sub=>$sub, label=>$label, count=>$count});
Single flattened format takes a list directly. It must contain 4 elements
$table->add(matcher=>$matcher, sub => $sub);
Single simple format takes two elements
$table->add(qr{some matcher}=>sub { say "run me" })
Or add at once using mixed formats together
$table->add( [$matcher, $sub, $label, $count], {matcher=> $matcher, sub=>$sub}, matcher=>$matcher, sub=>$sub );
In any case,$matcher and $sub are the only items which must be defined. $sub must also be a CODE reference.
$matcher
$sub
If a label is not specified, one will be generated automatically. All the labels used in a call to add are returned in the same order as the input arguments.
Auto generated labels are only unique in the lifetime of a single table. If permanent/globally unique labels are need, the user will need to generate them.
Removal of entries is via the remove method. It takes a list of labels to remove
remove
$table->remove("label1", "funky-name");
It returns a list of all entries removed.
The table contents be accessed can like a normal array reference
$table->@*; @$table;
It is not recommended to manipulate the entires directly
Each list has a default matcher that will unconditionally match the input. This entry is specified by using undef as the matcher when adding an entry. When set this way only the array format can be used.
undef
To make it more explicit, the it can also be changed via the set_default method.
set_default
The default subroutine of the 'default' entry does nothing.
Once all the entries required are added to the table, the dispatcher can be constructed by calling prepare_dispatcher:
prepare_dispatcher
my $dispatcher=$table->prepare_dispatcher(%args);
Arguments to this method include:
The type of dispatcher. Either "online" or "offline". If not specified, or an invalid value is supplied, an "online" dispatcher is created
Dispatcher which calls the dispatch vector, increases count statistics. This is what you normally would want to use in your code. Online mode.
Dispatcher which only updates the count statistics. DOES NOT call dispatch vector. Useful for running a input data set through to obtain priority levels. Offline training mode.
The hash ref to use as the dispatchers cache
Flag indicating the table should be reorderd/optimisted before building the disptacher.
Flag specifying if counter statistics should be reset before building a dispatcher
If no arguments are provided, the dispatcher will be created with the following defaults:
my $dispatcher=$table->prepare_dispatcher(type=>"online", cache=>undef, reset=>undef, reorder=>1);
The dispatcher is simply a sub. Any arguments passed to it are passed to the dispatch vector as is.
$dispatcher->("input", "argument1",2);
The return code of the dispatched sub is used to control if a successful match should be removed from the cache, or not be removed;
Any 'true' value returned will means the input, as a key to the cache, is to be removed from the cache.
Any 'false' value indicates the input is to remain in the cache or be added if it doesn't exist.
{matcher => qr/r(.)gexp/, sub => sub { return }}; #Returns undef so not removed from cache. {matcher => qr/re(.)gxp/, sub => sub { return 1}}; #Returns 'true' so removed from cache {matcher => qr/reg(.)xp/, sub => sub { say $1;}}; #Returns true. Last statement is say and returns 1
The concept is to perform less searching to find the dispatch vector. However the type of input does play an important role in determining the best way perform the search.
For example a uniformly distributed input will not gain benefits from reordering the entries in the list.
On the other hand when the distributing becomes 'centred', the reordering of the table can greatly improve the search time.
On my laptop, the simple benchmark script, (with 6 dispatch entries) (regex and string) shows around 1.2M dispatches/s for non cached and over 2M dispatches/s for cached dispatcher. Your results may vary.
There a couple of other dispatching modules on CPAN. Notably Smart::Dispatch has a nice syntax, probably more flexibility and the ability to return values instead of just executing a sub.
This module is smaller/(much)faster in my basic tests and also supports direct access to the regex capture groups and dispatch arguments.
Probably.
Please report via github
Write tests for removing entries
Write tests for offline dispatcher
Override array STORE and FETCH interface to ensure default is preserved
Add a more concrete performance discussion
Document how to run offline dispatcher for optimising
Ruben Westerberg, <drclaw@mac.com>
Copyright (C) 2021 by Ruben Westerberg
Licensed under MIT and GNU
THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
To install Hustle::Table, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Hustle::Table
CPAN shell
perl -MCPAN -e shell install Hustle::Table
For more information on module installation, please visit the detailed CPAN module installation guide.