NAME
Hustle::Table - Pattern dispatch to code with optional optimisation
SYNOPSIS
use Hustle::Table;
#1. Create a new object
my $table=Hustle::Table->new;
#2. Add entries which contain:
# matcher: required. The matcher (ie regex, string, number) to test with
# sub: required. The sub to call when matcher is 'true' when testing input
# label: optional. For identification
# count: optional. Used as priority when adding (larger number=> higher priority)
# Can either be a array ref or hash ref
$table->add(
#entry as a hash ref
{ matcher => qr/regex (match)/, sub => sub{ #can access regex capture }},
#entry as array ref, with label, priority values and context set
[qr/another/, sub{}, "my label", 10,{}],
#undef matcher is default match all. Must be array ref format
[undef,sub {},"default",undef,undef]
);
#3. Prepare a dispatcher:
# A dispatcher is sub reference, which is called directly with input to match
my $dispatch = $table->prepare_dispatcher(
type => "online", #either "online" (default="online") or "offline"
cache => {}, #the hash to use as a cache
reorder => 1, #reorder (default=1) table before building dispatcher
reset => 0 #Reset (default=undef) entry counts
);
#4. Dispatch input
$dispatch->("thing to match", "optional", "arguments");
#The dispatched sub is called with the match entry and then all argument from the second onwards
#5. Offline optimisation is also possible
DESCRIPTION
This module provides a small class to constructs a dispatch table and
builds a dispatcher for it. All interactions are done via the
object/class methods so no exports are defined.
It supports straightforward optimisation of the dispatch table and an
optional cache to squeeze even more performance out.
Notable features include:
* Captures in regex are available in dispatched subs
* Arguments supplied to dispatcher (except first) are available to
executed vector sub
* Cached pre-matching (optional)
* Basic hit count and optimising (optional)
* Fall through/catch all matching
The dispatch table is essentially a list of at least one entry which
maps a matcher to a subroutine which is called on a successful match of
the input.
Conceptually the list is looped over, applying the matchers sequentially
to the input until a match is found. In practice it isn't a loop, but
generated code reference with potentially optimised ordering of the
entries.
In the case of no successful match, a default catch all dispatch vector
is called.
Matching performance is optionally boosted by using a hash as a cache.
Hash lookup is much quicker than repeated conditional testing.
Controlling which inputs are removed from the cached is dictated by the
return value of the dispatch vector/sub
VERSION DIFFERENCES
Version 0.4.0 and later "given"/"when" and thus the use of smart
matching is removed. This gives a nice performance improvement of up to
20% over previous versions, however does mean the features of smart
matching (i.e. allowing string equality test ) is no longer supported.
To match an exact string you will need to make a regex matcher like
"^exact$".
Version 0.3.0 and later The input string (the first arg to the
dispatcher) is removed and replaced with the matching table entry when
calling a dispatched sub. In Previous versions, all arguments would be
passed to the dispatched sub unchanged.
If you need the input string that was being tested, pass it a second
time to the dispatcher:
#From v0.3.0 Onwards
$dispatcher->($input, $input, $another, $arg. ..._)
CREATING A TABLE
Simply calling the class constructor returns a new Table. There are no
arguments required for the constructor:
my $table=Hustle::Table->new;
In this case, a default catch all entry (an empty sub) is added
automatically.
Alternatively, a "sub" and "ctx" argument can be provided:
my $table=Hustle::Table->new(sub{}, $ctx);
This will overwrite the default entry in the table
The table is a blessed array reference and can be manipulated as such.
It's not recommended to access the table directly other than dumping the
contents to storage.
ENTRIES MANAGEMENT
STRUCTURE
An entry contains the following fields
matcher
"matcher" can be anything that smart-matching can handle. However
the focus on this module is on strings and regular expressions
"match exaclty this stirng"
qr|match and capture (this|that)|
When "matcher" is a regex, any capturing is accessible in the target
"sub" via the dynamically scoped numbered capture group variables
(i.e. $1 et al.)
sub The "sub" has access to any capture groups used in the matcher (if
applicable). It is also passed the arguments supplied to call the
dispatcher, except for the first. The first is replaced with the
matching entry in the table, to allow tracing.
$dispatcher->("my input","optional", "arguments", "to", "dispatched", "sub");
The return value of the sub indicates if the input matched is to be
removed from the cache.
label
"label" is a user defined item to allow identification of the entry.
This is useful for saving/loading from configuration files etc.
count
"count" This is a dual purpose attributes. When adding entries to
the list, it is used as a priority. Higher numeric values are a
higher priority. The list is sorted in descending order of priority,
meaning the highest priority is the first element in the list.
During running of the dispatch, this is a tally recording how many
times,the entry has been matched. This information can be then used
as a priority later to reorder the list.
ctx "ctx" Represents a user context which is applicable to the matcher.
It would usually be a weak reference but can be any scalar.
The dispatched subroutine has access to this value via the matcher
entry, which is the first argument to the called subroutine
ADDING
Entries are added in anonymous hash, anonymous array or flattened
format, using the "add" method.
Array entries must contain four elements, in the order of:
$table->add([$matcher, $sub, $label, $count, $ctx]);
Hashes ref format only need to specify the matcher and sub pairs
$table->add({matcher=>$matcher, sub=>$sub, label=>$label, count=>$count, ctx=>$ctx});
Single flattened format takes a list directly. It must contain 4
elements
$table->add(matcher=>$matcher, sub => $sub);
Single simple format takes two elements
$table->add(qr{some matcher}=>sub { say "run me" })
Or add multiple at once using mixed formats together
$table->add(
[$matcher, $sub, $label, $count, $ctx],
{matcher=> $matcher, sub=>$sub},
matcher=>$matcher, sub=>$sub
);
In any case,$matcher and $sub are the only items which must be defined.
$sub must also be a CODE reference.
If a label is not specified, one will be generated automatically. All
the labels used in a call to "add" are returned in the same order as the
input arguments.
Auto generated labels are only unique in the lifetime of a single table.
If permanent/globally unique labels are need, the user will need to
generate them.
REMOVING
Removal of entries is via the "remove" method. It takes a list of labels
to remove
$table->remove("label1", "funky-name");
It returns a list of all entries removed.
DUMPING
The table contents be accessed can like a normal array reference
$table->@*;
@$table;
It is not recommended to manipulate the entires directly
THE DEFAULT MATCHER
Each list has a default matcher that will unconditionally match the
input. This entry is specified by using "undef" as the matcher when
adding an entry. When set this way only the array format can be used.
To make it more explicit, the it can also be changed via the
"set_default" method.
The default subroutine of the 'default' entry does nothing.
PREPARING A DISPATCHER
Once all the entries required are added to the table, the dispatcher can
be constructed by calling "prepare_dispatcher":
my $dispatcher=$table->prepare_dispatcher(%args);
Arguments to this method include:
type
The type of dispatcher. Either "online" or "offline". If not
specified, or an invalid value is supplied, an "online" dispatcher
is created
online
Dispatcher which calls the dispatch vector, increases count
statistics. This is what you normally would want to use in your
code. Online mode.
offline
Dispatcher which only updates the count statistics. DOES NOT
call dispatch vector. Useful for running a input data set
through to obtain priority levels. Offline training mode.
cache
The hash ref to use as the dispatchers cache
reorder
Flag indicating the table should be reordered/optimised before
building the dispatcher.
reset
Flag specifying if counter statistics should be reset before
building a dispatcher
If no arguments are provided, the dispatcher will be created with the
following defaults:
my $dispatcher=$table->prepare_dispatcher(type=>"online", cache=>undef, reset=>undef, reorder=>1);
USING A DISPATCHER
The dispatcher is simply a sub. Any arguments passed to it are passed to
the dispatch vector as is.
$dispatcher->("input", "argument1",2);
PERFORMANCE
CACHING AND CACHE CONTROL
The return code of the dispatched sub is used to control if a successful
match should be removed from the cache, or not be removed;
Any 'true' value returned will means the input, as a key to the cache,
is to be removed from the cache.
Any 'false' value indicates the input is to remain in the cache or be
added if it doesn't exist.
{matcher => qr/r(.)gexp/, sub => sub { return }}; #Returns undef so not removed from cache.
{matcher => qr/re(.)gxp/, sub => sub { return 1}}; #Returns 'true' so removed from cache
{matcher => qr/reg(.)xp/, sub => sub { say $1;}}; #Returns true. Last statement is say and returns 1
OPTIMISATION
The concept is to perform less searching to find the dispatch vector.
However the type of input does play an important role in determining the
best way perform the search.
For example a uniformly distributed input will not gain benefits from
reordering the entries in the list.
On the other hand when the distributing becomes 'centred', the
reordering of the table can greatly improve the search time.
On my laptop, the simple benchmark script, (with 6 dispatch entries)
(regex and string) shows around 1.4M dispatches/s for non cached and
over 2M dispatches/s for cached dispatcher. Your results may vary.
COMPARISON TO OTHER MODULES
There a couple of other dispatching modules on CPAN. Notably
Smart::Dispatch has a nice syntax, probably more flexibility and the
ability to return values instead of just executing a sub.
This module is smaller/(much)faster in my basic tests and also supports
direct access to the regex capture groups and dispatch arguments.
BUGS
Probably.
Please report via github
TODO
* Write tests for removing entries
* Write tests for offline dispatcher
* Override array STORE and FETCH interface to ensure default is
preserved
* Add a more concrete performance discussion
* Document how to run offline dispatcher for optimising
AUTHOR
Ruben Westerberg, <drclaw@mac.com>
COPYRIGHT AND LICENSE
Copyright (C) 2021 by Ruben Westerberg
Licensed under MIT and GNU
DISCLAIMER OF WARRANTIES
THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.