NAME

DiaColloDB::Client::list - diachronic collocation db: client: distributed

DESCRIPTION

DiaColloDB::Client::list is a subclass of DiaColloDB::Client for accessing a set of distributed DiaColloDB databases via a list:// URL whose path part is a list of sub-URLs supported by DiaColloDB::Client, separated by whitespace or by "+" (see "List URLs"). It supports the DiaColloDB::Client API by calling the relevant methods on each of its sub-clients.

new() options and object structure:

 ##-- DiaColloDB::Client: options
 url  => $url,       ##-- list url (sub-urls, separated by whitespace, "+SCHEME://", or "+://")
 ##
 ##-- DiaColloDB::Client::list
 urls  => \@urls,     ##-- sub-urls
 opts  => \%opts,     ##-- sub-client options
 fudge => $fudge,     ##-- get ($fudge*$kbest) items from sub-clients (-1:all; 0|1:none; default=10)
 fork => $bool,       ##-- run each subclient query in its own fork? (default=if available)
 lazy => $bool,       ##-- use temporary on-demand sub-clients (true,default) or persistent sub-clients (false)
 extend => $bool,     ##-- use extend() queries to acquire correct f2 counts? (default=true)
 logFudge => $level,  ##-- log-level for fudge-coefficient debugging (default='debug')
 logThread => $level, ##-- log-level for thread operations (default='none')
 ##
 ##-- guts
 #clis => \@clis,     ##-- per-url sub-clients for "busy" (non-"lazy") mode

The most important client parameter is the fudge-coefficient option fudge=>$fudge, which requests that up to $fudge*$kbest items be retrieved from each sub-client for each profile() call. If $fudge < 0, all collocates will be retrieved from each sub-client, and trimming will be performed exclusively by the superordinate DiaColloDB::Client::list object. If $fudge == 0, only the $kbest collocates from each sub-client will be retrieved. The default value of 10 should return reasonable results without an excessive performance penalty in most cases, but be aware that the results for $fudge > 0 may not be strictly correct due to sub-client local pruning; see "KNOWN BUGS" for details.
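
The fudge rules above can be summarized in a few lines of Perl. This is only a sketch of the documented behavior, not the module's own code: the helper name sub_kbest() is hypothetical, and the return value -1 is used here merely as a marker for "retrieve everything". Only the cases spelled out above (negative, 0, 1, and larger coefficients) are covered.

```perl
use strict;
use warnings;

## Hypothetical helper (not part of DiaColloDB): number of items to request
## from each sub-client for a given fudge coefficient and $kbest.
## Returns -1 to mean "all collocates, no trimming by the sub-client".
sub sub_kbest {
  my ($fudge, $kbest) = @_;
  $fudge = 10 if !defined($fudge);               ##-- documented default
  return -1     if $fudge < 0;                   ##-- <0 : retrieve everything
  return $kbest if $fudge == 0 || $fudge == 1;   ##-- 0|1: no fudging
  return $fudge * $kbest;                        ##-- else: $fudge*$kbest
}

printf("fudge=%-5s kbest=%d -> request %d\n", $_ // 'undef', 10, sub_kbest($_, 10))
  foreach (undef, -1, 0, 1, 10);
```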

This module supports parallel processing of sub-client queries using whatever threading implementation (if any) is provided by the DiaColloDB::threads module. Parallel sub-client processing is enabled by default if a working threads or forks module was found by DiaColloDB::threads, but can be disabled by specifying the fork=>0 option to the list-client.

List URLs

List URLs passed as the url option to the constructor can be either ARRAY-refs of sub-URLs or simple strings with an optional list:// scheme. In the latter case, sub-URLs in the argument string are separated by whitespace or by a plus character ("+") followed by the sub-URL scheme, e.g.:

 ["file://a","file://b"]        ##-- ARRAY-ref of explicit file URLs
 ["a"       , "b"      ]        ##-- ARRAY-ref of implicit file URLs
 
 "list://file://a file://b"     ##-- string with space-separated explicit file URLs
 "list://a b"                   ##-- string with space-separated implicit file URLs
 
 "list://file://a+file://b"     ##-- list with "+"-separated explicit file URLs
 "list://a+://b"                ##-- list with "+"-separated implicit file URLs

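The string forms above can be split into sub-URLs with a single regular expression. The following is a minimal sketch of the separator rules as described here; the helper split_list_url() is hypothetical and is not claimed to match the module's own parser in every detail.

```perl
use strict;
use warnings;

## Hypothetical helper (not the module's own parser): split a list URL
## string into sub-URLs following the separator rules described above.
sub split_list_url {
  my ($url) = @_;
  $url =~ s{^list://}{};                     ##-- optional list:// scheme
  ##-- separators: whitespace, or "+" directly followed by "SCHEME://" or "://"
  my @urls = split(m{\s+|\+(?=[A-Za-z0-9.\-]*://)}, $url);
  s{^://}{} foreach @urls;                   ##-- "+://b" leaves "://b": drop the empty scheme
  return grep { $_ ne '' } @urls;            ##-- ignore empty fields
}

print join(" | ", split_list_url($_)), "\n"
  foreach ("list://file://a file://b", "list://a+://b", "list://file://a+file://b");
```

Because the "+" separator is only recognized when a scheme marker follows it, a literal "+" elsewhere in a sub-URL is left alone in this sketch.
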
Options can be passed to the appropriate sub-URLs via those URLs' query strings, as described in "open" in DiaColloDB::Client. Options to the DiaColloDB::Client::list object itself can be passed in by using a sub-URL consisting of a HASH-ref or only a query string, e.g.:

 ["a","b",{fudge=>0}]           ##-- ARRAY-ref with local options as HASH-ref
 ["a","b","?fudge=0"]           ##-- ARRAY-ref with local options as query-string
 
 "list://a b ?fudge=0"          ##-- space-separated string with local options
 "list://a+://b+://?fudge=0"    ##-- "+"-separated string with local options

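To illustrate how local options might be told apart from real sub-URLs, here is a small sketch operating on the ARRAY-ref forms shown above. The helper partition_list_items() is hypothetical, and the query-string handling is deliberately simplistic (no URL-decoding), so treat it as an illustration of the convention rather than the module's implementation.

```perl
use strict;
use warnings;

## Hypothetical helper (not part of DiaColloDB): separate list-client
## options from sub-URLs in an ARRAY-ref like those shown above.
sub partition_list_items {
  my ($items) = @_;
  my (@urls, %opts);
  foreach my $item (@$items) {
    if (ref($item) eq 'HASH') {
      %opts = (%opts, %$item);                ##-- HASH-ref: local options
    }
    elsif ($item =~ m{^\?(.*)$}) {
      ##-- bare query string: local options (no URL-decoding in this sketch)
      foreach my $pair (split(/[&;]/, $1)) {
        my ($key, $val) = split(/=/, $pair, 2);
        $opts{$key} = $val if defined($key) && $key ne '';
      }
    }
    else {
      push(@urls, $item);                     ##-- anything else: sub-URL
    }
  }
  return (\@urls, \%opts);
}

my ($urls, $opts) = partition_list_items(["a", "b", "?fudge=0"]);
print "urls: @$urls; fudge: $opts->{fudge}\n";
```
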
KNOWN BUGS

Incorrect Independent Collocate Frequencies

Prior to the introduction of extend() queries in DiaColloDB v0.11.000, list-clients were liable to return incorrect independent collocate frequencies f2 whenever the queried subcorpora were not partitioned explicitly by date, even with $fudge=-1. Although the reported joint frequencies f12 ought to have been correct in this case, the independent collocate frequencies f2 could easily be mis-reported, leading to incorrect computations of f2-sensitive association scores such as milf (pointwise mutual information * log-frequency product), ll (log likelihood), or the default ld (log Dice). Such errors occurred whenever the pre-v0.11.000 list client accessed multiple sub-clients (e.g. $a and $b) and some candidate collocate $v occurred in both of the subcorpora, but only occurred together with the target term $w in one of the sub-clients' indices.

Suppose $v occurs in subcorpus $a with frequency f_a($v) and in subcorpus $b with frequency f_b($v), but only occurs together with $w in subcorpus $a with frequency f_a($w,$v); i.e. f_b($w,$v)==0. Since only collocates with nonzero co-occurrence frequencies are collected in subcorpus profiles, the sub-profile for $w over subcorpus $b will not contain an entry for $v at all. This is fine if we are only interested in the total co-occurrence frequency f($w,$v) = f_a($w,$v) + f_b($w,$v). If we are using an "interesting" association score, however, we also need the total independent collocate frequency f($v) = f_a($v) + f_b($v). Since f_b($v) will not have been reported by the sub-profile for corpus $b, its value will be treated as 0 (zero), leading to an incorrect estimate of the association score.
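
The effect on an association score can be made concrete with invented numbers. The sketch below uses the textbook logDice formula, 14 + log2(2*f12/(f1+f2)); this is an assumption for illustration, and DiaColloDB's own ld score may differ in detail (e.g. by smoothing). All frequencies are made up.

```perl
use strict;
use warnings;

## Textbook logDice score: 14 + log2(2*f12 / (f1+f2)).
## Assumption: DiaColloDB's own "ld" may differ in detail (e.g. smoothing).
sub log_dice {
  my ($f1, $f2, $f12) = @_;
  return 14 + log(2 * $f12 / ($f1 + $f2)) / log(2);
}

## Invented frequencies for target $w and collocate $v in subcorpora a and b.
my %f_w  = (a => 100, b => 50);    ##-- f_a($w), f_b($w)
my %f_v  = (a => 40,  b => 960);   ##-- f_a($v), f_b($v)
my %f_wv = (a => 20,  b => 0);     ##-- f_a($w,$v), f_b($w,$v)

my $f1  = $f_w{a}  + $f_w{b};      ##-- total target frequency f($w)
my $f12 = $f_wv{a} + $f_wv{b};     ##-- total joint frequency f($w,$v)

my $f2_naive   = $f_v{a};              ##-- f_b($v) never reported: treated as 0
my $f2_correct = $f_v{a} + $f_v{b};    ##-- true total f($v)

my $ld_naive   = log_dice($f1, $f2_naive,   $f12);
my $ld_correct = log_dice($f1, $f2_correct, $f12);
printf("naive ld=%.2f, correct ld=%.2f\n", $ld_naive, $ld_correct);
```

With f2 underestimated, the naive score comes out higher than the correct one, so $v would appear more strongly associated with $w than it really is.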

As of v0.11.000, each sub-client is queried a second time using its extend() method to acquire independent collocate frequencies for "missing" keys such as $v in the example above. This second pass introduces additional processing overhead; it can be disabled by passing the extend=>0 option to the list-client, which restores the old, incorrect, pre-v0.11 behavior.
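
The two-pass idea can be simulated on toy data. The sketch below does not call DiaColloDB at all: the %corpora table, the key names, and the merge logic are all invented for illustration, and the second pass merely imitates what an extend()-style query is described as providing.

```perl
use strict;
use warnings;

## Invented per-subcorpus data: independent frequencies f($v) and
## joint frequencies f($w,$v) for a fixed target $w.
my %corpora = (
  a => { f2 => { v => 40,  u => 5 }, f12 => { v => 20, u => 2 } },
  b => { f2 => { v => 960, u => 7 }, f12 => { u => 3 } },
);

## Pass 1: each sub-profile reports only collocates with nonzero f12.
my %sub;
foreach my $name (sort keys %corpora) {
  my $c = $corpora{$name};
  $sub{$name} = { map { ($_ => { f2 => $c->{f2}{$_}, f12 => $c->{f12}{$_} }) } keys %{ $c->{f12} } };
}

## Union of all collocate keys seen in any sub-profile.
my %seen = map { ($_ => 1) } map { (keys %$_) } values %sub;

## Pass 2 (extend-like): fetch f2 for keys a subcorpus did not report.
foreach my $name (sort keys %sub) {
  foreach my $key (grep { !exists $sub{$name}{$_} } sort keys %seen) {
    $sub{$name}{$key} = { f2 => $corpora{$name}{f2}{$key} // 0, f12 => 0 };
  }
}

## Merge: sum f2 and f12 over all sub-profiles.
my %total;
foreach my $prf (values %sub) {
  foreach my $key (keys %$prf) {
    $total{$key}{$_} += $prf->{$key}{$_} foreach qw(f2 f12);
  }
}
printf("%s: f2=%d f12=%d\n", $_, @{ $total{$_} }{qw(f2 f12)}) foreach sort keys %total;
```

Skipping pass 2 in this simulation leaves the total f2 for "v" at the value reported by subcorpus a alone, which is the pre-v0.11 error described above.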

Incorrect Joint Frequencies

Similar to the case for independent collocate frequencies, the joint frequencies f12 reported by this module prior to v0.12.016 were incorrect whenever the pre-v0.12.016 list client accessed multiple sub-clients (e.g. $a and $b), and some candidate collocate $v occurred together with the target term $w in both sub-clients' indices, but was among the $fudge*$kbest items per epoch for only one of the sub-clients.

The extend() method was re-implemented in v0.12.016 to perform a full profile() on the "missing" candidate collocates, ensuring correct acquisition of both joint and independent collocate frequencies.

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2020 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

DiaColloDB::Client(3pm), DiaColloDB(3pm), perl(1), ...