DiaColloDB::Client::list - diachronic collocation db: client: distributed
DiaColloDB::Client::list is a subclass of DiaColloDB::Client for accessing a set of distributed DiaColloDB databases via a list:// URL whose path part is a whitespace- or "+"-separated list of sub-URLs supported by DiaColloDB::Client. It supports the DiaColloDB::Client API by calling the relevant methods on each of its sub-clients.
new() options and object structure:
    ##-- DiaColloDB::Client: options
    url       => $url,    ##-- list url (sub-urls, separated by whitespace, "+SCHEME://", or "+://")
    ##
    ##-- DiaColloDB::Client::list
    urls      => \@urls,  ##-- sub-urls
    opts      => \%opts,  ##-- sub-client options
    fudge     => $fudge,  ##-- get ($fudge*$kbest) items from sub-clients (-1:all; 0|1:none; default=10)
    fork      => $bool,   ##-- run each subclient query in its own fork? (default=if available)
    lazy      => $bool,   ##-- use temporary on-demand sub-clients (true,default) or persistent sub-clients (false)
    extend    => $bool,   ##-- use extend() queries to acquire correct f2 counts? (default=true)
    logFudge  => $level,  ##-- log-level for fudge-coefficient debugging (default='debug')
    logThread => $level,  ##-- log-level for thread operations (default='none')
    ##
    ##-- guts
    #clis     => \@clis,  ##-- per-url sub-clients for "busy" (non-"lazy") mode
The most important client parameter is the fudge-coefficient option fudge=>$fudge, which requests that up to $fudge*$kbest items be retrieved from sub-clients for each profile() call. If $fudge < 0, all collocates will be retrieved from each sub-client, and trimming will be performed exclusively by the superordinate DiaColloDB::Client::list object. If $fudge == 0, only the $kbest collocates from each sub-client will be retrieved. The default value of 10 should return reasonable results without too large a performance penalty in most cases, but be aware that the results for $fudge > 0 may not be strictly correct due to sub-client local pruning; see the discussion of incorrect collocate frequencies below for details.
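The effect of the fudge coefficient on the per-sub-client request size can be sketched as follows. The helper items_requested() is hypothetical and not part of DiaColloDB; it only restates the rules given above, using -1 as an assumed placeholder for "all collocates".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DiaColloDB): how many items a list-client
# would request from each sub-client for a given fudge coefficient.
sub items_requested {
    my ($fudge, $kbest) = @_;
    return -1     if $fudge < 0;    # -1 stands for "all collocates" here
    return $kbest if $fudge <= 1;   # 0 or 1: no fudging, just $kbest items
    return $fudge * $kbest;         # otherwise up to $fudge*$kbest items
}

print join(", ", map { items_requested($_, 10) } (-1, 0, 1, 10)), "\n";
```

With the default fudge=>10 and a $kbest of 10, each sub-client would be asked for up to 100 items under these assumptions.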
This module supports parallel processing of sub-client queries using whatever threading implementation (if any) is provided by the DiaColloDB::threads module. Parallel sub-client processing is enabled by default if a working threads or forks module was found by DiaColloDB::threads, but can be disabled by specifying the fork=>0 option to the list-client.
List URLs passed as the url option to the constructor can be either ARRAY-refs of sub-URLs or simple strings with an optional list:// scheme. In the latter case, sub-URLs in the argument string are separated by whitespace or by a plus character ("+") followed by the sub-URL scheme, e.g.:
    ["file://a","file://b"]       ##-- ARRAY-ref of explicit file URLs
    ["a" , "b" ]                  ##-- ARRAY-ref of implicit file URLs
    "list://file://a file://b"    ##-- string with space-separated explicit file URLs
    "list://a b"                  ##-- string with space-separated implicit file URLs
    "list://file://a+file://b"    ##-- list with "+"-separated explicit file URLs
    "list://a+://b"               ##-- list with "+"-separated implicit file URLs
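The separator rules above can be illustrated with a small stand-alone splitter. This is a minimal sketch, not the module's own parser: the function split_list_url() and its regular expression are assumptions made for illustration, and the handling of a bare "://" prefix as the implicit form is likewise assumed.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's own parser): split a list URL into sub-URLs.
sub split_list_url {
    my ($url) = @_;
    return @$url if ref($url) eq 'ARRAY';   # ARRAY-refs are already split
    (my $path = $url) =~ s{^list://}{};     # the list:// scheme is optional
    # separators: whitespace, or "+" followed by "SCHEME://" or a bare "://"
    my @subs = split m{\s+|\+(?=(?:[A-Za-z][A-Za-z0-9.\-]*)?://)}, $path;
    s{^://}{} for @subs;                    # assume "+://b" denotes the implicit form "b"
    return grep { $_ ne '' } @subs;
}

print join(" | ", split_list_url($_)), "\n"
    for "list://file://a file://b", "list://a+://b", "list://file://a+file://b";
```

A lookahead is used for the "+" separator so that the scheme following it stays attached to its sub-URL.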
Options can be passed to the appropriate sub-URLs via those URLs' query strings, as described in "open" in DiaColloDB::Client. Options to the DiaColloDB::Client::list object itself can be passed in by using a sub-URL consisting of a HASH-ref or only a query string, e.g.:
    ["a","b",{fudge=>0}]           ##-- ARRAY-ref with local options as HASH-ref
    ["a","b","?fudge=0"]           ##-- ARRAY-ref with local options as query-string
    "list://a b ?fudge=0"          ##-- space-separated string with local options
    "list://a+://b+://?fudge=0"    ##-- "+"-separated string with local options
Prior to the introduction of extend() queries in DiaColloDB v0.11.000, list-clients were apt to return incorrect independent collocate frequencies f2 whenever the queried subcorpora were not partitioned explicitly by date, even with $fudge=-1. Although the reported joint frequencies f12 ought to have been correct in this case, it could easily happen that the independent collocate frequencies f2 were mis-reported, leading to incorrect computations of f2-sensitive association scores such as milf (pointwise mutual information * log-frequency product), ll (log likelihood), or the default ld (log Dice). Such errors occurred whenever the pre-v0.11.000 list client accessed multiple sub-clients (e.g. $a and $b) and some candidate collocate $v occurred in both of the subcorpora, but only occurred together with the target term $w in one of the sub-clients' indices.
Suppose $v occurs in subcorpus $a with frequency f_a($v) and in subcorpus $b with frequency f_b($v), but only occurs together with $w in subcorpus $a, with frequency f_a($w,$v); i.e. f_b($w,$v)==0. Since only collocates with nonzero co-occurrence frequencies are collected in subcorpus profiles, the sub-profile for $w over subcorpus $b will not contain an entry for $v at all. This is fine if we are only interested in the total co-occurrence frequency f($w,$v) = f_a($w,$v) + f_b($w,$v). If we are using an "interesting" association score, however, we also need the total independent collocate frequency f($v) = f_a($v) + f_b($v). Since f_b($v) will not have been reported by the sub-profile for corpus $b, its value will be treated as 0 (zero), leading to an incorrect estimate of the association score.
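The scenario above can be made concrete with toy numbers. All counts below are invented for illustration, and a plain Dice coefficient is used as a simple stand-in for an f2-sensitive score; it is not claimed to be the exact formula DiaColloDB uses for ld.

```perl
use strict;
use warnings;

# Toy counts (invented for illustration): target $w, candidate collocate $v.
my %sub_a = (f1 => 20, f2 => 10, f12 => 5);  # subcorpus a: $v co-occurs with $w
my %sub_b = (f1 => 20, f2 => 30, f12 => 0);  # subcorpus b: $v occurs, but never with $w

# Plain Dice coefficient as a stand-in for an f2-sensitive association score.
sub dice { my ($f1, $f2, $f12) = @_; return 2 * $f12 / ($f1 + $f2); }

my $f1  = $sub_a{f1}  + $sub_b{f1};
my $f12 = $sub_a{f12} + $sub_b{f12};

# Sub-profile b has no entry for $v, so a naive merge sees f_b($v) as 0.
my $f2_naive   = $sub_a{f2} + 0;
my $f2_correct = $sub_a{f2} + $sub_b{f2};

printf "naive:   f2=%d dice=%.3f\n", $f2_naive,   dice($f1, $f2_naive,   $f12);
printf "correct: f2=%d dice=%.3f\n", $f2_correct, dice($f1, $f2_correct, $f12);
```

Because the naive f2 is too small, the naive score overestimates the association between $w and $v in this example.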
As of v0.11.000, each sub-client is queried a second time using its extend() method to acquire independent collocate frequencies for "missing" keys such as $v in the example above. This introduces additional processing overhead, which can be avoided by passing the extend=>0 option to the list-client to simulate the old, incorrect, pre-v0.11 behavior.
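One way to picture which keys count as "missing" is sketched below. The data layout and variable names are hypothetical and do not reflect the module's internal profile structures; the sketch only computes, for each sub-client, the collocates reported by some other sub-client but not by it.

```perl
use strict;
use warnings;

# Toy sub-profiles (invented data): collocate => [f2, f12] as reported by each sub-client.
my %prf = (
    a => { v => [10, 5], x => [7, 2] },
    b => { x => [4, 1] },               # no entry for "v": f_b(w,v) == 0
);

# Union of all collocate keys seen by any sub-client.
my %all = map { ($_ => 1) } map { keys %$_ } values %prf;

# For each sub-client, the keys it did not report; these would need a follow-up query.
my %missing;
for my $cli (sort keys %prf) {
    $missing{$cli} = [ grep { !exists $prf{$cli}{$_} } sort keys %all ];
}

print "$_: @{ $missing{$_} }\n" for sort keys %missing;
```

In this toy case only sub-client b would need a follow-up query, namely for the key "v".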
Similar to the case for independent collocate frequencies, the joint frequencies f12 reported by this module prior to v0.12.016 were incorrect whenever the pre-v0.12.016 list client accessed multiple sub-clients (e.g. $a and $b), and some candidate collocate $v occurred together with the target term $w in both sub-clients' indices, but was among the $fudge*$kbest items per epoch for only one of the sub-clients.
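The effect of local pruning on joint frequencies can be reproduced with a toy merge. The data and the helpers top_n() and merge() are invented for illustration and are not the module's implementation; ranking by raw f12 is an assumed simplification of whatever score the sub-clients actually prune by.

```perl
use strict;
use warnings;

# Toy joint frequencies f12 per sub-client (invented data).
my %f12 = (
    a => { v => 8, x => 5, y => 1 },
    b => { x => 7, y => 6, v => 4 },
);

# Keep only the $n most frequent collocates of one sub-profile (local pruning).
sub top_n {
    my ($prf, $n) = @_;
    my @keys = sort { $prf->{$b} <=> $prf->{$a} || $a cmp $b } keys %$prf;
    splice(@keys, $n) if @keys > $n;
    my %kept = map { ($_ => $prf->{$_}) } @keys;
    return \%kept;
}

# Sum f12 over sub-profiles, with local pruning to $n items if $n is given.
sub merge {
    my ($n) = @_;
    my %sum;
    for my $prf (values %f12) {
        my $kept = defined($n) ? top_n($prf, $n) : $prf;
        $sum{$_} += $kept->{$_} for keys %$kept;
    }
    return \%sum;
}

printf "f12(v): pruned=%d full=%d\n", merge(2)->{v}, merge()->{v};
```

Here "v" survives pruning only in sub-client a, so the pruned merge under-reports its joint frequency relative to the unpruned sum.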
The extend() method was re-implemented in v0.12.016 to perform a full profile() on the "missing" candidate collocates, ensuring correct acquisition of both joint and independent collocate frequencies.
Bryan Jurish <moocow@cpan.org>
Copyright (C) 2015-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
DiaColloDB::Client(3pm), DiaColloDB(3pm), perl(1), ...