NAME

DiaColloDB::Relation::TDF - collocation db, profiling relation: (term x document) raw-frequency matrix

SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use DiaColloDB::Relation::TDF;
 
 ##========================================================================
 ## Constructors etc.
 
 $rel = CLASS_OR_OBJECT->new(%args);
 
 ##========================================================================
 ## TDF API: Utils
 
 $vtype = $rel->vtype();
 $itype = $rel->itype();
 $packas = $rel->vpack();
 $packas = $rel->ipack();
 
 ##========================================================================
 ## Persistent API: disk usage
 
 @files = $rel->diskFiles();
 
 ##========================================================================
 ## Persistent API: header
 
 @keys = $rel->headerKeys();
 $hdr = $rel->headerData();
 
 ##========================================================================
 ## Relation API: open/close
 
 $rel_or_undef = $rel->open($base);
 $rel_or_undef = $rel->close();
 $bool = $rel->opened();
 
 ##========================================================================
 ## Relation API: creation
 
 $rel = CLASS_OR_OBJECT->create($coldb,$tokdat_file,%opts);
 $rel = CLASS_OR_OBJECT->union($coldb, \@dbargs, %opts);
 
 ##========================================================================
 ## Relation API: info
 
 \%info = $rel->dbinfo($coldb);
 
 ##========================================================================
 ## Relation API: profiling
 
 $mprf   = $rel->profile($coldb, %opts);
 $mprf   = $rel->extend($coldb, %opts);
 $mpdiff = $rel->compare($coldb, %opts);
 
 ##========================================================================
 ## Profile: Utils: PDL-based profiling
 
 $mprf = $rel->vprofile($coldb, \%opts);
 
 ##========================================================================
 ## Profile: Utils: domain sizes
 
 $NT = $rel->nTerms();
 $ND = $rel->nDocs();
 $NC = $rel->nFiles();
 $NA = $rel->nAttrs();
 $NM = $rel->nMeta();
 
 ##========================================================================
 ## Profile: Utils: attribute positioning
 
 \%tpos = $rel->tpos();
 \%mpos = $rel->mpos();
 
 ##========================================================================
 ## Profile: Utils: query parsing & evaluation
 
 $idPdl    = $rel->idpdl($idPdl);
 $tupleIds = $rel->tupleIds($attrType, $attrName, $valIdsPdl);
 $ti       = $rel->termIds($tattrName, $valIdsPDL);
 $ci       = $rel->catIds($mattrName, $valIdsPDL);
 
 $bool          = $rel->hasMeta($attr);
 $enum_or_undef = $rel->metaEnum($mattr);
 
 $cats          = $rel->catSubset($terms);
 
 \%groupby      = $rel->groupby($coldb, $groupby_request, %opts);
 
 ##========================================================================
 ## Relation API: default: query info
 
 \%qinfo = $rel->qinfo($coldb, %opts);

DESCRIPTION

DiaColloDB::Relation::TDF is a DiaColloDB::Relation subclass for document-level co-occurrence frequencies. It uses PDL, via the PDL::CCS package, to efficiently store and query an underlying sparse (term x document) frequency matrix.

Supports Boolean expressions over both term- and document-level conditions (the latter via DDC #has[ATTRIBUTE,VALUE] or #has[ATTRIBUTE,/REGEX/] syntax) as well as grouping via literal indexed term- and/or document-level attributes.

An earlier version of this module was implemented as DiaColloDB::Relation::Vsem ("vector-space distributional semantic index").

Globals & Constants

Variable: @ISA

DiaColloDB::Relation::TDF inherits from DiaColloDB::Relation.

Constructors etc.

new
 $rel = CLASS_OR_OBJECT->new(%args);

%args, object structure:

 ##-- user options
 base   => $basename,   ##-- relation basename
 flags  => $flags,      ##-- i/o flags (default: 'r')
 mgood  => $regex,      ##-- positive filter regex for metadata attributes
 mbad   => $regex,      ##-- negative filter regex for metadata attributes
 submax => $submax,     ##-- choke on requested tdm cross-subsets if dense subset size ($NT_sub * $ND_sub) > $submax; default=2**29 (512M)
 mquery => \%mquery,    ##-- qinfo templates for meta-fields (default: textClass hack for genre): ($mattr=>$TEMPLATE, ...)
 ##
 ##-- logging options
 logvprofile => $level, ##-- log-level for vprofile() (default=undef:none)
 logio => $level,       ##-- log-level for low-level I/O operations (default=undef:none)
 ##
 ##-- modelling options (formerly via DocClassify)
 minFreq    => $fmin,   ##-- minimum total term-frequency for model inclusion (default=undef:use $coldb->{tfmin})
 minDocFreq => $dfmin,  ##-- minimum "doc-frequency" (#/docs per term) for model inclusion (default=4)
 minDocSize => $dnmin,  ##-- minimum doc size (#/tokens per doc) for model inclusion (default=4; formerly $coldb->{vbnmin})
 maxDocSize => $dnmax,  ##-- maximum doc size (#/tokens per doc) for model inclusion (default=inf; formerly $coldb->{vbnmax})
 vtype      => $vtype,  ##-- PDL::Type for storing compiled values (default=float; auto-promoted if required)
 itype      => $itype,  ##-- PDL::Type for storing compiled integers (default=long)
 ##
 ##-- guts: aux: info
 N => $tdm0Total,       ##-- total number of (doc,term) frequencies counted
 dbreak => $dbreak,     ##-- inherited from $coldb on create()
 ##
 ##-- guts: aux: term-tuples ($NA:number of term-attributes, $NT:number of term-tuples)
 attrs  => \@attrs,       ##-- known term attributes
 tvals  => $tvals,        ##-- pdl($NA,$NT) : [$apos,$ti] => $avali_at_term_ti
 tsorti => $tsorti,       ##-- pdl($NT,$NA) : [,($apos)]  => $tvals->slice("($apos),")->qsorti
 tpos   => \%a2pos,       ##-- term-attribute positions: $apos=$a2pos{$aname}
 ##
 ##-- guts: aux: metadata ($NM:number of meta-attributes, $NC:number of cats (source files))
 meta => \@mattrs,        ##-- known metadata attributes
 meta_e_${ATTR} => $enum, ##-- metadata-attribute enum
 mvals => $mvals,         ##-- pdl($NM,$NC) : [$mpos,$ci] => $mvali_at_ci
 msorti => $msorti,       ##-- pdl($NC,$NM) : [,($mpos)]  => $mvals->slice("($mpos),")->qsorti
 mpos  => \%m2pos,        ##-- meta-attribute positions: $mpos=$m2pos{$mattr}
 ##
 ##-- guts: model (formerly via DocClassify dcmap=>$dcmap)
 tdm => $tdm,             ##-- term-doc matrix : PDL::CCS::Nd ($NT,$ND): [$ti,$di] -> f($ti,$di)
 tym => $tym,             ##-- term-year matrix: PDL::CCS::Nd ($NT,$NY): [$ti,$yi] -> f($ti,$yi)
 cf  => $cf_pdl,          ##-- cat-freqs   : dense ($NC)   : [$ci]   -> f($ci)
 c2date => $c2date,       ##-- cat-dates   : dense ($NC)   : [$ci]   -> $date
 c2d    => $c2d,          ##-- cat->doc map: dense (2,$NC) : [*,$ci] -> [$di_off,$di_len]
 d2c    => $d2c,          ##-- doc->cat map: dense ($ND)   : [$di]   -> $ci
 #...
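
The size check described for the submax option can be illustrated in plain Perl. This is only a sketch of the documented condition ($NT_sub * $ND_sub > $submax); subset_too_large() is a hypothetical helper, not a DiaColloDB method, and the actual implementation may measure subset sizes differently.

```perl
use strict;
use warnings;

## hypothetical helper (not part of DiaColloDB): would a dense
## ($nt_sub x $nd_sub) cross-subset exceed the configured limit?
sub subset_too_large {
  my ($nt_sub, $nd_sub, $submax) = @_;
  $submax //= 2**29;   ##-- documented default (512M cells)
  return ($nt_sub * $nd_sub) > $submax ? 1 : 0;
}

print subset_too_large( 1_000, 100_000), "\n";   ##-- 1e8 cells: prints 0
print subset_too_large(50_000, 100_000), "\n";   ##-- 5e9 cells: prints 1
```

A subset of exactly $submax cells passes the check in this sketch, since the documented comparison is strict.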

TDF API: Utils

vtype
 $vtype = $rel->vtype();

get PDL::Type value type for storing compiled values.

itype
 $itype = $rel->itype();

get PDL::Type integer type for storing compiled indices.

vpack
 $packas = $rel->vpack();

pack-template for $rel->vtype(), e.g. "f*"

ipack
 $packas = $rel->ipack();

pack-template for $rel->itype(), e.g. "l*"
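
For orientation, the example templates "f*" and "l*" behave as follows with Perl's builtin pack() and unpack(). This sketch uses only core Perl and made-up sample data; it does not call vpack() or ipack(), whose return values depend on the relation's configured types.

```perl
use strict;
use warnings;

my @freqs = (1, 2.5, 40);        ##-- sample values, as for vtype=float
my @ids   = (0, 7, 65536);       ##-- sample indices, as for itype=long

my $vbuf = pack('f*', @freqs);   ##-- native single-precision floats
my $ibuf = pack('l*', @ids);     ##-- signed 32-bit integers

my @freqs2 = unpack('f*', $vbuf);
my @ids2   = unpack('l*', $ibuf);

printf "%d bytes of values, %d bytes of indices\n", length($vbuf), length($ibuf);
print "@freqs2 | @ids2\n";
```

The sample values were chosen to be exactly representable in single precision, so they survive the round trip unchanged; arbitrary values packed as "f" may lose precision.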

Persistent API: disk usage

diskFiles
 @files = $rel->diskFiles();

returns disk storage files, used by du() and timestamp()

Persistent API: header

headerKeys
 @keys = $rel->headerKeys();

keys to save as header; override includes qw(meta attrs vtype itype) and excludes logging and i/o keys.

headerData
 $hdr = $rel->headerData();

returns reference to object header data; override stringifies {itype} and {vtype} keys.

Relation API: open/close

open
 $rel_or_undef = $rel->open($base);
 $rel_or_undef = $rel->open($base,$flags);
 $rel_or_undef = $rel->open();

Opens underlying index files.

close
 $rel_or_undef = $rel->close();

Closes underlying index files.

opened
 $bool = $rel->opened();

Returns true iff index is opened. Really just checks for $rel->{tdm}.

Relation API: creation

create
 $rel = CLASS_OR_OBJECT->create($coldb,$tokdat_file,%opts);

Populates relation index for $coldb. Requires:

  • (temporary, tied) doc-arrays @$coldb{qw(docmeta docoff)}

  • temp file "$coldb->{dbdir}/vtokens.bin": pack($coldb->{pack_w}, @wattrs)

    OR

    wdmfile=>$wdmfile option

%opts: clobber %$rel, also:

 docmeta =>\@docmeta, ##-- for union(): override $coldb->{docmeta}
                      ##   $docmeta[$ci] = {id=>$id, nsigs=>$nsigs, file=>$rawfile, date=>$date, label=>$label, meta=>\%meta}
 wdmfile =>$wdmfile,  ##-- for union(): txt ~ "$ai0 $ai1 ... $aiN $doci $f"; default is generated from 'vtokens.bin'
 ivalmax =>$imax,     ##-- for union(): maximum integer value (for auto-promotion)
 reusedir=>$bool,     ##-- for union(): set to true if we're running in a "clean" directory
 logas   =>$logas,    ##-- log label (default: 'create()')

union
 $rel = CLASS_OR_OBJECT->union($coldb, \@dbargs, %opts);

Merges multiple TDF indices into a new object. \@dbargs is an ARRAY-ref of DiaColloDB sub-objects ($coldb,...) containing {tdf} relations to be merged.

%opts: clobber %$rel

Current implementation just creates temp-files utdm0.dat and udocmeta.tmp and then calls create().
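
The wdmfile option of create() describes a text format with one record per line: "$ai0 $ai1 ... $aiN $doci $f". A reader for such lines might look like the sketch below, assuming whitespace-separated fields; parse_wdm_line() is a hypothetical helper for illustration and is not part of DiaColloDB.

```perl
use strict;
use warnings;

## hypothetical parser for one wdmfile-style line;
## assumes whitespace-separated fields
sub parse_wdm_line {
  my $line   = shift;
  my @fields = split(' ', $line);
  die "too few fields in line '$line'" if @fields < 3;
  my $f    = pop @fields;   ##-- last field: frequency
  my $doci = pop @fields;   ##-- next-to-last field: document index
  ##-- remaining fields: attribute value IDs $ai0..$aiN
  return { attrs => \@fields, doc => $doci, f => $f };
}

my $rec = parse_wdm_line("12 7 3 42 5\n");
print "attrs=(@{$rec->{attrs}}) doc=$rec->{doc} f=$rec->{f}\n";
```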

Relation API: info

dbinfo
 \%info = $rel->dbinfo($coldb);

embedded info-hash for $coldb->dbinfo()

Relation API: profiling

profile
 $mprf = $rel->profile($coldb, %opts);

Get a relation profile for selected items as a DiaColloDB::Profile::Multi object. %opts are as for DiaColloDB::Relation::profile(). Really just a wrapper for the vprofile() method.

extend
 $mprf = $rel->extend($coldb, %opts);

Get independent f2 frequencies for $opts{slice2keys} as a DiaColloDB::Profile::Multi object.

compare
 $mpdiff = $rel->compare($coldb, %opts);

Get a relation comparison profile for selected items as a DiaColloDB::Profile::MultiDiff object. %opts are as for DiaColloDB::Relation::compare(), which this method calls after parsing the groupby option via $rel->groupby($coldb, $opts{groupby}, relax=>0).

Profile: Utils: PDL-based profiling

vprofile
 \@pprfs = $rel->vprofile($coldb, \%opts);

Guts for the profile() method. User options in %opts are as for DiaColloDB::Relation::profile(). Additional keys are populated and used in the course of the computation (so don't set them):

 vq      => $vq,        ##-- parsed query, DiaColloDB::Relation::TDF::Query object
 groupby => \%groupby,  ##-- as returned by $rel->groupby($coldb, \%opts)
 dlo     => $dlo,       ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
 dhi     => $dhi,       ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
 dslo    => $dslo,      ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
 dshi    => $dshi,      ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);

Profile: Utils: domain sizes

nTerms
 $NT = $rel->nTerms();

returns number of indexed terms.

nDocs
 $ND = $rel->nDocs();

returns number of indexed documents (breaks).

nFiles
 $NC = $rel->nFiles();

returns number of indexed categories (original source files).

nAttrs
 $NA = $rel->nAttrs();

returns number of indexed term-attributes.

nMeta
 $NM = $rel->nMeta();

returns number of indexed meta-attributes.
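
The relationship between documents (breaks) and categories (source files) counted above follows the c2d and d2c maps listed under new(). The sketch below models them with plain Perl arrays rather than PDLs, on toy data, assuming that the documents of each category occupy one contiguous index range as the [$di_off,$di_len] layout suggests.

```perl
use strict;
use warnings;

## toy cat->doc map in the documented layout: $c2d[$ci] = [$di_off, $di_len]
my @c2d = ([0,2], [2,1], [3,3]);   ##-- 3 cats (files), 6 docs (breaks)

## derive the inverse doc->cat map: $d2c[$di] = $ci
my @d2c;
for my $ci (0..$#c2d) {
  my ($off, $len) = @{$c2d[$ci]};
  $d2c[$_] = $ci for ($off .. $off+$len-1);
}

printf "NC=%d ND=%d d2c=(%s)\n", scalar(@c2d), scalar(@d2c), join(' ', @d2c);
## prints: NC=3 ND=6 d2c=(0 0 1 2 2 2)
```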

Profile: Utils: attribute positioning

tpos
 \%tpos = $rel->tpos();
 $tpos  = $rel->tpos($tattr);

In the first form, get or build the term-attribute position lookup hash. In the second form, get the index position along dimension $NA of the term-attribute named $tattr, or undef if $tattr is not a known term attribute.

mpos
 \%mpos = $rel->mpos();
 $mpos  = $rel->mpos($mattr);

In the first form, get or build the meta-attribute position lookup hash. In the second form, get the index position along dimension $NM of the meta-attribute named $mattr, or undef if $mattr is not a known metadata attribute.
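
The two call forms shared by tpos() and mpos() can be mimicked in plain Perl as follows. This is a sketch only: the attribute names and the tpos_sketch() helper are hypothetical, and the real methods operate on the relation object's {attrs} and {meta} lists.

```perl
use strict;
use warnings;

my @attrs = qw(l p);   ##-- hypothetical term attributes

## name -> position lookup, analogous to the {tpos} hash
my %a2pos;
@a2pos{@attrs} = (0..$#attrs);

## hypothetical helper mirroring the two documented call forms
sub tpos_sketch {
  my $tattr = shift;
  return defined($tattr) ? $a2pos{$tattr} : \%a2pos;
}

print "p is at position ", tpos_sketch('p'), "\n";
print "unknown attribute\n" if !defined(tpos_sketch('nosuch'));
```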

Profile: Utils: query parsing & evaluation

idpdl
 $idPdl = $rel->idpdl($idPdl);
 $idPdl = $rel->idpdl(\@ids);
 $idPdl = $rel->idpdl($id);

Ensure PDL-ness of a set of integer IDs.

tupleIds
 $tupleIds = $rel->tupleIds($attrType, $attrName, $valIds);

Returns a PDL representing the set of index items of type $attrType whose value for the $attrName attribute is contained in the ID-set $valIds, which may be specified in any of the forms accepted by the idpdl() method.

$attrType is either 't' for a term-attribute (in which case the returned $tupleIds are term indices), or 'm' for a metadata attribute (in which case the returned $tupleIds are "category" indices). The returned $tupleIds are always sorted in ascending order.

Could use some optimization.
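
The selection semantics described above can be sketched without PDL as a linear scan over a toy attribute table. Note that this is not how the module works internally (it stores values in PDLs and keeps presorted index vectors such as tsorti); the data, attribute names and term_ids_sketch() helper below are hypothetical.

```perl
use strict;
use warnings;

## toy stand-in for {tvals}: $tvals[$ti] = [$aval_at_apos0, $aval_at_apos1]
my @tvals = ([10,1], [11,2], [10,2], [12,1]);
my %tpos  = (l => 0, p => 1);   ##-- hypothetical term attributes

## hypothetical helper: term indices whose $attr value is in @$valIds
sub term_ids_sketch {
  my ($attr, $valIds) = @_;
  my $apos = $tpos{$attr};
  return [] if !defined($apos);
  my %want = map { ($_ => 1) } @$valIds;
  ##-- scanning 0..$#tvals in order keeps the result ascending
  return [ grep { $want{ $tvals[$_][$apos] } } 0..$#tvals ];
}

my $by_l = term_ids_sketch('l', [10]);
my $by_p = term_ids_sketch('p', [2, 9]);
print "l in (10):  @$by_l\n";   ##-- prints: l in (10):  0 2
print "p in (2,9): @$by_p\n";   ##-- prints: p in (2,9): 1 2
```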

termIds
 $ti = $rel->termIds($tattrName, $valIds);

wraps $rel->tupleIds('t',$tattrName,$valIds).

catIds
 $ci = $rel->catIds($mattrName, $valIds);

wraps $rel->tupleIds('m',$mattrName,$valIds).

hasMeta
 $bool = $rel->hasMeta($mattr);

returns true iff $rel supports metadata attribute $mattr.

metaEnum
 $enum_or_undef = $rel->metaEnum($mattr);

returns metadata attribute enum for $mattr, or undef if $mattr is not supported.

catSubset
 $cats = $rel->catSubset($termIds);
 $cats = $rel->catSubset($termIds,$catIds);

Get a (sorted) cat-subset for the (sorted) term-set $termIds: the set of all "categories" (original source files) which contain at least one instance of any of the terms in $termIds, optionally restricted to the (sorted and unique) set $catIds. The returned category-IDs are sorted and unique.
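
As a rough illustration of these semantics, the sketch below computes a category subset over a toy sparse matrix stored as [$ti,$di,$f] triples, together with a doc->cat map. Everything here is hypothetical plain Perl, including cat_subset_sketch(); the module itself operates on the PDL::CCS::Nd matrix {tdm} and the {d2c} PDL.

```perl
use strict;
use warnings;

## toy sparse term-doc matrix as [$ti, $di, $f] triples, plus doc->cat map
my @tdm = ([0,0,2], [0,3,1], [1,1,4], [2,5,1]);
my @d2c = (0, 0, 1, 2, 2, 2);

## hypothetical helper: sorted, unique cats containing any term in
## @$termIds, optionally restricted to @$catIds
sub cat_subset_sketch {
  my ($termIds, $catIds) = @_;
  my %twant = map { ($_ => 1) } @$termIds;
  my %cats;
  for my $nz (@tdm) {
    my ($ti, $di, $f) = @$nz;
    $cats{ $d2c[$di] } = 1 if $twant{$ti} && $f > 0;
  }
  if ($catIds) {
    my %cwant = map { ($_ => 1) } @$catIds;
    delete @cats{ grep { !$cwant{$_} } keys %cats };
  }
  return [ sort { $a <=> $b } keys %cats ];
}

my $all  = cat_subset_sketch([0, 1]);
my $some = cat_subset_sketch([0, 1], [2]);
print "cats for terms (0,1): @$all\n";                      ##-- prints: 0 2
print "cats for terms (0,1), restricted to (2): @$some\n";  ##-- prints: 2
```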

groupby
 \%groupby = $rel->groupby($coldb, $groupby_request, %opts);
 \%groupby = $rel->groupby($coldb, \%groupby,        %opts);

Modified version of DiaColloDB::groupby() suitable for the pdl-ized TDF relation. $groupby_request is as for DiaColloDB::parseRequest(). Returns a HASH-ref:

 ##-- COMPAT: equivalent to DiaColloDB::groupby() return values
 req => $request,    ##-- save request
 areqs => \@areqs,   ##-- parsed attribute requests ([$attr,$ahaving, \%ainfo],...)
                     ##   + new: %ainfo = ( aname=>$enum_name, atype=>$t_or_m, apos=>$apos )
 attrs => \@attrs,   ##-- like $coldb->attrs($groupby_request), modulo "having" parts
 titles => \@titles, ##-- like map {$coldb->attrTitle($_)} @attrs
 ##
 ##-- NEW: for DiaColloDB::Relation::TDF
 how      => $ghow,     ##-- one of  't':groupby terms-only, 'c':groupby cats-only, 'tc':groupby terms+cats
 gatype   => $gatype,   ##-- pdl ($NG)         : attribute types $ai : 0 if $areqs->[$ai] is a term attribute, 1 if meta-attribute
 gapos    => $gapos,    ##-- pdl ($NG)         : term- or meta-attribute position indices $ai : $rel->mpos($attrs[$ai]) or $rel->tpos($attrs[$ai])
 ghavingt => $ghavingt, ##-- pdl ($NHavingTOk) : term indices $ti s.t. $ti matches groupby "having" requests, or undef
 ghavingc => $ghavingc, ##-- pdl ($NHavingCOk) : cat  indices $ci s.t. $ci matches groupby "having" requests, or undef
 g2s      => \&g2s,     ##-- stringification object suitable for DiaColloDB::Profile::stringify() [CODE,enum, or undef]
 gpack    => $packas,   ##-- pack template for groupby-keys

%opts:

 warn  => $level,    ##-- log-level for unknown attributes (default: 'warn')
 relax => $bool,     ##-- allow unsupported attributes (default=0)

Relation API: default: query info

qinfo
 \%qinfo = $rel->qinfo($coldb, %opts);

Gets the query-info hash for profile administrivia (DDC hit links). %opts are as for the profile() method. The returned hash \%qinfo should have keys:

 fcoef     => $fcoef,     ##-- frequency coefficient (constant 1 here)
 qtemplate => $qtemplate, ##-- query template with __W1.I1__ resp. __W2.I2__ replacing groupby fields

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2016 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

DiaColloDB::Relation(3pm), DiaColloDB::Relation::TDF::Query(3pm), DiaColloDB::Relation::Cofreqs(3pm), DiaColloDB::Relation::Unigrams(3pm), DiaColloDB::Relation::DDC(3pm), DiaColloDB(3pm), perl(1), ...