NAME

DTA::CAB::Analyzer - generic analyzer API

SYNOPSIS

 use DTA::CAB::Analyzer;
 
 ##========================================================================
 ## Constructors etc.
 
 $obj = $CLASS_OR_OBJ->new(%args);
 undef = $anl->initialize();
 undef = $anl->dropClosures();
 $label = $anl->defaultLabel();
 $class = $anl->analysisClass();
 @keys = $anl->typeKeys(\%opts);
 
 ##========================================================================
 ## Methods: I/O
 
 $bool = $anl->ensureLoaded();
 $bool = $anl->prepare();
 
 ##========================================================================
 ## Methods: Persistence: Perl
 
 @keys = $class_or_obj->noSaveKeys();
 $loadedObj = $CLASS_OR_OBJ->loadPerlRef($ref);
 
 @keys = $class_or_obj->noSaveBinKeys();
 $loadedObj = $CLASS_OR_OBJ->loadBinRef($ref);
 
 ##========================================================================
 ## Methods: Analysis: Utils
 
 $bool = $anl->canAnalyze();
 $bool = $anl->doAnalyze(\%opts, $name);
 $bool = $anl->enabled(\%opts);
 $bool = $anl->autoEnable();
 undef = $anl->initInfo();
 \@analyzers = $anl->subAnalyzers();
 
 ##========================================================================
 ## Methods: Analysis: API
 
 $doc = $anl->analyzeDocument($doc,\%opts);
 $doc = $anl->analyzeTypes($doc,\%types,\%opts);
 $doc = $anl->analyzeTokens($doc,\%opts);
 $doc = $anl->analyzeSentences($doc,\%opts);
 $doc = $anl->analyzeLocal($doc,\%opts);
 $doc = $anl->analyzeClean($doc,\%opts);
 
 ##========================================================================
 ## Methods: Analysis: Type-wise
 
 \%types = $anl->getTypes($doc);
 $doc = $anl->expandTypes($doc,\%types,\%opts);
 $doc = $anl->clearTypes($doc);
 
 ##========================================================================
 ## Methods: Analysis: Wrappers
 
 $tok = $anl->analyzeToken($tok_or_string,\%opts);
 $sent = $anl->analyzeSentence($sent_or_array,\%opts);
 $rpc_xml_base64 = $anl->analyzeData($data_str,\%opts);
 
 ##========================================================================
 ## Methods: Analysis: Closure Utilities
 
 \&closure = $anl->analyzeClosure($which);
 \&closure = $anl->getAnalyzeClosure($which);
 
 $closure = $anl->accessClosure( $methodName);

 PACKAGE::_am_xlit($tokvar);
 PACKAGE::_am_lts($tokvar);
 PACKAGE::_am_tt_list($ttvar);
 PACKAGE::_am_tt_fst($ttvar);
 PACKAGE::_am_id_fst($tokvar, $wvar);
 PACKAGE::_am_tt_fst_list($ttvar);
 PACKAGE::_am_fst_sort($listvar);
 PACKAGE::_am_fst_clean($hashvar);
 
 ##========================================================================
 ## Methods: XML-RPC
 
 \%opts = $anl->mergeOptions(\%defaultOptions,\%userOptions);
 @procedures = $anl->xmlRpcMethods();
 

DESCRIPTION

DTA::CAB::Analyzer is an abstract class and API specification for representing arbitrary semi-independent document analysis algorithms. Each analyzer sub-class should define at least one of the analyzeXYZ() methods (analyzeTypes(), analyzeTokens(), etc.), and each analyzer instance should set a 'name' key. Analyzer objects are assumed to be HASH refs, and should define at least a 'label' key to identify the analyzer object e.g. in a multi-analyzer processing chain.

DTA::CAB::Analyzer inherits from DTA::CAB::Persistent (and thus indirectly from DTA::CAB::Logger), and provides some basic hooks for extending the DTA::CAB::Persistent functionality. These routines are especially useful e.g. for defining analyzer parameters in a configuration file which can be passed to the dta-cab-analyze.perl command-line script via the "-config" option.

See DTA::CAB::Analyzer::Common for a list of common analyzer sub-classes.

See DTA::CAB::Chain for an abstract analyzer class representing simple linear analysis chains (aka "pipelines"), and see DTA::CAB::Chain::Multi for an abstract analyzer class representing a set of named analysis pipelines. Since analysis chains are themselves implemented as subclasses of DTA::CAB::Analyzer, analysis chains may be nested to arbitrary depth (at least in theory).

Constructors etc.

new
 $obj = CLASS_OR_OBJ->new(%args);

%$obj, %args:

 label => $label,    ##-- analyzer label (default: from class name)
 aclass => $class,   ##-- analysis class (optional; see $anl->analysisClass() method; default=undef)
 typeKeys => \@keys, ##-- analyzer type keys for $anl->typeKeys()
 enabled => $bool,   ##-- set to false, non-undef value to disable this analyzer
 initQuiet => $bool, ##-- if true, initInfo() will not print any output

initialize
 undef = $anl->initialize();

Initialize the analyzer. Default implementation does nothing.

dropClosures
 undef = $anl->dropClosures();

OBSOLETE: drops '_analyze*' closures. This method is a relic of an obsolete API, and should go away. The method name is still used with (basically) its original semantics by the (unmaintained) subclass DTA::CAB::Analyzer::Dyn.

Currently does nothing.

defaultLabel
 $label = $anl->defaultLabel();

Returns default label for this class. Default implementation returns the final segment of the Perl class-name.
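
The "final segment of the class name" rule can be sketched as a standalone function. This is only an illustration: the name default_label_sketch() is hypothetical, and whether the real method additionally normalizes the label (e.g. its case) is not stated here.

```perl
use strict;
use warnings;

##-- hypothetical stand-in for defaultLabel(): keeps only the last
##   "::"-separated segment of a class name (or of an object's class)
sub default_label_sketch {
  my $that  = shift;
  my $class = ref($that) || $that;
  (my $label = $class) =~ s/^.*:://;
  return $label;
}

print default_label_sketch('DTA::CAB::Chain::Multi'), "\n";  ##-- prints "Multi"
```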

analysisClass
 $class = $anl->analysisClass();

DEPRECATED: Gets cached $anl->{aclass} if exists, otherwise returns undef. Really just an ugly wrapper for $anl->{aclass}.

This method is an (unused) relic of an abandoned attempt to force all analysis outputs to be bless()ed Perl objects. Try to avoid it.

typeKeys
 @keys = $anl->typeKeys(\%opts);

Returns list of type-wise keys to be expanded for this analyzer by expandTypes(). Default returns @{$anl->{typeKeys}} if defined, otherwise ($anl->{label}).

The default is really annoying and potentially dangerous if you're not writing a type-wise analyzer, but most of the current analyzers do operate type-wise, so it was convenient. Override if necessary.
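
A minimal sketch of the documented default, using plain HASH refs in place of analyzer objects. The function name type_keys_sketch() and the labels 'morph' and 'lemma' are made up for illustration.

```perl
use strict;
use warnings;

##-- hypothetical stand-in for the default typeKeys():
##   explicit {typeKeys} if defined, otherwise the analyzer's own label
sub type_keys_sketch {
  my ($anl) = @_;
  return defined($anl->{typeKeys}) ? @{$anl->{typeKeys}} : ($anl->{label});
}

my @explicit = type_keys_sketch({ label => 'morph', typeKeys => [qw(morph lemma)] });
my @fallback = type_keys_sketch({ label => 'morph' });
print "explicit: @explicit\n";  ##-- explicit: morph lemma
print "fallback: @fallback\n";  ##-- fallback: morph
```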

Methods: I/O

ensureLoaded
 $bool = $anl->ensureLoaded();
 $bool = $anl->ensureLoaded(\%opts);

Ensures analysis data is loaded from default files, or that no data is available to be loaded. Should return false only if user has requested data to be loaded and that data cannot be loaded. "Empty" analyzers should return true here.

Default implementation always returns true.

This method is poorly named, and almost entirely useless, since some analyzers require it to be called very early, before other potentially relevant options have been evaluated. Returning false here may cause a host application (e.g. dta-cab-analyze.perl) to die(). Such behavior may not be desirable however if no analysis source data (e.g. dictionary files) was found (perhaps because it was undefined); see the canAnalyze() and autoDisable() methods for workarounds.

prepare
 $bool = $anl->prepare();
 $bool = $anl->prepare(\%opts);

Wrapper for ensureLoaded(), autoEnable(), initInfo(). Should probably replace top-level calls to ensureLoaded() in host applications.

Methods: Persistence

noSaveKeys
 @keys = $class_or_obj->noSaveKeys();

Returns list of keys not to be saved. Default implementation just greps for CODE-refs.
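
As a rough illustration of "grepping for CODE-refs", the following sketch filters the keys of a plain HASH ref. The name no_save_keys_sketch() is hypothetical, and the real method may also include keys inherited from its superclass.

```perl
use strict;
use warnings;

##-- hypothetical stand-in: keys whose values are CODE refs
sub no_save_keys_sketch {
  my ($obj) = @_;
  return grep { ref($obj->{$_}) eq 'CODE' } keys %$obj;
}

my $obj  = { label => 'demo', _analyzeToken => sub { $_[0] }, data => [1,2,3] };
my @keys = sort(no_save_keys_sketch($obj));
print "not saved: @keys\n";  ##-- not saved: _analyzeToken
```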

loadPerlRef
 $loadedObj = $CLASS_OR_OBJ->loadPerlRef($ref);

Default implementation just clobbers $CLASS_OR_OBJ with $ref and blesses.

noSaveBinKeys
 @keys = $class_or_obj->noSaveBinKeys();

Returns list of keys not to be saved for binary mode. Default just greps for CODE-refs.

loadBinRef
 $loadedObj = $CLASS_OR_OBJ->loadBinRef($ref);

Implicitly calls $OBJ->dropClosures().

Methods: Analysis: Utils

canAnalyze
 $bool = $anl->canAnalyze();
 $bool = $anl->canAnalyze(\%opts);

Returns true iff analyzer can perform its function (e.g. data is loaded & non-empty). Default implementation always returns true.

doAnalyze
 $bool = $anl->doAnalyze(\%opts, $name);

Alias for $anl->can("analyze${name}") && (!exists($opts{"doAnalyze${name}"}) || $opts{"doAnalyze${name}"}).
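
The expression above can be tried out with a toy class. In this sketch, My::DemoAnalyzer and do_analyze_sketch() are hypothetical names, and \%opts is treated as a HASH ref.

```perl
use strict;
use warnings;

package My::DemoAnalyzer;   ##-- hypothetical class: only defines analyzeTokens()
sub new           { return bless {}, shift }
sub analyzeTokens { return $_[1] }

package main;

##-- true iff the method exists and the caller did not switch it off
sub do_analyze_sketch {
  my ($anl, $opts, $name) = @_;
  $opts ||= {};
  return $anl->can("analyze${name}")
    && (!exists($opts->{"doAnalyze${name}"}) || $opts->{"doAnalyze${name}"});
}

my $anl = My::DemoAnalyzer->new();
print "Tokens:    ", (do_analyze_sketch($anl, {}, 'Tokens')    ? 1 : 0), "\n";  ##-- 1
print "Sentences: ", (do_analyze_sketch($anl, {}, 'Sentences') ? 1 : 0), "\n";  ##-- 0
```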

enabled
 $bool = $anl->enabled(\%opts);

Returns true if analyzer SHOULD operate, according to %opts. Default returns:

 (!defined($anl->{enabled}) || $anl->{enabled})                           ##-- globally enabled
 &&
 (!$opts || !defined($opts{"${lab}_enabled"}) || $opts{"${lab}_enabled"})  ##-- ... and locally enabled

autoEnable
 $bool = $anl->autoEnable();
 $bool = $anl->autoEnable(\%opts);

Sets $anl->{enabled} flag if not already defined. Calls $anl->canAnalyze(\%opts). Returns new value of $anl->{enabled}. Implicitly calls autoEnable() on all sub-analyzers.
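
How the global {enabled} flag, the per-call "${label}_enabled" option, and autoEnable() might interact is sketched below on plain HASH refs. Both function names are hypothetical, the canAnalyze() result is passed in by hand, and the recursion into sub-analyzers is left out.

```perl
use strict;
use warnings;

##-- hypothetical stand-in for enabled(): global flag first, then local option
sub enabled_sketch {
  my ($anl, $opts) = @_;
  my $lab = $anl->{label};
  return (!defined($anl->{enabled}) || $anl->{enabled})
    && (!$opts || !defined($opts->{"${lab}_enabled"}) || $opts->{"${lab}_enabled"});
}

##-- hypothetical stand-in for autoEnable(): only fills in an undefined flag
sub auto_enable_sketch {
  my ($anl, $can_analyze) = @_;
  $anl->{enabled} = ($can_analyze ? 1 : 0) if !defined($anl->{enabled});
  return $anl->{enabled};
}

my $anl = { label => 'xlit' };
print "locally disabled: ", (enabled_sketch($anl, { xlit_enabled => 0 }) ? 1 : 0), "\n";  ##-- 0
print "auto-enabled:     ", auto_enable_sketch($anl, 0), "\n";                            ##-- 0
```

Because an already-defined {enabled} flag is left alone, a value set explicitly in a configuration file would survive the automatic check in this sketch.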

autoDisable

Alias for autoEnable().

initInfo
 undef = $anl->initInfo();

Logs initialization info. Default method reports values of {label}, enabled(). Sets $anl->{initQuiet}=1 (don't report multiple times).

subAnalyzers
 \@analyzers = $anl->subAnalyzers();
 \@analyzers = $anl->subAnalyzers(\%opts);

Returns a list of all sub-analyzers for this object. Default returns all DTA::CAB::Analyzer subclass instances in values(%$anl).
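
The default lookup can be approximated as a grep over the object's values. In this sketch the classes My::Base and My::Child and the function sub_analyzers_sketch() are hypothetical stand-ins, and the result is returned as an ARRAY ref to match the signature above.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

package My::Base;  sub new { my $class = shift; return bless {@_}, $class }
package My::Child; our @ISA = ('My::Base');
package main;

##-- all values of %$anl which are instances of (a subclass of) $base
sub sub_analyzers_sketch {
  my ($anl, $base) = @_;
  return [ grep { blessed($_) && $_->isa($base) } values %$anl ];
}

my $chain = My::Base->new(label => 'chain', a1 => My::Child->new(label => 'a1'), n => 42, h => {});
my $subs  = sub_analyzers_sketch($chain, 'My::Base');
print scalar(@$subs), " sub-analyzer(s): $subs->[0]{label}\n";  ##-- 1 sub-analyzer(s): a1
```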

Methods: Analysis: API

analyzeDocument
 $doc = $anl->analyzeDocument($doc,\%opts);

Top-level API routine: analyze a DTA::CAB::Document $doc. Default implementation just calls:

 $doc = toDocument($doc);
 if ($anl->doAnalyze(\%opts,'Types')) {
   $types = $anl->getTypes($doc);
   $anl->analyzeTypes($doc,$types,\%opts);
   $anl->expandTypes($doc,$types,\%opts);
   $anl->clearTypes($doc);
 }
 $anl->analyzeTokens($doc,\%opts)    if ($anl->doAnalyze(\%opts,'Tokens'));
 $anl->analyzeSentences($doc,\%opts) if ($anl->doAnalyze(\%opts,'Sentences'));
 $anl->analyzeLocal($doc,\%opts)     if ($anl->doAnalyze(\%opts,'Local'));
 $anl->analyzeClean($doc,\%opts)     if ($anl->doAnalyze(\%opts,'Clean'));

analyzeTypes
 $doc = $anl->analyzeTypes($doc,\%types,\%opts);

Perform type-wise analysis of all (text) types in \%types (default is $doc->{types}). Default implementation does nothing.

analyzeTokens
 $doc = $anl->analyzeTokens($doc,\%opts);

Perform token-wise analysis of all tokens $doc->{body}[$si]{tokens}[$wi]. Default implementation does nothing.

analyzeSentences
 $doc = $anl->analyzeSentences($doc,\%opts);

Perform sentence-wise analysis of all sentences $doc->{body}[$si]. Default implementation does nothing.

analyzeLocal
 $doc = $anl->analyzeLocal($doc,\%opts);

Perform analyzer-local document-level analysis of $doc. Default implementation does nothing.

analyzeClean
 $doc = $anl->analyzeClean($doc,\%opts);

Cleanup any temporary data associated with $doc. Default implementation does nothing.

Methods: Analysis: Type-wise

getTypes
 \%types = $anl->getTypes($doc);

Returns a hash

 \%types = ($typeText => $typeToken, ...)

mapping token text to basic token objects (with only 'text' key defined). Default implementation just calls $doc->types().

expandTypes
 $doc = $anl->expandTypes($doc,\%types,\%opts);

Expands \%types into $doc->{body} tokens. Default implementation just calls $doc->expandTypeKeys(\@typeKeys,\%types), where \@typeKeys is derived from $anl->typeKeys().
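
The type-wise round trip (collect types, analyze each type once, copy the results back onto the tokens) can be sketched on plain nested hashes. This is not the DTA::CAB::Document implementation: get_types_sketch(), expand_types_sketch(), and the 'len' key are invented for illustration.

```perl
use strict;
use warnings;

##-- plain-hash stand-in for a document: $doc->{body}[$si]{tokens}[$wi]{text}
my $doc = { body => [
  { tokens => [ {text=>'der'}, {text=>'Hund'} ] },
  { tokens => [ {text=>'der'}, {text=>'Mond'} ] },
] };

##-- map each distinct token text to a basic token with only 'text' defined
sub get_types_sketch {
  my ($d) = @_;
  my %types;
  $types{$_->{text}} ||= { text => $_->{text} } for map { @{$_->{tokens}} } @{$d->{body}};
  return \%types;
}

##-- copy the requested keys from each type onto every token of that type
sub expand_types_sketch {
  my ($d, $types, @keys) = @_;
  for my $tok (map { @{$_->{tokens}} } @{$d->{body}}) {
    my $type = $types->{$tok->{text}} or next;
    $tok->{$_} = $type->{$_} for grep { exists $type->{$_} } @keys;
  }
  return $d;
}

my $types = get_types_sketch($doc);
$_->{len} = length($_->{text}) for values %$types;  ##-- toy "type-wise analysis"
expand_types_sketch($doc, $types, 'len');
print scalar(keys %$types), " types for 4 tokens\n";  ##-- 3 types for 4 tokens
```

The point of the detour is that 'der' is analyzed once but both of its tokens receive the result.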

clearTypes
 $doc = $anl->clearTypes($doc);

Clears cached type->object map in $doc->{types}. Default just calls $doc->clearTypes().

Methods: Analysis: Wrappers

analyzeToken
 $tok = $anl->analyzeToken($tok_or_string,\%opts);

Compatibility wrapper: perform type- and token-wise analyses on $tok_or_string. Really just a wrapper for $anl->analyzeDocument().

analyzeSentence
 $sent = $anl->analyzeSentence($sent_or_array,\%opts);

Compatibility wrapper: perform type-, token-, and sentence-wise analyses on $sent_or_array. Really just a wrapper for $anl->analyzeDocument().

analyzeData
 $rpc_xml_base64 = $anl->analyzeData($data_str,\%opts);

Analyze a raw (formatted) data string $data_str with internal parsing & formatting. Really just a wrapper for $anl->analyzeDocument().

Methods: Analysis: Closure Utilities (optional)

analyzeClosure
 \&closure = $anl->analyzeClosure($which);

Optional utility for closure-based analysis. Returns cached $anl->{"_analyze${which}"} if present; otherwise calls $anl->getAnalyzeClosure($which) & caches result.
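
The caching behaviour described above amounts to a build-once lookup. In the following sketch both function names are hypothetical, and the counter $builds exists only to show that the builder runs a single time.

```perl
use strict;
use warnings;

my $builds = 0;  ##-- counts how often the (hypothetical) builder runs

##-- hypothetical stand-in for getAnalyzeClosure()
sub get_analyze_closure_sketch {
  my ($anl, $which) = @_;
  ++$builds;
  return sub { return "analyze${which}(" . join(',', @_) . ")" };
}

##-- return cached $anl->{"_analyze${which}"} if present, else build and cache
sub analyze_closure_sketch {
  my ($anl, $which) = @_;
  return $anl->{"_analyze${which}"} //= get_analyze_closure_sketch($anl, $which);
}

my $anl   = {};
my $first = analyze_closure_sketch($anl, 'Token');
my $again = analyze_closure_sketch($anl, 'Token');
print $first->('Haus'), " (builds: $builds)\n";  ##-- analyzeToken(Haus) (builds: 1)
```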

getAnalyzeClosure
 \&closure = $anl->getAnalyzeClosure($which);

Returns closure \&closure for analyzing data of type "$which" (e.g. Word, Type, Token, Sentence, Document, ...). Default implementation calls $anl->getAnalyze"${which}"Closure() if available, otherwise croak()s.

accessClosure
 $closure = $anl->accessClosure(\&codeRef,    %opts);
 $closure = $anl->accessClosure( $methodName, %opts);
 $closure = $anl->accessClosure( $codeString, %opts);

Returns accessor-closure $closure for $anl. Passed argument can be one of the following:

$codeRef

a CODE ref resolves to itself

$methodName

a method name resolves to $anl->can($methodName)

$codeString

any other string resolves to 'sub { $codeString }'; which may reference the closure variable $anl

Additional options for $codeString pseudo-accessors can be passed in %opts:

 pre => $prefix,     ##-- compiles as "${prefix}; sub {$code}"
 vars => \@vars,     ##-- compiles as 'my ('.join(',',@vars).'); '."sub {$code}"
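
The three resolution rules and the 'pre' and 'vars' options might be approximated as follows. This is a sketch under assumptions, not the module's code: My::Anl and access_closure_sketch() are hypothetical, and treating only word-like strings as candidate method names is a simplification chosen here.

```perl
use strict;
use warnings;

package My::Anl;  ##-- hypothetical analyzer class
sub new   { my $class = shift; return bless {@_}, $class }
sub label { return $_[0]{label} }

package main;

sub access_closure_sketch {
  my ($anl, $arg, %opts) = @_;
  return $arg if ref($arg) eq 'CODE';                       ##-- CODE ref: itself
  my $method = ($arg =~ /^\w+$/) ? $anl->can($arg) : undef;
  return $method if $method;                                ##-- method name
  my $src = (defined($opts{pre}) ? "$opts{pre}; " : '')
          . ($opts{vars} ? 'my (' . join(',', @{$opts{vars}}) . '); ' : '')
          . "sub { $arg }";
  my $sub = eval $src;  ##-- the compiled code can see $anl as a closure variable
  die "could not compile accessor '$src': $@" if !$sub;
  return $sub;
}

my $anl     = My::Anl->new(label => 'demo');
my $by_name = access_closure_sketch($anl, 'label');
my $by_code = access_closure_sketch($anl, 'uc($anl->{label}) . ":" . $_[0]');
my $counter = access_closure_sketch($anl, '++$n', vars => ['$n']);
print $by_name->($anl), " ", $by_code->('x'), "\n";  ##-- demo DEMO:x
```

Since the code string is compiled with eval, it should only ever come from trusted configuration.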

Methods: Analysis: Closure Utilities: Macros

In order to facilitate development of analyzer-local accessor code in string form, the following "macros" are defined as exportable functions. Their arguments and return values are strings suitable for inclusion in accessor macros. These macros are exported by the tags ':access', ':child', and ':all'.

_am_xlit
 PACKAGE::_am_xlit($tokvar='$_');

access-closure macro: get xlit or text for token $$tokvar; evaluates to a string: ($$tokvar->{xlit} ? $$tokvar->{xlit}{latin1Text} : $$tokvar->{text})

_am_lts
 PACKAGE::_am_lts($tokvar='$_');

access-closure macro for first LTS analysis of token $$tokvar; evaluates to string: ($$tokvar->{lts} && @{$$tokvar->{lts}} ? $$tokvar->{lts}[0]{hi} : $$tokvar->{text})

_am_tt_list
 PACKAGE::_am_tt_list($ttvar='$_');

access-closure macro for a TT-style list of strings $$ttvar; evaluates to a list: split(/\t/,$$ttvar)

_am_tt_fst
 PACKAGE::_am_tt_fst($ttvar='$_');

(formerly multiply defined in sub-packages as SUBPACKAGE::parseFstString())

access-closure macro for a single TT-style FST analysis $$ttvar; evaluates to an FST-analysis hash {hi=>$hi,w=>$w,lo=>$lo,lemma=>$lemma}:

    (
     $$ttvar =~ /^(?:(.*?) \: )?(?:(.*?) \@ )?(.*?)(?: \<([\d\.\+\-eE]+)\>)?$/
     ? {(defined($1) ? (lo=>$1) : qw()), (defined($2) ? (lemma=>$2) : qw()), hi=>$3, w=>($4||0)}
     : {hi=>$$ttvar}
    )

_am_id_fst
 PACKAGE::_am_id_fst($tokvar='$_', $wvar='0');

access-closure macro for an identity FST analysis; evaluates to a single fst analysis hash: {hi=>_am_xlit($tokvar), w=>$$wvar}

_am_tt_fst_list
 PACKAGE::_am_tt_fst_list($ttvar='$_');

access-closure macro for a list of TT-style FST analyses $$ttvar; evaluates to a list of fst analysis hashes: (map {_am_tt_fst('$_')} split(/\t/,$$ttvar))
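
The macros themselves return code strings. The sketch below instead runs the code that _am_tt_fst and _am_tt_fst_list are documented to expand to, wrapped in two hypothetical helper functions. The analysis strings are invented sample data, assuming the "LO : LEMMA @ HI <WEIGHT>" layout implied by the pattern.

```perl
use strict;
use warnings;

##-- hypothetical helper: applies the documented _am_tt_fst pattern to one string
sub parse_tt_fst {
  my ($tt) = @_;
  return $tt =~ /^(?:(.*?) \: )?(?:(.*?) \@ )?(.*?)(?: \<([\d\.\+\-eE]+)\>)?$/
    ? +{ (defined($1) ? (lo=>$1) : ()), (defined($2) ? (lemma=>$2) : ()), hi=>$3, w=>($4||0) }
    : +{ hi => $tt };
}

##-- hypothetical helper: one analysis per TAB-separated field
sub parse_tt_fst_list {
  my ($tt) = @_;
  return map { parse_tt_fst($_) } split(/\t/, $tt);
}

my @analyses = parse_tt_fst_list("Haus : Haus[_NN] <1.5>\tHaus[_NE]");
printf("hi=%s w=%s\n", $_->{hi}, $_->{w}) for @analyses;
##-- hi=Haus[_NN] w=1.5
##-- hi=Haus[_NE] w=0
```

Fields without an explicit "<WEIGHT>" suffix end up with w=>0, and the lo and lemma keys are only present when the corresponding prefix was found.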

_am_tt_fst_eqlist
 PACKAGE::_am_tt_fst_eqlist($ttvar='$tt', $tokvar='$_', $wvar='0');

access-closure macro for a list of TT-style FST analyses $$ttvar; evaluates to a list of fst analysis hashes: (_am_id_fst($tokvar,$wvar), _am_tt_fst_list($ttvar))

_am_fst_sort
 PACKAGE::_am_fst_sort($listvar='@_');

access-closure macro to sort a list of FST analyses $$listvar by weight; evaluates to a sorted list of fst analysis hashes: (sort {($a->{w}||0) <=> ($b->{w}||0) || ($a->{hi}||"") cmp ($b->{hi}||"")} $$listvar)
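
Applied to a small hand-made list, the documented sort expression orders analyses by ascending weight (a missing weight counts as 0) and breaks ties by the 'hi' string. The sample data below is invented.

```perl
use strict;
use warnings;

my @analyses = ({hi=>'b', w=>2}, {hi=>'c'}, {hi=>'a', w=>2}, {hi=>'d', w=>0.5});

##-- same comparison as documented for _am_fst_sort
my @sorted = sort {
  ($a->{w}||0) <=> ($b->{w}||0) || ($a->{hi}||"") cmp ($b->{hi}||"")
} @analyses;

print join(' ', map { $_->{hi} } @sorted), "\n";  ##-- c d a b
```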

_am_fst_clean
 PACKAGE::_am_fst_clean($hashvar='$_->{$lab}');

access-closure macro to delete undefined hash entries; evaluates to: delete($$hashvar) if (!defined($$hashvar));

Methods: XML-RPC

mergeOptions
 \%opts = $anl->mergeOptions(\%defaultOptions,\%userOptions);

Returns options hash like (%defaultOptions,%userOptions) [user clobbers default].
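
In plain Perl the described merge is a hash-list concatenation in which later keys win. The function name merge_options_sketch() and the option keys are hypothetical.

```perl
use strict;
use warnings;

##-- user options clobber defaults; neither input hash is modified
sub merge_options_sketch {
  my ($defaults, $user) = @_;
  return { %{$defaults || {}}, %{$user || {}} };
}

my $opts = merge_options_sketch({ doAnalyzeTokens => 1, verbose => 0 }, { verbose => 1 });
print join(', ', map { "$_=$opts->{$_}" } sort keys %$opts), "\n";  ##-- doAnalyzeTokens=1, verbose=1
```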

xmlRpcMethods
 @procedures = $anl->xmlRpcMethods();
 @procedures = $anl->xmlRpcMethods($prefix,\%opts);

  • returns a list of procedures suitable for passing to RPC::XML::Server::add_proc()

  • additional keys recognized in procedure specs: see DTA::CAB::Server::XmlRpc::prepareLocal()

  • "${prefix}." is prepended to procedure 'name' key if $prefix is specified

  • \%opts are passed to analyze methods if defined

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2019 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

dta-cab-analyze.perl(1), DTA::CAB::Analyzer::Common(3pm), DTA::CAB::Chain(3pm), DTA::CAB::Chain::Multi(3pm), DTA::CAB::Format(3pm), DTA::CAB(3pm), perl(1), ...