NAME

DiaColloDB::Corpus::Filters - collocation db, source corpus content filters

SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use DiaColloDB::Corpus::Filters;
 
 ##========================================================================
 ## Methods
 
 $filters = $CLASS_OR_OBJECT->new(%opts);
 $filters = $CLASS_OR_OBJECT->null();
 $filters = $filters->clear();
 $bool = $filters->isnull();
 $bool = $filters1->equal($filters2);
 \%name2obj = $filters->compile();
 \%line2undef = $coldb->loadListFile($filename_or_undef);
 

DESCRIPTION

DiaColloDB::Corpus::Filters is a class representing corpus content filters (e.g. stopword lists and regular expressions). It is used by DiaColloDB::Corpus::Compiled, and implicitly by the DiaColloDB::create() method as called by the top-level command-line utility dcdb-corpus-create.perl(1).

Administrivia

Variable: @ISA

DiaColloDB::Corpus::Filters inherits from DiaColloDB::Persistent. It also uses Exporter for compatibility with older versions of the DiaColloDB distribution in which the package-global default variables resided directly in the DiaColloDB package itself.

Defaults

(formerly defined in DiaColloDB.pm)

Don't use qr// for regex defaults, because Storable doesn't like pre-compiled Regexps.
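The following is a minimal sketch of the convention described above, not code taken from DiaColloDB: patterns are kept as plain strings so they can be serialized, and are compiled with qr// only at the point of use. The %patterns, $restored, and %compiled names are illustrative assumptions.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Patterns kept as plain strings, mirroring the q/.../ defaults in this section.
my %patterns = (
  pgood => q/^(?:N|TRUNC|VV|ADJ)/,
  wbad  => q/[\.]/,
);

# Plain strings survive a Storable round trip unchanged.
my $restored = thaw(freeze(\%patterns));

# Compile only when the patterns are actually needed.
my %compiled = map { ($_ => qr/$restored->{$_}/) } keys %$restored;

print "NN matches pgood\n"   if 'NN'   =~ $compiled{pgood};
print "'usw.' matches wbad\n" if 'usw.' =~ $compiled{wbad};
```

Keeping the stored form as a string also makes the defaults easy to override from a command line or a configuration file.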

Variable: $PGOOD_DEFAULT

Default positive PoS-regex for document parsing. Default = q/^(?:N|TRUNC|VV|ADJ)/.

Variable: $PBAD_DEFAULT

Default negative PoS-regex for document parsing. Default = undef (none).

Variable: $WGOOD_DEFAULT

Default positive word regex for document parsing. Default = q/[[:alpha:]]/.

Variable: $WBAD_DEFAULT

Default negative word regex for document parsing. Default = q/[\.]/.

Variable: $LGOOD_DEFAULT

Default positive lemma regex for document parsing. Default = undef (none).

Variable: $LBAD_DEFAULT

Default negative lemma regex for document parsing. Default = undef (none).
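To make the effect of these defaults concrete, here is a hedged sketch that applies them to a few hand-made tokens. The token_ok() helper and the token layout (keys w, p, l) are assumptions made for illustration, as is the rule that a token passes only if every defined "good" pattern matches and no defined "bad" pattern matches; the module's actual filtering logic may differ.

```perl
use strict;
use warnings;

# Default pattern strings as listed above; undef means "no filter".
my %defaults = (
  pgood => q/^(?:N|TRUNC|VV|ADJ)/,
  pbad  => undef,
  wgood => q/[[:alpha:]]/,
  wbad  => q/[\.]/,
  lgood => undef,
  lbad  => undef,
);

# Hypothetical helper (not part of DiaColloDB): a token passes if every
# defined "good" pattern matches and no defined "bad" pattern matches.
sub token_ok {
  my ($tok, $f) = @_;
  for my $attr (qw(p w l)) {
    my $val  = $tok->{$attr} // '';
    my $good = $f->{"${attr}good"};
    my $bad  = $f->{"${attr}bad"};
    return 0 if defined($good) && $val !~ /$good/;
    return 0 if defined($bad)  && $val =~ /$bad/;
  }
  return 1;
}

my @tokens = (
  { w => 'Haus', p => 'NN',    l => 'Haus'  },  # passes all defined filters
  { w => 'ging', p => 'VVFIN', l => 'gehen' },  # passes all defined filters
  { w => 'der',  p => 'ART',   l => 'd'     },  # fails pgood (PoS not listed)
  { w => 'Dr.',  p => 'NN',    l => 'Dr.'   },  # fails wbad (contains a dot)
);

printf("%-5s %-6s %s\n", $_->{w}, $_->{p}, token_ok($_, \%defaults) ? 'keep' : 'drop')
  for @tokens;
```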

Methods

new
 $filters = $CLASS_OR_OBJECT->new(%opts);

Returns a new DiaColloDB::Corpus::Filters object, which is a simple HASH-ref wrapping %opts:

 ##-- part-of-speech filters
 pgood     => $re,    ##-- PoS whitelist regex
 pgoodfile => $file,  ##-- PoS whitelist filename
 pbad      => $re,    ##-- PoS blacklist regex
 pbadfile  => $file,  ##-- PoS blacklist filename
 
 ##-- word surface text filters
 wgood     => $re,    ##-- word whitelist regex
 wgoodfile => $file,  ##-- word whitelist filename
 wbad      => $re,    ##-- word blacklist regex
 wbadfile  => $file,  ##-- word blacklist filename (= "stopword list")
 
 ##-- lemma filters
 lgood     => $re,    ##-- lemma whitelist regex
 lgoodfile => $file,  ##-- lemma whitelist filename
 lbad      => $re,    ##-- lemma blacklist regex
 lbadfile  => $file,  ##-- lemma blacklist filename

See "Defaults" for the default values.

null
 $filters = $CLASS_OR_OBJECT->null();

Returns a new DiaColloDB::Corpus::Filters object representing a "null-filter", i.e. with all filter properties undefined.

clear
 $filters = $filters->clear();

Deletes all filter properties (white- and blacklist regexes and filenames) from the $filters object.

isnull
 $bool = $filters->isnull();

Returns true iff $filters does not define any supported filter properties at all (i.e. application of $filters would be a no-op).

equal
 $bool = $filters1->equal($filters2);
 $bool = $CLASS->equal($filters1,$filters2)

Returns true iff the filter object operands define all and only the same supported filter properties with identical values.
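The semantics of isnull() and equal() can be sketched over plain hashes as follows. This is an illustration under stated assumptions, not the module's implementation: is_null() and filters_equal() are hypothetical stand-ins, the set of "supported" properties is taken to be the twelve keys listed under new(), and a key whose value is undef is treated as not defined.

```perl
use strict;
use warnings;

# The twelve property names listed under new(): six regexes, six list files.
my @KEYS = map { ($_, "${_}file") } qw(pgood pbad wgood wbad lgood lbad);

# Hypothetical stand-in for isnull(): true if no listed property is defined.
sub is_null {
  my ($f) = @_;
  return !grep { defined $f->{$_} } @KEYS;
}

# Hypothetical stand-in for equal(): same defined properties, same values.
sub filters_equal {
  my ($f1, $f2) = @_;
  for my $k (@KEYS) {
    my ($v1, $v2) = ($f1->{$k}, $f2->{$k});
    return 0 if (defined($v1) xor defined($v2));  # defined on one side only
    return 0 if defined($v1) && $v1 ne $v2;       # defined on both, but different
  }
  return 1;
}

my $f1 = { pgood => q/^N/, wbadfile => 'stop.txt' };
my $f2 = { wbadfile => 'stop.txt', pgood => q/^N/, note => 'ignored key' };
my $f3 = { pgood => q/^V/ };

print "f1 equals f2\n"  if filters_equal($f1, $f2);
print "f1 differs f3\n" if !filters_equal($f1, $f3);
print "empty is null\n" if is_null({});
```

Comparing only the listed keys means that unrelated hash entries, such as the note key above, do not affect the result.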

compile
 \%name2obj = $filters->compile();
 \%name2obj = $CLASS->compile(\%filters);

Returns a HASH-ref of compiled filter regexes and (stop|go)-hashes of the form

 ${NAME}     => $REGEXP,
 ${NAME}file => \%HASHREF,
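The result shape shown above can be sketched as follows. This is a simplified illustration rather than the module's code: compile_filters() and the loader callback are hypothetical, and the stub loader returns a fixed hash instead of reading a file.

```perl
use strict;
use warnings;

# Hypothetical sketch of the compile step: regex strings become qr// objects,
# "*file" entries become lookup hashes produced by a caller-supplied loader.
sub compile_filters {
  my ($filters, $load_list) = @_;
  my %compiled;
  for my $name (qw(pgood pbad wgood wbad lgood lbad)) {
    my $re   = $filters->{$name};
    my $file = $filters->{"${name}file"};
    $compiled{$name}         = qr/$re/             if defined $re;
    $compiled{"${name}file"} = $load_list->($file) if defined $file;
  }
  return \%compiled;
}

# Stub loader for illustration only: pretends every file lists two stopwords.
my $stub_loader = sub { return { der => undef, und => undef } };

my $c = compile_filters(
  { pgood => q/^(?:N|TRUNC|VV|ADJ)/, wbadfile => 'stopwords.txt' },
  $stub_loader,
);

print ref($c->{pgood}), "\n";                                  # Regexp
print(exists($c->{wbadfile}{der}) ? "stop\n" : "go\n");         # stop
```

Properties that are not defined in the input simply do not appear in the compiled hash, so callers can test for a filter with exists or defined before applying it.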

loadListFile
 \%line2undef = $coldb->loadListFile($filename_or_undef);

Low-level utility method used to load (stop|go)-list files.
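A list loader of this kind might look like the sketch below. It is not the module's implementation: load_list_file() is a hypothetical stand-in, and the file format (one entry per line, blank lines skipped, UTF-8 encoding, no comment syntax) is an assumption. The return value follows the %line2undef naming in the signature above, i.e. every line becomes a hash key with an undef value.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for a list loader: one entry per line, each entry
# becomes a hash key with an undef value. Returns undef if no file is given.
sub load_list_file {
  my ($file) = @_;
  return undef if !defined($file) || $file eq '';
  open(my $fh, '<:encoding(UTF-8)', $file) or die "open failed for '$file': $!";
  my %line2undef;
  while (defined(my $line = <$fh>)) {
    chomp $line;
    next if $line eq '';          # assumption: blank lines are skipped
    $line2undef{$line} = undef;
  }
  close($fh);
  return \%line2undef;
}

# Write a small stopword list to a temporary file and load it back.
my ($tmpfh, $tmpname) = tempfile(UNLINK => 1);
print {$tmpfh} "der\nund\n\nnicht\n";
close($tmpfh);

my $stop = load_list_file($tmpname);
print join(',', sort keys %$stop), "\n";    # der,nicht,und
```

Because the values are undef, membership has to be tested with exists rather than defined.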

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2020 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

dcdb-corpus-compile.perl(1), DiaColloDB::Corpus::Compiled(3pm), DiaColloDB(3pm), perl(1), ...