NAME

DiaColloDB::Corpus::Compiled - collocation db, source corpus (pre-compiled)

SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use DiaColloDB::Corpus::Compiled;
 
 ##========================================================================
 ## Constructors etc.
 
 $corpus = $CLASS_OR_OBJECT->new(%args);
 
 ##========================================================================
 ## Persistent API
 
 @keys = $obj->headerKeys();
 @files = $obj->diskFiles();
 $bool = $obj->unlink(%opts);
 
 ##========================================================================
 ## Corpus API
 
 ##-- Corpus API: open/close
 $bool = $corpus->open([$dbdir], %opts);  ##-- compat
 $bool = $corpus->close();
 
 ##-- Corpus API: iteration
 $nfiles = $corpus->size();
 $bool = $corpus->iok();
 $label = $corpus->ifile();
 $doc_or_undef = $corpus->idocument();
 
 ##========================================================================
 ## Compiled API
 
 $ccorpus = $CLASS_OR_OBJECT->create($src_corpus, %opts);
 $ccorpus = $CLASS_OR_OBJECT->union(\@sources, %opts);
 
 ##========================================================================
 ## Convenience Methods
 
 $bool = $corpus->opened();
 $bool = $corpus->flush();
 $corpus = $corpus->reopen(%opts);
 
 $dirname = $corpus->datadir();
 $bool = $corpus->truncate();
 $filters = $ccorpus->filters();
 

DESCRIPTION

DiaColloDB::Corpus::Compiled is an intermediate abstraction layer for storing pre-filtered corpus data in a format suitable for fast I/O. It should not be necessary for end users to use this class directly, since the DiaColloDB::create() method should implicitly create a (temporary) DiaColloDB::Corpus::Compiled object whenever required.

Globals & Constants

Variable: @ISA

DiaColloDB::Corpus::Compiled inherits from DiaColloDB::Corpus and supports all DiaColloDB::Corpus methods.

Constructors etc.

new
 $corpus = $CLASS_OR_OBJECT->new(%args);

%args, object structure:

   (
    ##-- NEW in DiaColloDB::Corpus::Compiled
    dbdir   => $dbdir,     ##-- data directory for compiled corpus
    flags   => $flags,     ##-- open mode flags (fcntl flags or perl-style; default='r')
    filters => \%filters,  ##-- corpus filters ( DiaColloDB::Corpus::Filters object or HASH-ref )
    njobs   => $njobs,     ##-- number of parallel worker jobs for create(); default=-1 (= nCores)
    temp    => $bool,      ##-- implicitly unlink() on exit?
    logThreads => $level,  ##-- log-level for thread stuff (default='off')
    ##
    ##-- INHERITED from DiaColloDB::Corpus
    #files => \@files,      ##-- source files (OVERRIDE: unused)
    #dclass => $dclass,     ##-- DiaColloDB::Document subclass for loading (OVERRIDE forces 'DiaColloDB::Document::JSON')
    dopts  => \%opts,      ##-- options for $dclass->fromFile() (override default={})
    cur    => $i,          ##-- index of current file
    logOpen => $level,     ##-- log-level for open(); default='info'
   )

Implicitly calls the open() method if the dbdir property is defined.
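
As a minimal sketch of the constructor in use, the helper below opens an existing compiled corpus read-only and checks the result with opened(). The helper name load_compiled() is hypothetical, and treating a false return value or an unopened object as failure is an assumption of this sketch rather than documented behavior.

```perl
use strict;
use warnings;

## load_compiled($class, $dbdir): hypothetical helper, not part of DiaColloDB.
## $class is expected to be 'DiaColloDB::Corpus::Compiled' (already loaded).
sub load_compiled {
  my ($class, $dbdir) = @_;

  ##-- new() implicitly calls open() because dbdir is defined
  my $corpus = $class->new(dbdir => $dbdir, flags => 'r');

  ##-- ASSUMPTION: failure shows up as a false value or an unopened object
  die "failed to open compiled corpus '$dbdir'\n"
    unless $corpus && $corpus->opened();

  return $corpus;
}
```

After loading the module with use DiaColloDB::Corpus::Compiled, a call might look like load_compiled('DiaColloDB::Corpus::Compiled', 'corpus.d'), where corpus.d is a placeholder directory name.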

DESTROY

Destructor implicitly calls the close() method, and may also implicitly call unlink() if the temp property is true.

Persistent API

headerKeys
 @keys = $obj->headerKeys();

Override filters out more object-specific keys.

diskFiles
 @files = $obj->diskFiles();

Returns disk storage files; override returns the singleton list ($obj->{dbdir}).

unlink
 $bool = $obj->unlink(%opts);

Removes all disk file(s) associated with the object. Override accepts additional %opts:

 close => $bool,  ##-- call $obj->close() before unlinking? (default=1)

Corpus API: open/close

open
 $bool = $corpus->open([$dbdir], %opts);  ##-- compat
 $bool = $corpus->open($dbdir,   %opts);  ##-- new

Opens compiled corpus directory $dbdir, which must be specified as either a simple scalar or a singleton ARRAY-ref, or must already be defined as $corpus->{dbdir} or $opts{dbdir}.
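
The ways of supplying $dbdir described above can be illustrated with a small stand-alone sketch. The function resolve_dbdir() is hypothetical and is not the module's code; in particular, the order in which the fallbacks are consulted is an assumption made here for illustration.

```perl
use strict;
use warnings;

## resolve_dbdir($corpus, $arg, %opts): hypothetical illustration only.
## $arg may be a plain scalar, a singleton ARRAY-ref, or undef.
sub resolve_dbdir {
  my ($corpus, $arg, %opts) = @_;

  ##-- compat form: unwrap a singleton ARRAY-ref
  if (ref($arg) eq 'ARRAY') {
    die "expected exactly one directory\n" if @$arg != 1;
    $arg = $arg->[0];
  }

  ##-- ASSUMPTION: explicit argument first, then $opts{dbdir}, then the object
  my $dbdir = $arg // $opts{dbdir} // $corpus->{dbdir};
  die "no dbdir specified\n" if !defined $dbdir;

  return $dbdir;
}
```

Under these assumptions, resolve_dbdir($corpus, 'corpus.d') and resolve_dbdir($corpus, ['corpus.d']) select the same directory, mirroring the "new" and "compat" call forms of open().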

Superclass %opts accepted by DiaColloDB::Corpus:

 compiled => $bool, ##-- implicitly true here
 glob => $bool,     ##-- (ignored here) whether to glob arguments
 list => $bool,     ##-- (ignored here) whether arguments are file-lists

close
 $bool = $corpus->close();

Close currently opened corpus if any. Override implicitly calls $corpus->flush() if $corpus is opened in write-mode.

Corpus API: iteration

size
 $nfiles = $corpus->size();

Returns total number of file(s) in the corpus (constant time).

iok
 $bool = $corpus->iok();

True if corpus file-iterator is valid.

ifile
 $label = $corpus->ifile();
 $label = $corpus->ifile($pos);

Get current iterator filename (first form), or filename at index $pos (second form). Override always returns filenames of the form "$corpus->{dbdir}/$pos.json".

idocument
 $doc_or_undef = $corpus->idocument();
 $doc_or_undef = $corpus->idocument($pos);

Gets current document (first form) or document at index $pos (second form).
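
The index-based forms of ifile() and idocument() allow a corpus to be traversed without touching the iterator state. The following is a minimal sketch using only size(), ifile($pos) and idocument($pos); the helper name corpus_entries() is hypothetical, and skipping undefined documents is a choice made for this example.

```perl
use strict;
use warnings;

## corpus_entries($corpus): hypothetical helper, not part of DiaColloDB.
## Returns a list of [$label, $doc] pairs for an opened corpus object.
sub corpus_entries {
  my $corpus = shift;
  my @entries;

  ##-- positions run from 0 to size()-1; the range is empty for an empty corpus
  for my $pos (0 .. $corpus->size() - 1) {
    my $doc = $corpus->idocument($pos);
    next if !defined $doc;    ##-- idocument() may return undef
    push @entries, [ $corpus->ifile($pos), $doc ];
  }

  return @entries;
}
```

For a compiled corpus, each label returned this way should have the form "$corpus->{dbdir}/$pos.json".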

Corpus::Compiled API

create
 $ccorpus = $CLASS->create($src_corpus,   %opts);
 $ccorpus = $ccorpus->create($src_corpus, %opts);

Compile or append a single $src_corpus to the compiled corpus directory $opts{dbdir}. If specified, %opts overrides the properties in %$ccorpus. Returns a (possibly new) DiaColloDB::Corpus::Compiled object $ccorpus. Honors perl- or fcntl-style $opts{flags} for append and truncate.

Parses all document file(s) from $src_corpus, applies the corpus content filters given by $ccorpus->{filters} (a HASH-ref or DiaColloDB::Corpus::Filters object), and saves the compiled data to the compiled corpus directory $ccorpus->{dbdir}. If the threads module is available, compilation may use multiple parallel threads as specified by the $DiaColloDB::NJOBS variable; see DiaColloDB::Utils::nJobs() for details.
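
A call to create() might be wrapped as in the following sketch. The helper name compile_corpus() is hypothetical; the flag string 'w' is assumed to be a valid perl-style spelling for write/truncate mode, and a false return value from create() is assumed to indicate failure.

```perl
use strict;
use warnings;

## compile_corpus($class, $src_corpus, $dbdir, %opts): hypothetical helper.
## $class is expected to be 'DiaColloDB::Corpus::Compiled' (already loaded),
## $src_corpus an opened source corpus object.
sub compile_corpus {
  my ($class, $src_corpus, $dbdir, %opts) = @_;

  ##-- ASSUMPTION: 'w' is an accepted perl-style flag for write/truncate;
  ##   caller-supplied %opts (e.g. flags, filters, njobs) take precedence
  my $ccorpus = $class->create($src_corpus, dbdir => $dbdir, flags => 'w', %opts)
    or die "create() failed for '$dbdir'\n";

  return $ccorpus;
}
```

Because %opts is placed last in the argument list, a caller can override the default flags, for instance to append to an existing directory instead of truncating it.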

union
 $ccorpus = $CLASS->union(\@sources, %opts);
 $ccorpus = $ccorpus->union(\@sources, %opts);

Merges pre-compiled corpora \@sources into the output directory $opts{dbdir}. If specified, %opts overrides the properties in %$ccorpus. Returns a (possibly new) DiaColloDB::Corpus::Compiled object $ccorpus representing the union over @sources. Honors $ccorpus->{flags} for append and truncate.

Each $src in \@sources is either a DiaColloDB::Corpus::Compiled object or a simple scalar (which is interpreted as the dbdir of a DiaColloDB::Corpus::Compiled object). No content filters are applied, and output data files are created as links to the input data files from @sources (hard links if possible, otherwise symbolic links).
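
Merging several compiled corpora could look roughly like the sketch below. The helper name merge_compiled() is hypothetical, and a false return value from union() is assumed to indicate failure.

```perl
use strict;
use warnings;

## merge_compiled($class, $outdir, @sources): hypothetical helper.
## Each element of @sources is a compiled corpus object or a dbdir name.
sub merge_compiled {
  my ($class, $outdir, @sources) = @_;
  die "no sources given\n" if !@sources;

  ##-- union() takes the sources as an ARRAY-ref, followed by %opts
  my $ccorpus = $class->union(\@sources, dbdir => $outdir)
    or die "union() failed for '$outdir'\n";

  return $ccorpus;
}
```

Since union() links rather than copies the data files, the source directories presumably need to remain in place for as long as the merged corpus is in use.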

Convenience Methods: disk files etc.

datadir
 $dirname = $corpus->datadir();
 $dirname = $corpus->datadir($dir);

Wrapper for $corpus->{dbdir}.

truncate
 $bool = $corpus->truncate();

Removes all disk data (including header) and resets $corpus->{size} to 0 (zero).

filters
 $filters = $ccorpus->filters();

Return corpus content filters as a DiaColloDB::Corpus::Filters object.

Convenience Methods: open/close

opened
 $bool = $corpus->opened();

Returns true iff $corpus is currently opened.

flush
 $bool = $corpus->flush();

Writes any pending corpus data (e.g. header) to disk.

reopen
 $corpus = $corpus->reopen(%opts);

Closes and re-opens the corpus, e.g. with different flags.

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2020 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

dcdb-corpus-compile.perl(1), dcdb-create.perl(1), DiaColloDB::Corpus::Filters(3pm), DiaColloDB::Corpus(3pm), DiaColloDB(3pm), perl(1), ...