NAME

DiaColloDB::Document - diachronic collocation db, source document (base class)

SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use DiaColloDB::Document;
 
 ##========================================================================
 ## Constructors etc.
 
 $doc = CLASS_OR_OBJECT->new(%args);
 
 ##========================================================================
 ## API: I/O
 
 $bool = $doc->fromFile($filename_or_fh);
 $label = $doc->label();
 

DESCRIPTION

DiaColloDB::Document provides an abstract base class for corpus documents from which a DiaColloDB database can be created. Support for alternative corpus formats can be added by implementing a DiaColloDB::Document subclass for each required format.

Globals & Constants

Variable: @ISA

DiaColloDB::Document inherits from DiaColloDB::Logger.

Constructors etc.

new
 $doc = CLASS_OR_OBJECT->new(%args);

%args, object structure:

 label  => $label,   ##-- document label (e.g. filename; optional)
 date   => $date,    ##-- year
 tokens => \@tokens, ##-- tokens, including undef for EOS
 meta   => \%meta,   ##-- document metadata (e.g. author, title, collection, ...)

Each token in @tokens is one of the following:

 undef              : EOS (default, for collocation profiling)
 a HASH-ref         : normal token: {w=>$word,p=>$pos,l=>$lemma,...}
 a string "#BREAK"  : block boundary / "break" of type BREAK, e.g. "#s": sentence-break, "#p": paragraph-break, ...
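
As an illustration of the three token kinds, the following is a minimal sketch that walks a @tokens array and groups lemmata by sentence. The sample data and the helper split_sentences() are hypothetical and not part of DiaColloDB; the sketch only assumes the token conventions listed above.

```perl
use strict;
use warnings;

# Hypothetical sample data following the token conventions described above.
my @tokens = (
    { w => 'The',    p => 'DT',  l => 'the'   },
    { w => 'cat',    p => 'NN',  l => 'cat'   },
    { w => 'sleeps', p => 'VBZ', l => 'sleep' },
    undef,                                    # EOS
    '#p',                                     # paragraph break
    { w => 'Dogs',   p => 'NNS', l => 'dog'   },
    { w => 'bark',   p => 'VBP', l => 'bark'  },
    undef,                                    # EOS
);

# split_sentences(\@tokens): hypothetical helper, not part of DiaColloDB.
# Returns a list of ARRAY-refs of lemmata, one per sentence.
sub split_sentences {
    my ($tokens) = @_;
    my (@sentences, @current);
    for my $tok (@$tokens) {
        if (!defined $tok) {                  # undef: EOS closes the current sentence
            push @sentences, [@current] if @current;
            @current = ();
        }
        elsif (ref($tok) eq 'HASH') {         # HASH-ref: normal token
            push @current, $tok->{l};
        }
        # anything else is a plain string such as "#p" (block boundary); ignored here
    }
    push @sentences, [@current] if @current;  # flush tokens after the last EOS, if any
    return @sentences;
}

my @sentences = split_sentences(\@tokens);
print join(' ', @$_), "\n" for @sentences;
```

Testing definedness first and ref() second keeps the three cases distinct without inspecting the content of break strings.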

API: I/O

fromFile
 $bool = $doc->fromFile($filename_or_fh);

Parses tokens from $filename_or_fh into $doc. Subclasses implement this method for their respective input format.

label
 $label = $doc->label();

Returns a string label for $doc; the default just returns "$doc".

SUBCLASSES

The DiaColloDB distribution provides the following built-in DiaColloDB::Document subclasses:

DiaColloDB::Document::DDCTabs

Full support for DDC tab-dump files as produced by ddc_dump --full --tabs; see http://odo.dwds.de/~moocow/software/ddc/ddc_tabs.html.

DiaColloDB::Document::JSON

Supports input files in JSON format, assuming the stored JSON data maps 1:1 onto the required DiaColloDB::Document structure described above under new().
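
To make the 1:1 mapping concrete, here is a hedged sketch that serializes a document structure with the core JSON::PP module and reads it back. It assumes that the mapping means a top-level JSON object with the keys label, date, tokens and meta, and that EOS (undef) is stored as JSON null; the sample values are invented, and the exact file layout expected by DiaColloDB::Document::JSON should be checked against that module's own documentation.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical document data using the keys described under new().
my %doc = (
    label  => 'example.json',
    date   => 1900,
    tokens => [
        { w => 'Hello', p => 'UH', l => 'hello' },
        { w => 'world', p => 'NN', l => 'world' },
        undef,                                  # EOS; encoded as JSON null
        '#s',                                   # sentence break
    ],
    meta   => { author => 'Anonymous', title => 'Example' },
);

# canonical() sorts object keys so that the output is reproducible.
my $json = JSON::PP->new->canonical->pretty;
my $text = $json->encode(\%doc);
print $text;

# Decoding restores the same structure; JSON null becomes undef again.
my $back = $json->decode($text);
```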

DiaColloDB::Document::TCF

Basic handling for input files in CLARIN-D TCF format as used by WebLicht.

DiaColloDB::Document::TEI

Rudimentary handling for TEI-like XML input files, which must at least include token boundaries encoded as <w> elements.

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2020 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

DiaColloDB::Document::DDCTabs(3pm), DiaColloDB::Document::JSON(3pm), DiaColloDB::Document::TCF(3pm), DiaColloDB::Document::TEI(3pm), DiaColloDB::Corpus(3pm), DiaColloDB(3pm), perl(1), ...