DiaColloDB::Document::TEI - diachronic collocation db, source document, TEI format
 ##========================================================================
 ## PRELIMINARIES

 use DiaColloDB::Document::TEI;

 ##========================================================================
 ## Constructors etc.

 $doc = CLASS_OR_OBJECT->new(%args);

 ##========================================================================
 ## API: I/O: parse

 $bool = $doc->fromFile($filename_or_fh, %opts);
DiaColloDB::Document::TEI provides a DiaColloDB::Document-compliant API for rudimentary parsing of corpus files in a TEI-like XML format. Input files must be pre-tokenized and represent tokens as <w> elements. Input files may also optionally encode sentence- and/or paragraph-boundaries using <s> and <p> elements, respectively. Fragmented nodes and TEI "linking" attributes are not supported. Although fairly flexible, this document parsing class is very slow and inefficient and is not recommended for production use.
DiaColloDB::Document::TEI inherits from DiaColloDB::Document and supports the DiaColloDB::Document API.
$doc = CLASS_OR_OBJECT->new(%args);
%args, object structure:
 ##-- parsing options
 tei_ns_$NS     => $uri,     ##-- register namespace URI prefix $NS for user-defined XPaths
 tei_date       => $xpath,   ##-- XPath for parsing document date, relative to document root
 tei_meta_$ATTR => $xpath,   ##-- XPath for parsing meta-attribute $ATTR, relative to document root
 tei_word_$ATTR => $xpath,   ##-- XPath for parsing token-attribute $ATTR, relative to //w element
 tei_break_$BRK => $xpath,   ##-- XPath for parsing break nodes, relative to document root
 tei_eos        => $break,   ##-- default break-level (default: 's')
 ##
 ##-- document data
 date   => $date,            ##-- year
 tokens => \@tokens,         ##-- tokens, including undef for EOS
 meta   => \%meta,           ##-- document metadata (e.g. author, title, collection, ...)
Each token in @tokens is a HASH-ref {w=>$word,p=>$pos,l=>$lemma,...}, or undef for EOS.
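The token list layout described above can be walked with a few lines of plain Perl. The following is a minimal sketch, not part of the module: the helper name split_sentences and the sample data are hypothetical, and it assumes only what is stated above, namely that tokens are HASH-refs and that undef marks an end-of-sentence (EOS).

```perl
use strict;
use warnings;

# Hypothetical sample data in the documented shape:
# HASH-refs for tokens, undef for each end-of-sentence (EOS) marker.
my @tokens = (
    { w => 'This', p => 'DT',   l => 'this' },
    { w => 'is',   p => 'VBZ',  l => 'be'   },
    { w => 'a',    p => 'DT',   l => 'a'    },
    { w => 'test', p => 'NN',   l => 'test' },
    { w => '.',    p => 'SENT', l => '.'    },
    undef,
    { w => 'This', p => 'DT',   l => 'this' },
    { w => 'is',   p => 'VBZ',  l => 'be'   },
    { w => 'only', p => 'RB',   l => 'only' },
    { w => 'a',    p => 'DT',   l => 'a'    },
    { w => 'test', p => 'NN',   l => 'test' },
    { w => '.',    p => 'SENT', l => '.'    },
    undef,
);

# split_sentences(\@tokens): hypothetical helper (not a DiaColloDB method).
# Groups tokens into ARRAY-refs, starting a new group at each undef.
sub split_sentences {
    my ($tokens) = @_;
    my (@sents, @cur);
    for my $tok (@$tokens) {
        if (defined $tok) {
            push @cur, $tok;
        }
        elsif (@cur) {
            push @sents, [@cur];
            @cur = ();
        }
    }
    push @sents, [@cur] if @cur;    # last sentence without a trailing EOS
    return @sents;
}

my @sents = split_sentences(\@tokens);
for my $i (0 .. $#sents) {
    my @lemmata = map { $_->{l} } @{ $sents[$i] };
    print "sentence ", $i + 1, ": @lemmata\n";
}
```

Consecutive undef entries are collapsed by the helper, so no empty sentences are returned.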
Default options:
 ##-- Namespaces
 tei_ns_tei => "http://www.tei-c.org/ns/1.0",
 ##
 ##-- Metadata XPaths
 tei_date           => 'teiHeader/fileDesc/publicationStmt/date',
 tei_meta_title     => 'teiHeader/fileDesc/titleStmt/title',
 tei_meta_author    => 'teiHeader/fileDesc/titleStmt/author',
 tei_meta_textClass => 'teiHeader/fileDesc/profileDesc/textClass/classCode',
 ##
 ##-- Token Attribute XPaths
 tei_word_w => 'text()',
 tei_word_l => '@lemma',
 tei_word_p => '@type',
 ##
 ##-- Break-Level XPaths
 tei_break_s    => '//text//s',
 tei_break_p    => '//text//p',
 tei_break_div  => '//text//div',
 tei_break_page => '//text//pb',
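The defaults above can presumably be overridden or extended by passing keys of the same form to new(). The sketch below is an illustration under stated assumptions: the publisher XPath, the '@n' word attribute, and the helper unknown_tei_opts are hypothetical and not part of the module. The helper only checks option names against the key patterns documented above, and the constructor is called only if the module is actually installed.

```perl
use strict;
use warnings;

# Hypothetical overrides following the documented tei_* key patterns.
my %args = (
    tei_meta_publisher => 'teiHeader/fileDesc/publicationStmt/publisher',  # assumed XPath
    tei_word_n         => '@n',   # assumed extra token attribute
    tei_eos            => 'p',    # use paragraph breaks as the default break-level
);

# unknown_tei_opts(%opts): hypothetical helper (not a DiaColloDB function).
# Returns the sorted option names that match none of the documented patterns.
sub unknown_tei_opts {
    my (%opts) = @_;
    my @unknown = grep {
        !/^tei_(?:date|eos)$/ && !/^tei_(?:ns|meta|word|break)_\w+$/
    } sort keys %opts;
    return @unknown;
}

my @unknown = unknown_tei_opts(%args);
warn "unrecognized option(s): @unknown\n" if @unknown;

# Construct a document object only if the module is available.
if (eval { require DiaColloDB::Document::TEI; 1 }) {
    my $doc = DiaColloDB::Document::TEI->new(%args);
    print "created ", ref($doc), " object\n";
}
```

Checking names up front is a convenience for catching typos such as tei_wrd_n; nothing above indicates that the module itself rejects unknown keys.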
$bool = $doc->fromFile($filename_or_fh, %opts);
Parses tokens from $filename_or_fh, which may be a filename or an open filehandle. Any %opts given clobber the corresponding keys of %$doc.
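A typical call sequence might look like the following sketch. It assumes that fromFile() returns false on failure, as the $bool in the signature above suggests, and it reads only the documented date and tokens keys. The helper summarize is hypothetical, and the input filename is taken from the command line.

```perl
use strict;
use warnings;

# summarize($doc): hypothetical helper (not a DiaColloDB method).
# Counts tokens and EOS markers using only the documented object keys.
sub summarize {
    my ($doc) = @_;
    my @toks = @{ $doc->{tokens} // [] };
    my $ntok = grep { defined $_ }  @toks;   # real tokens (HASH-refs)
    my $neos = grep { !defined $_ } @toks;   # undef entries mark EOS
    return sprintf("date=%s tokens=%d eos=%d", $doc->{date} // '?', $ntok, $neos);
}

# Parse the file named on the command line, if the module is installed.
my $file = shift @ARGV;
if (defined $file && eval { require DiaColloDB::Document::TEI; 1 }) {
    my $doc = DiaColloDB::Document::TEI->new();
    $doc->fromFile($file)
        or die "failed to parse TEI file '$file'\n";
    print summarize($doc), "\n";
}
```

Run without an argument, or without the module installed, the script does nothing.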
The following is an example file in the format accepted by this module:
 <?xml version="1.0" encoding="UTF-8"?>
 <TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
     <fileDesc>
       <titleStmt>
         <title>test document</title>
         <author>Jurish, Bryan</author>
       </titleStmt>
       <publicationStmt>
         <date>2016-02-25</date>
       </publicationStmt>
       <profileDesc>
         <textClass>
           <classCode>dummy</classCode>
           <classCode>test-data</classCode>
         </textClass>
       </profileDesc>
     </fileDesc>
   </teiHeader>
   <text>
     <p>
       <s>
         <w type="DT" lemma="this">This</w>
         <w type="VBZ" lemma="be">is</w>
         <w type="DT" lemma="a">a</w>
         <w type="NN" lemma="test">test</w>
         <w type="SENT" lemma=".">.</w>
       </s>
       <s>
         <w type="DT" lemma="this">This</w>
         <w type="VBZ" lemma="be">is</w>
         <w type="RB" lemma="only">only</w>
         <w type="DT" lemma="a">a</w>
         <w type="NN" lemma="test">test</w>
         <w type="SENT" lemma=".">.</w>
       </s>
     </p>
     <p>
       <s>
         <w type="DT" lemma="this">This</w>
         <w type="VBZ" lemma="be">is</w>
         <w type="RB" lemma="still">still</w>
         <w type="DT" lemma="a">a</w>
         <w type="NN" lemma="test">test</w>
         <w type="SENT" lemma=".">.</w>
       </s>
     </p>
   </text>
 </TEI>
Bryan Jurish <moocow@cpan.org>
Copyright (C) 2016-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
DiaColloDB::Document(3pm), DiaColloDB(3pm), perl(1), ...