NAME

DiaColloDB::Document::TEI - diachronic collocation db, source document, TEI format

SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use DiaColloDB::Document::TEI;
 
 ##========================================================================
 ## Constructors etc.
 
 $doc = CLASS_OR_OBJECT->new(%args);
 
 ##========================================================================
 ## API: I/O: parse
 
 $bool = $doc->fromFile($filename_or_fh, %opts);
 

DESCRIPTION

DiaColloDB::Document::TEI provides a DiaColloDB::Document-compliant API for rudimentary parsing of corpus files in a TEI-like XML format. Input files must be pre-tokenized and must represent tokens as <w> elements. They may also encode sentence and/or paragraph boundaries using <s> and <p> elements, respectively. Fragmented nodes and TEI "linking" attributes are not supported. Although fairly flexible, this document parsing class is very slow and inefficient, and is not recommended for production use.

Globals & Constants

Variable: @ISA

DiaColloDB::Document::TEI inherits from DiaColloDB::Document and supports the DiaColloDB::Document API.

Constructors etc.

new
 $doc = CLASS_OR_OBJECT->new(%args);

%args, object structure:

 ##-- parsing options
 tei_ns_$NS     => $uri,   ##-- register namespace prefix $NS (bound to URI $uri) for user-defined XPaths
 tei_date       => $xpath, ##-- XPath for parsing document date, relative to document root
 tei_meta_$ATTR => $xpath, ##-- XPath for parsing meta-attribute $ATTR, relative to document root
 tei_word_$ATTR => $xpath, ##-- XPath for parsing token-attribute $ATTR, relative to //w element
 tei_break_$BRK => $xpath, ##-- XPath for parsing break nodes, relative to document root
 tei_eos        => $break, ##-- default break-level (default: 's')
 ##
 ##-- document data
 date   =>$date,     ##-- year
 tokens =>\@tokens,  ##-- tokens, including undef for EOS
 meta   =>\%meta,    ##-- document metadata (e.g. author, title, collection, ...)

Each token in @tokens is a HASH-ref {w=>$word,p=>$pos,l=>$lemma,...}, or undef for EOS.
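
The EOS convention can be illustrated with a short sketch. The token list below is hand-written for illustration (it is not actual parser output), and split_sentences() is a hypothetical helper, not part of the DiaColloDB API. It groups a flat token list into sentences at each undef marker.

 use strict;
 use warnings;
 
 ##-- hand-written sample in the documented structure (not real parser output)
 my @tokens = (
   {w=>'This', p=>'DT',   l=>'this'},
   {w=>'is',   p=>'VBZ',  l=>'be'},
   {w=>'.',    p=>'SENT', l=>'.'},
   undef,                              ##-- EOS
   {w=>'Test', p=>'NN',   l=>'test'},
   {w=>'.',    p=>'SENT', l=>'.'},
   undef,                              ##-- EOS
 );
 
 ##-- hypothetical helper: group tokens into sentences at undef markers
 sub split_sentences {
   my @sents = ([]);
   foreach my $tok (@_) {
     if (defined $tok)     { push(@{$sents[-1]}, $tok); }
     elsif (@{$sents[-1]}) { push(@sents, []); }  ##-- close a non-empty sentence
   }
   pop(@sents) if (!@{$sents[-1]});  ##-- drop trailing empty sentence
   return @sents;
 }
 
 ##-- print one sentence per line as word/pos pairs
 foreach my $sent (split_sentences(@tokens)) {
   print join(' ', map {"$_->{w}/$_->{p}"} @$sent), "\n";
 }

Consecutive undef markers are collapsed here, so the helper never returns an empty sentence. Whether the parser can emit such runs is not stated in this document.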

Default options:

 ##-- Namespaces
 tei_ns_tei         => "http://www.tei-c.org/ns/1.0",
 ##
 ##-- Metadata XPaths
 tei_date           => 'teiHeader/fileDesc/publicationStmt/date',
 tei_meta_title     => 'teiHeader/fileDesc/titleStmt/title',
 tei_meta_author    => 'teiHeader/fileDesc/titleStmt/author',
 tei_meta_textClass => 'teiHeader/fileDesc/profileDesc/textClass/classCode',
 ##
 ##-- Token Attribute XPaths
 tei_word_w         => 'text()',
 tei_word_l         => '@lemma',
 tei_word_p         => '@type',
 ##
 ##-- Break-Level XPaths
 tei_break_s        => '//text//s',
 tei_break_p        => '//text//p',
 tei_break_div      => '//text//div',
 tei_break_page     => '//text//pb',
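
Any of these defaults can be overridden by passing the corresponding key to new(). The following is a minimal sketch for a hypothetical corpus that stores part-of-speech tags in a pos attribute and names its publisher in the TEI header. The pos attribute, the publisher meta-attribute, and its XPath are assumptions made for illustration; they are not module defaults.

 use DiaColloDB::Document::TEI;
 
 my $doc = DiaColloDB::Document::TEI->new(
   ##-- assumed: PoS tags live in @pos rather than @type
   tei_word_p         => '@pos',
   ##-- assumed: additional meta-attribute "publisher"
   tei_meta_publisher => 'teiHeader/fileDesc/publicationStmt/publisher',
   ##-- use paragraph breaks (tei_break_p) as the default break-level
   tei_eos            => 'p',
 );

Keys that are not passed keep the default values listed above.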

API: I/O: parse

fromFile
 $bool = $doc->fromFile($filename_or_fh, %opts);

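Parses tokens from $filename_or_fh, which may be either a filename or an open filehandle. Any %opts given clobber the corresponding fields of %$doc. Returns a true value on success.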
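
A minimal usage sketch, assuming a file named test.tei.xml (a hypothetical name) in the format shown under EXAMPLE. It reads the parsed data directly from the hash keys documented for new(). Whether the class also offers accessor methods for these fields is not covered here.

 use strict;
 use warnings;
 use DiaColloDB::Document::TEI;
 
 my $file = 'test.tei.xml';  ##-- hypothetical input file
 my $doc  = DiaColloDB::Document::TEI->new();
 $doc->fromFile($file)
   or die("failed to parse $file");
 
 ##-- document date, if one was found
 printf("date: %s\n", $doc->{date} // 'unknown');
 
 ##-- one token per line as word, PoS, lemma; blank line at each EOS
 foreach my $tok (@{$doc->{tokens}}) {
   if (defined $tok) {
     print join("\t", map { $_ // '' } @$tok{qw(w p l)}), "\n";
   } else {
     print "\n";
   }
 }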

EXAMPLE

The following is an example file in the format accepted by this module:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
     <fileDesc>
       <titleStmt>
         <title>test document</title>
         <author>Jurish, Bryan</author>
       </titleStmt>
       <publicationStmt>
         <date>2016-02-25</date>
       </publicationStmt>
       <profileDesc>
         <textClass>
           <classCode>dummy</classCode>
           <classCode>test-data</classCode>
         </textClass>
       </profileDesc>
     </fileDesc>
   </teiHeader>
   <text>
     <p>
       <s>
        <w type="DT" lemma="this">This</w>
        <w type="VBZ" lemma="be">is</w>
        <w type="DT" lemma="a">a</w>
        <w type="NN" lemma="test">test</w>
        <w type="SENT" lemma=".">.</w>
       </s>
       <s>
        <w type="DT" lemma="this">This</w>
        <w type="VBZ" lemma="be">is</w>
        <w type="RB" lemma="only">only</w>
        <w type="DT" lemma="a">a</w>
        <w type="NN" lemma="test">test</w>
        <w type="SENT" lemma=".">.</w>
       </s>
     </p>
     <p>
       <s>
        <w type="DT" lemma="this">This</w>
        <w type="VBZ" lemma="be">is</w>
        <w type="RB" lemma="still">still</w>
        <w type="DT" lemma="a">a</w>
        <w type="NN" lemma="test">test</w>
        <w type="SENT" lemma=".">.</w>
       </s>
     </p>
   </text>
 </TEI>

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2016 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

DiaColloDB::Document(3pm), DiaColloDB(3pm), perl(1), ...