NAME

DiaColloDB::Document::TCF - diachronic collocation db, source document, TCF format

SYNOPSIS

 ##========================================================================
 ## PRELIMINARIES
 
 use DiaColloDB::Document::TCF;
 
 ##========================================================================
 ## Constructors etc.
 
 $doc = CLASS_OR_OBJECT->new(%args);
 
 ##========================================================================
 ## API: I/O: parse
 
 $bool = $doc->fromFile($filename_or_fh, %opts);
 

DESCRIPTION

DiaColloDB::Document::TCF provides a DiaColloDB::Document-compliant API for parsing corpus files in the CLARIN-D TCF format. It currently supports only the following TCF layers: tokens, sentences, POStags, and lemmas.

Globals & Constants

Variable: @ISA

DiaColloDB::Document::TCF inherits from DiaColloDB::Document and supports the DiaColloDB::Document API.

Constructors etc.

new
 $doc = CLASS_OR_OBJECT->new(%args);

%args, object structure:

 ##-- document data
 date   =>$date,     ##-- year
 tokens =>\@tokens,  ##-- tokens, including undef for EOS
 meta   =>\%meta,    ##-- document metadata (e.g. author, title, collection, ...)
                     ##   + parsed from /D-Spin/MetaData/source[@type] for $type =~ /^meta:ATTR/

Each token in @tokens is a HASH-ref {w=>$word,p=>$pos,l=>$lemma,...}, or undef for EOS.
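
As a minimal sketch of working with this structure, the following groups a token list into sentences by splitting on the undef EOS markers. The sample data and the helper name sentences() are illustrative assumptions and not part of DiaColloDB. Whether undef precedes or follows each sentence is also not stated above, so the helper tolerates both.

```perl
use strict;
use warnings;

## Hypothetical sample data in the documented shape:
## HASH-refs with w (word), p (POS tag), l (lemma); undef marks EOS
my @tokens = (
  {w=>'This', p=>'DT',   l=>'this'},
  {w=>'is',   p=>'VBZ',  l=>'be'},
  {w=>'a',    p=>'DT',   l=>'a'},
  {w=>'test', p=>'NN',   l=>'test'},
  {w=>'.',    p=>'SENT', l=>'.'},
  undef,
  {w=>'This', p=>'DT',   l=>'this'},
  {w=>'is',   p=>'VBZ',  l=>'be'},
  {w=>'only', p=>'RB',   l=>'only'},
  {w=>'a',    p=>'DT',   l=>'a'},
  {w=>'test', p=>'NN',   l=>'test'},
  {w=>'.',    p=>'SENT', l=>'.'},
  undef,
);

## sentences(\@tokens): illustrative helper (NOT part of DiaColloDB)
## + returns a list of ARRAY-refs, one per non-empty sentence
sub sentences {
  my ($toks) = @_;
  my @sents = ([]);
  foreach my $tok (@$toks) {
    if (defined $tok)      { push @{$sents[-1]}, $tok; }  ##-- ordinary token
    elsif (@{$sents[-1]})  { push @sents, []; }           ##-- EOS: open a new sentence
  }
  pop @sents if !@{$sents[-1]};                           ##-- drop trailing empty sentence
  return @sents;
}

my @sents = sentences(\@tokens);
print join(' ', map { $_->{l} } @$_), "\n" foreach @sents;
```

Each line printed holds the lemmata of one sentence; replacing $_->{l} with $_->{w} or $_->{p} selects surface forms or POS tags instead.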

API: I/O: parse

fromFile
 $bool = $doc->fromFile($filename_or_fh, %opts);

Parses tokens from $filename_or_fh. Any %opts given clobber (overwrite) the corresponding keys of %$doc.

Metadata are parsed from any TCF MetaData/source elements whose @type attribute begins with the string "meta:".
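
The selection rule can be sketched in plain Perl as follows. This is not the module's implementation, which works on the parsed XML; the input pairs are hypothetical, and storing each value under the attribute name with the "meta:" prefix stripped is an assumption.

```perl
use strict;
use warnings;

## Hypothetical [@type, text] pairs as they might be read from MetaData/source elements
my @sources = (
  ['meta:collection', 'tiny'],
  ['meta:author',     'Jurish, Bryan'],
  ['meta:date',       '2016'],
  ['other:note',      'not metadata'],  ##-- hypothetical type without the "meta:" prefix
);

my %meta;
foreach my $src (@sources) {
  my ($type, $text) = @$src;
  next unless $type =~ /^meta:(.+)$/;   ##-- keep only types beginning with "meta:"
  $meta{$1} = $text;                    ##-- ASSUMPTION: key is the type minus the prefix
}

printf "%s=%s\n", $_, $meta{$_} foreach sort keys %meta;
```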

EXAMPLE

The following is an example file in the format accepted by this module:

 <?xml version="1.0" encoding="UTF-8"?>
 <D-Spin xmlns="http://www.dspin.de/data">
   <MetaData xmlns="http://www.dspin.de/data/metadata">
     <source type="meta:collection">tiny</source>
     <source type="meta:author">Jurish, Bryan</source>
     <source type="meta:genre">dummy</source>
     <source type="meta:date">2016</source>
     <source type="meta:date_">2016-02-25</source>
     <source type="meta:textClass">dummy:test-data</source>
     <source type="meta:title">test document</source>
   </MetaData>
   <TextCorpus xmlns="http://www.dspin.de/data/textcorpus">
     <tokens>
       <token ID="w0">This</token>
       <token ID="w1">is</token>
       <token ID="w2">a</token>
       <token ID="w3">test</token>
       <token ID="w4">.</token>
       <token ID="w5">This</token>
       <token ID="w6">is</token>
       <token ID="w7">only</token>
       <token ID="w8">a</token>
       <token ID="w9">test</token>
       <token ID="w10">.</token>
       <token ID="w11">This</token>
       <token ID="w12">is</token>
       <token ID="w13">still</token>
       <token ID="w14">a</token>
       <token ID="w15">test</token>
       <token ID="w16">.</token>
     </tokens>
     <sentences>
       <sentence ID="s0" tokenIDs="w0 w1 w2 w3 w4"/>
       <sentence ID="s1" tokenIDs="w5 w6 w7 w8 w9 w10"/>
       <sentence ID="s2" tokenIDs="w11 w12 w13 w14 w15 w16"/>
     </sentences>
     <lemmas>
       <lemma tokenIDs="w0">this</lemma>
       <lemma tokenIDs="w1">be</lemma>
       <lemma tokenIDs="w2">a</lemma>
       <lemma tokenIDs="w3">test</lemma>
       <lemma tokenIDs="w4">.</lemma>
       <lemma tokenIDs="w5">this</lemma>
       <lemma tokenIDs="w6">be</lemma>
       <lemma tokenIDs="w7">only</lemma>
       <lemma tokenIDs="w8">a</lemma>
       <lemma tokenIDs="w9">test</lemma>
       <lemma tokenIDs="w10">.</lemma>
       <lemma tokenIDs="w11">this</lemma>
       <lemma tokenIDs="w12">be</lemma>
       <lemma tokenIDs="w13">still</lemma>
       <lemma tokenIDs="w14">a</lemma>
       <lemma tokenIDs="w15">test</lemma>
       <lemma tokenIDs="w16">.</lemma>
     </lemmas>
     <POStags>
       <tag tokenIDs="w0">DT</tag>
       <tag tokenIDs="w1">VBZ</tag>
       <tag tokenIDs="w2">DT</tag>
       <tag tokenIDs="w3">NN</tag>
       <tag tokenIDs="w4">SENT</tag>
       <tag tokenIDs="w5">DT</tag>
       <tag tokenIDs="w6">VBZ</tag>
       <tag tokenIDs="w7">RB</tag>
       <tag tokenIDs="w8">DT</tag>
       <tag tokenIDs="w9">NN</tag>
       <tag tokenIDs="w10">SENT</tag>
       <tag tokenIDs="w11">DT</tag>
       <tag tokenIDs="w12">VBZ</tag>
       <tag tokenIDs="w13">RB</tag>
       <tag tokenIDs="w14">DT</tag>
       <tag tokenIDs="w15">NN</tag>
       <tag tokenIDs="w16">SENT</tag>
     </POStags>
   </TextCorpus>
 </D-Spin>
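
A file like the one above might be loaded roughly as follows. This is a sketch that uses only the constructor, the fromFile() method, and the object keys documented in this manual page; the wrapper name parse_tcf() and the filename tiny.tcf.xml are hypothetical. The module is loaded lazily, so the script does nothing if DiaColloDB is not installed or the file is missing.

```perl
use strict;
use warnings;

## parse_tcf($file): illustrative wrapper (hypothetical name) around the documented API
## + returns the parsed document object, or undef on any failure
sub parse_tcf {
  my ($file) = @_;
  ##-- load lazily so the sketch degrades gracefully without DiaColloDB
  eval { require DiaColloDB::Document::TCF; 1 } or return undef;
  my $doc = DiaColloDB::Document::TCF->new();
  my $ok  = eval { $doc->fromFile($file) };  ##-- trap exceptions as well as false returns
  return $ok ? $doc : undef;
}

my $file = 'tiny.tcf.xml';  ##-- hypothetical filename
if (-e $file && (my $doc = parse_tcf($file))) {
  my $ntok = grep { defined } @{$doc->{tokens}};  ##-- count tokens, skipping undef EOS markers
  print "date: $doc->{date}\n" if defined $doc->{date};
  print "tokens: $ntok\n";
  print "meta.$_: $doc->{meta}{$_}\n" foreach sort keys %{$doc->{meta}};
}
```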

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2016-2020 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

DiaColloDB::Document(3pm), DiaColloDB(3pm), perl(1), ...