DTA::TokWrap::Processor::tok2xml::perl - DTA tokenizer wrappers: t -> t.xml, pure-perl (slow, obsolete)
use DTA::TokWrap::Processor::tok2xml::perl;
$t2x = DTA::TokWrap::Processor::tok2xml::perl->new(%opts);
$doc_or_undef = $t2x->tok2xml($doc);
This module is deprecated; prefer DTA::TokWrap::Processor::tok2xml.
DTA::TokWrap::Processor::tok2xml::perl provides a pure-perl object-oriented DTA::TokWrap::Processor wrapper for converting "raw" CSV-format (.t) low-level tokenizer output to a "master" tokenized XML (.t.xml) format, for use with DTA::TokWrap::Document objects.
Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.
DTA::TokWrap::Processor::tok2xml::perl inherits from DTA::TokWrap::Processor, and supports basically the same API as DTA::TokWrap::Processor::tok2xml.
Integer indicating a missing or implicit 'c' record; should be equivalent in value to the C code:
unsigned int NOC = ((unsigned int)-1)
for 32-bit "unsigned int"s.
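As a minimal sketch (not taken from the module source), the value described above can be reproduced in Perl as follows. The constant name NOC_SKETCH is hypothetical and is used here only for illustration.

```perl
use strict;
use warnings;

## Hypothetical name; mirrors the C expression ((unsigned int)-1) for 32-bit unsigned ints
use constant NOC_SKETCH => 0xFFFFFFFF;

## The same value, obtained by reinterpreting a signed 32-bit -1 as unsigned 32-bit
my ($wrapped) = unpack('L', pack('l', -1));

printf "%u %u\n", NOC_SKETCH, $wrapped;
```

Both expressions yield the largest value representable in 32 unsigned bits, which is why it can serve as an out-of-band marker in a 32-bit index vector.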
$t2x = $CLASS_OR_OBJECT->new(%args);
Constructor.
%args, %$t2x:
##-- output document structure
docElt   => $elt,   ##-- output document element
sElt     => $elt,   ##-- output sentence element
wElt     => $elt,   ##-- output token element
aElt     => $elt,   ##-- output token-analysis element
posAttr  => $attr,  ##-- output byte-position attribute
textAttr => $attr,  ##-- output token-text attribute
You probably should NOT change any of the default output document structure options (unless this is the final module in your processing pipeline), since their values have ramifications beyond this module.
%defaults = CLASS->defaults();
Static class-dependent defaults.
$doc_or_undef = $CLASS_OR_OBJECT->tok2xml($doc);
Converts "raw" CSV-format (.t) low-level tokenizer output to a "master" tokenized XML (.t.xml) format in the DTA::TokWrap::Document object $doc.
Relevant %$doc keys:
bxdata   => \@bxdata,    ##-- (input) block index data
tokdata  => $tokdata,    ##-- (input) tokenizer output data (string)
cxdata   => \@cxchrs,    ##-- (input) character index data (array of arrays)
cxfile   => $cxfile,     ##-- (input) character index file
xtokdata => $xtokdata,   ##-- (output) tokenizer output as XML
nchrs    => $nchrs,      ##-- (output) number of character index records
ntoks    => $ntoks,      ##-- (output) number of tokens parsed
##
tok2xml_stamp0 => $f,    ##-- (output) timestamp of operation begin
tok2xml_stamp  => $f,    ##-- (output) timestamp of operation end
xtokdata_stamp => $f,    ##-- (output) timestamp of operation end
%$t2x keys (temporary, for debugging):
tb2ci => $tb2ci,  ##-- (temp) s.t. vec($tb2ci, $txbyte, 32) = $char_index_of_txbyte
ntb   => $ntb,    ##-- (temp) number of text bytes
May implicitly call $doc->mkbx(), $doc->loadCxFile(), and $doc->tokenize() if the required input keys are missing (but shouldn't have to!).
\$tb2ci = $t2x->txbyte_to_ci(\@cxdata);
Low-level utility method.
Sets %$t2x keys: tb2ci, ntb, nchr
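The following sketch illustrates how a packed byte-to-character-index vector of the kind stored in tb2ci can be built and queried with Perl's built-in vec(). It is not the module's implementation: the input array @char_byte_lens is hypothetical and much simpler than real @cxdata records, which this sketch does not attempt to parse.

```perl
use strict;
use warnings;

## Hypothetical character index: byte length of each character in the text stream
my @char_byte_lens = (1, 2, 1, 3);   ##-- e.g. UTF-8 lengths of 'a', 'ä', 'b', '€'

## Build a packed vector s.t. vec($tb2ci, $byte, 32) = index of the character containing $byte
my $tb2ci = '';
my $ntb   = 0;                        ##-- running number of text bytes
foreach my $ci (0..$#char_byte_lens) {
  foreach (1..$char_byte_lens[$ci]) {
    vec($tb2ci, $ntb++, 32) = $ci;
  }
}

## Look up the character index for every byte offset
my @lookup = map { vec($tb2ci, $_, 32) } (0..$ntb-1);
print "ntb=$ntb; ci=@lookup\n";
```

Storing the mapping in a single packed string costs 4 bytes per text byte, which is considerably more compact than a Perl array of integers for large documents.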
\$ob2ci = $t2x->txtbyte_to_ci(\@cxdata,\@bxdata,\$tb2ci);
Low-level utility method.
Sets %$t2x keys: ob2ci
\$tokxmlr = $t2x->process_tt_data($doc);
Actually populates $doc->{xtokdata} by parsing $doc->{tokdata}, referring to $t2x->{ob2ci} for character-index lookup.
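To give a rough idea of the kind of conversion involved, here is a simplified sketch. It is not the module's code and it skips the character-index lookup entirely. It assumes that each token line has the form TEXT, TAB, "OFFSET LENGTH", followed by optional TAB-separated analyses, and that a blank line ends a sentence. The function name tt_to_xml_sketch and the element and attribute names (sentences, s, w, a, b, t) are assumptions chosen for illustration, not confirmed defaults.

```perl
use strict;
use warnings;

## Escape XML-special characters in attribute values and text
sub xml_escape {
  my $s = shift;
  $s =~ s/&/&amp;/g;
  $s =~ s/</&lt;/g;
  $s =~ s/>/&gt;/g;
  $s =~ s/"/&quot;/g;
  return $s;
}

## Hypothetical helper: convert tab-separated token data to a small XML string.
## Assumed input: "TEXT\tOFFSET LENGTH(\tANALYSIS)*" per token; a blank line ends a sentence.
sub tt_to_xml_sketch {
  my ($tokdata) = @_;
  my $xml   = "<sentences>\n";
  my $in_s  = 0;   ##-- true while a sentence element is open
  my $ntoks = 0;   ##-- number of tokens parsed
  foreach my $line (split /\n/, $tokdata) {
    if ($line eq '') {
      ##-- blank line: close the current sentence, if any
      $xml .= " </s>\n" if $in_s;
      $in_s = 0;
      next;
    }
    my ($text, $pos, @analyses) = split /\t/, $line;
    $pos = '' unless defined $pos;
    if (!$in_s) { $xml .= " <s>\n"; $in_s = 1; }
    $xml .= sprintf('  <w b="%s" t="%s"', xml_escape($pos), xml_escape($text));
    if (@analyses) {
      $xml .= ">\n" . join('', map { '   <a>' . xml_escape($_) . "</a>\n" } @analyses) . "  </w>\n";
    } else {
      $xml .= "/>\n";
    }
    ++$ntoks;
  }
  $xml .= " </s>\n" if $in_s;
  $xml .= "</sentences>\n";
  return ($xml, $ntoks);
}

my ($xml, $ntoks) = tt_to_xml_sketch("Hello\t0 5\n,\t5 1\t\$,\nworld\t7 5\n\nBye\t14 3\n");
print $xml, "ntoks=$ntoks\n";
```

The real method additionally translates byte offsets into character-index references via $t2x->{ob2ci}, which is the part this sketch leaves out.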
DTA::TokWrap::Processor::tok2xml(3pm), DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...
Bryan Jurish <moocow@cpan.org>
Copyright (C) 2009-2018 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.