
NAME

dtatw-files.perl - file formats used by dta-tokwrap utilities

SYNOPSIS

 FILENAME    (STATUS)   DESCRIPTION

 *.xml       (input)    input XML file in DTA "base-format"
 *.chr.xml   (input)    common convention for input files
 *.char.xml  (input)    another common convention for input files
 
 *.cx        (temp)     character index (CSV,TAB-separated)
 *.sx        (temp)     structure index (XML)
 *.tx        (temp)     text index (UTF-8 text)
 *.bx0       (temp)     preliminary "block index" (XML)
 *.bx        (temp)     block index (CSV,TAB-separated)
 *.txt       (temp)     serialized text (UTF-8 text)
 *.t         (temp)     tokenizer output (.tt, TAB-separated)
 *.cpx       (temp)     character+page index (CSV,TAB-separated)
 *.wpx       (temp)     word+page index (CSV,TAB-separated)
 
 *.t.xml     (output)   master serial XML output (XML)
 *.s.xml     (output)   sentence-level standoff (XML)
 *.w.xml     (output)   token-level standoff (XML)
 *.a.xml     (output)   token-analysis-level standoff (XML)
 
 *.u.xml     (output)   extended serial XML output (XML)
 *.cw.xml    (output)   base-format + tokens (XML)
 *.cws.xml   (output)   base-format + tokens + sentences (XML)

DESCRIPTION

This manual describes the file formats currently used by the dta-tokwrap utilities.

Input File Formats

*.xml

Alias(es): *.chr.xml, *.char.xml

Input XML file in DTA "base-format" (UTF-8-encoded XML with one <c> element per character):

  • input documents MUST be encoded in UTF-8,

  • all text nodes to be tokenized should be descendants of a <c> element which is itself a descendant of a <text> element (XPath //text//c//text()),

  • each input document should contain exactly one such <c> element for each logical character which may be passed to the tokenizer,

  • no <c> element may be a descendant of another <c> element, and

  • each <c> element should have a valid xml:id attribute.

Example:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:dta="http://www.deutsches-textarchiv.de/ns/1.0">
   <!-- ... -->
   <text>
     <!-- ... -->
     <c xml:id="c1"> </c>
     <c xml:id="c2">U</c>
     <c xml:id="c3">e</c>
     <c xml:id="c4">b</c>
     <c xml:id="c5">e</c>
     <c xml:id="c6">r</c>
     <c xml:id="c7"> </c>
     <c xml:id="c8">d</c>
     <c xml:id="c9">i</c>
     <c xml:id="c10">e</c>
     <c xml:id="c11"> </c>
     <!-- ... -->
   </text>
   <!-- ... -->
 </TEI>
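
The constraints listed above can be checked mechanically. The following Python sketch is not part of dta-tokwrap: the function name check_base_format is hypothetical, and the code assumes that the document uses the TEI namespace shown in the example. It only tests for missing or duplicate xml:id attributes and for nested <c> elements.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"                # assumed, as in the example
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # expanded name of xml:id


def check_base_format(xml_source):
    """Return a list of problem descriptions (empty if none were found)."""
    root = ET.fromstring(xml_source)
    text_tag = "{%s}text" % TEI_NS
    c_tag = "{%s}c" % TEI_NS

    # Collect every <c> below a <text>, visiting each element only once
    # even if <text> elements are nested.
    visited, c_elements = set(), []
    for text_elt in root.iter(text_tag):
        for c in text_elt.iter(c_tag):
            if id(c) not in visited:
                visited.add(id(c))
                c_elements.append(c)

    problems, seen_ids = [], set()
    for c in c_elements:
        cid = c.get(XML_ID)
        if cid is None:
            problems.append("<c> element without xml:id")
        elif cid in seen_ids:
            problems.append("duplicate xml:id %r" % cid)
        else:
            seen_ids.add(cid)
        # iter() yields the element itself first, so skip it.
        if any(d is not c for d in c.iter(c_tag)):
            problems.append("<c> %r contains a nested <c>" % cid)
    return problems
```

An empty return value means only that none of these particular checks failed; it does not guarantee that the tokwrap utilities will accept the file.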

Temporary File Formats

*.cx

Character index file (TAB-separated text) as created by dtatw-mkindex. Used for translating between byte offsets and xml:ids.

Example:

 %% <c>-element index generated by ../src/dtatw-mkindex
 %% Package: dta-tokwrap version 0.04 / svn+ssh://odo.dwds.de/home/svn/dev/dta-tokwrap/trunk @ 2445:2447
 %% Command-line: ../src/dtatw-mkindex 'xmlsrc/ex1.xml' 'ex1.cx' 'ex1.sx' 'ex1.tx'
 %%======================================================================
 %% $ID$        $XML_OFFSET$    $XML_LENGTH$    $TXT_OFFSET$    $TXT_LEN$       $TEXT$
 c1     276     20      0       1        
 c2     382     20      1       1       U
 c3     402     20      2       1       e
 c4     422     20      3       1       b
 c5     442     20      4       1       e
 c6     462     20      5       1       r
 c7     482     20      6       1        
 c8     502     20      7       1       d
 c9     522     20      8       1       i
 c10    542     21      9       1       e
 c11    563     21      10      1        
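
As an illustration of how such an index might be consumed, here is a minimal Python sketch. It is not taken from the dta-tokwrap sources; the names CxRecord, parse_cx and id_at_text_offset are hypothetical. It assumes that fields are separated by single TAB characters, that lines beginning with %% are comments, and that records appear in ascending $TXT_OFFSET$ order.

```python
from bisect import bisect_right
from typing import NamedTuple


class CxRecord(NamedTuple):
    cid: str
    xml_off: int
    xml_len: int
    txt_off: int
    txt_len: int
    text: str


def parse_cx(lines):
    """Parse the lines of a *.cx file into a list of CxRecord tuples."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")      # keep trailing blanks: $TEXT$ may be a space
        if not line or line.startswith("%%"):
            continue                    # skip empty lines and comments
        fields = line.split("\t")
        cid, xoff, xlen, toff, tlen = fields[:5]
        text = fields[5] if len(fields) > 5 else ""
        records.append(CxRecord(cid, int(xoff), int(xlen), int(toff), int(tlen), text))
    return records


def id_at_text_offset(records, offset):
    """Return the ID of the record whose text byte range covers offset, or None."""
    starts = [r.txt_off for r in records]   # assumed to be sorted ascending
    i = bisect_right(starts, offset) - 1
    if i >= 0 and offset < records[i].txt_off + records[i].txt_len:
        return records[i].cid
    return None
```

For repeated lookups the list of start offsets should be computed once rather than on every call; it is rebuilt here only to keep the sketch short.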

*.sx

Structure index (XML) as created by dtatw-mkindex. All elements matching the XPath //text//c|//text//lb have been removed, and each contiguous block of original c and lb elements has been replaced by a single placeholder c element. The placeholder elements have the form:

 <c n="XOFF XLEN TOFF TLEN"/>

where XOFF,XLEN are byte-offset and -length in the source XML file (*.xml) and TOFF,TLEN are byte-offset and -length in the raw text index file (*.tx).

Example:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:dta="http://www.deutsches-textarchiv.de/ns/1.0">
   <!-- ... -->
   <text>
      <titlePage>
        <c n="338 11 1 0"/>
        <docTitle>
          <c n="349 10 1 0"/>
          <titlePart type="main">
            <c n="359 23 1 0"/>
            <c n="382 1666 1 82"/>
          </titlePart>
          <c n="2048 12 83 0"/>
          <c n="2060 5 83 1"/>
        <!-- ... -->
      </titlePage>
   </text>
   <!-- ... -->
 </TEI>
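
The placeholder attribute can be decoded with a few lines of code. The Python sketch below is illustrative only; placeholder_blocks and block_text are hypothetical names. It assumes that the n attribute always holds exactly four whitespace-separated integers, as described above.

```python
import xml.etree.ElementTree as ET


def placeholder_blocks(sx_xml):
    """Return (xoff, xlen, toff, tlen) tuples for every <c n="..."/> placeholder."""
    blocks = []
    for elt in ET.fromstring(sx_xml).iter():
        # Compare local names only, so the namespace in use does not matter.
        if elt.tag.rsplit("}", 1)[-1] == "c" and elt.get("n") is not None:
            xoff, xlen, toff, tlen = (int(v) for v in elt.get("n").split())
            blocks.append((xoff, xlen, toff, tlen))
    return blocks


def block_text(tx_bytes, block):
    """Slice the raw text covered by one block out of the *.tx file contents."""
    _, _, toff, tlen = block
    # Offsets are byte offsets, so slice the bytes before decoding.
    return tx_bytes[toff:toff + tlen].decode("utf-8")
```

Slicing must be done on bytes rather than on a decoded string, since offsets and lengths count bytes and non-ASCII characters occupy more than one byte in UTF-8.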

*.tx

Raw, unserialized text index (UTF-8 text) as created by dtatw-mkindex. Results from concatenating all //text//c//text() nodes from the source document, and inserting newlines for //text//lb elements.

Example:

  Ueber die Beeinflussung
 einfacher psychischer Vorgänge
 durch einige Arzneimittel.
 Experimentelle Untersuchungen
 von
 Dr. Emil Kraepelin,
 Professor der Psychiatrie in Heidelberg.
 Mit einer Curventafel.
 Jena,
 Verlag von Gustav Fischer.
 1892.
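
The derivation described above can be approximated as follows. This Python sketch is a simplification and not the dtatw-mkindex implementation; the function name raw_text is hypothetical. It matches elements by local name only and ignores any special cases the real indexer may handle.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def raw_text(xml_source):
    """Concatenate //text//c//text() in document order, adding a newline per //text//lb."""
    out = []

    def walk(elt, in_text, in_c):
        name = local_name(elt.tag)
        if in_text and name == "lb":
            out.append("\n")                 # line break below <text>
        if in_text and name == "c":
            in_c = True                      # <c> below <text>
        if name == "text":
            in_text = True                   # applies to descendants only
        if in_c and elt.text:
            out.append(elt.text)
        for child in elt:
            walk(child, in_text, in_c)
            if in_c and child.tail:
                out.append(child.tail)       # text following a child, still inside <c>

    walk(ET.fromstring(xml_source), False, False)
    return "".join(out)
```

Characters outside of any <text> element, for example in the header, are skipped, which is why the output begins with the first <c> below <text>.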

*.bx0

Preliminary "block index" (XML) as created by "dta-tokwrap.perl -t mkbx0". Generated from the *.sx file by inserting zero or more "hints" of one of the following forms:

 <s/>    <!-- sentence-break hint -->
 <w/>    <!-- token-break hint    -->
 <lb/>   <!-- line-break hint     -->

Zero or more output elements may also be assigned a dta.tw.key attribute. Its value should be a unique key identifying the logical block or segment with which any text descended from that element is to be sorted during serialization (this is how seg elements get clumped together). dta.tw.key attributes are inherited by default.

Also note that namespaces have been forcibly removed from the XML structure.

Example:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI dta.tw.key="TEI.id2369102" _xmlns="http://www.tei-c.org/ns/1.0" xmlns_dta="http://www.deutsches-textarchiv.de/ns/1.0">
   <!-- ... -->
   <text>
      <titlePage>
        <s/>
        <c n="338 11 1 0"/>
        <docTitle>
          <c n="349 10 1 0"/>
          <titlePart type="main">
            <s/>
            <c n="359 23 1 0"/>
            <c n="382 1666 1 82"/>
            <s/>
          </titlePart>
          <c n="2048 12 83 0"/>
          <c n="2060 5 83 1"/>
        </docTitle>
      </titlePage>
   </text>
   <!-- ... -->
 </TEI>

*.bx

Block index (TAB-separated text) as created by "dta-tokwrap.perl -t mkbx". Used for translating between serialized-text (.txt) byte offsets and raw-text (.tx) byte offsets, which in turn gets us to c/@xml:ids. Still with me? Good.

Example:

 %% XML block list file generated by DTA::TokWrap::Document::saveBxFile() (DTA::TokWrap version 0.04)
 %% Original source file: ./xmlsrc/ex1.xml
 %%======================================================================
 %% $KEY$       $ELT$   $XML_OFFSET$    $XML_LENGTH$    $TX_OFFSET$     $TX_LEN$        $TXT_OFFSET$    $TXT_LEN$
 __ROOT__       __ROOT__        0       0       0       0       0       0
 TEI.id2406247  s       176     0       0       0       0       6
 TEI.id2406247  s       176     0       0       0       6       6
 TEI.id2406247  s       215     0       0       0       12      6
 TEI.id2406247  s       227     0       0       0       18      6
 TEI.id2406247  s       258     0       0       0       24      6
 TEI.id2406247  c       270     26      0       1       30      1
 TEI.id2406247  s       270     0       0       0       31      6

*.txt

Serialized text (UTF-8 text) as created by "dta-tokwrap.perl -t mktxt", possibly containing tokenizer "hints", to be passed to the underlying tokenizer.

The precise form taken by the hints in this file depends on many things, notably the options --strong-hints, --weak-hints, and --no-hints to dta-tokwrap.perl. You should ensure that your tokenizer is prepared to deal with whatever flavor of hints you are passing it. In particular, don't use the dwds_tomasotath tokenizer together with the --strong-hints option unless you want it to return a lot of ($, WB, $) "tokens".

Example:

 $SB$
 Ueber die Beeinflussung
 einfacher psychischer Vorgänge
 durch einige Arzneimittel.
 $SB$
 
 $SB$
 Experimentelle Untersuchungen
 $SB$

*.t

Tokenizer output (.tt, TAB-separated UTF-8 text). The first non-text field should contain "TXTOFF TXTLEN" pairs, where TXTOFF and TXTLEN are byte-offset and -length in the *.txt file. These data are required for recovery of c element IDs. See mootfiles(5) for details on the file format.

Example:

 %% raw tokenizer output generated by ../src/dtatw-tokenize-dummy (dta-tokwrap version 0.04)
 Ueber  49 5
 die    55 3
 Beeinflussung  59 13
 einfacher      73 9
 psychischer    83 11
 Vorgänge       95 9
 durch  105 5
 einige 111 6
 Arzneimittel   118 12
 .      130 1   $.
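
A reader for this format could look roughly like the following Python sketch. It is not part of dta-tokwrap, and parse_tt is a hypothetical name. It assumes that lines beginning with %% are comments, that empty lines separate sentences, and that the second TAB-separated field holds the "TXTOFF TXTLEN" pair; consult mootfiles(5) for the authoritative definition.

```python
def parse_tt(lines):
    """Parse tokenizer output into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("%%"):
            continue                          # comment line
        if not line:
            if current:                       # an empty line ends the current sentence
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        off, length = (int(v) for v in fields[1].split())
        current.append({
            "text": fields[0],
            "off": off,                       # byte offset in *.txt
            "len": length,                    # byte length in *.txt
            "analyses": fields[2:],           # any remaining fields, e.g. "$."
        })
    if current:
        sentences.append(current)
    return sentences
```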

*.cpx

Character+pagebreak index (CSV, TAB-separated). Used in generation of *.u.xml files.

Example:

 %% <(^c$)>+<pb> index generated by ../scripts/dtatw-mkpx.perl
 %%======================================================================
 %%$X_ID        $PB_I   $PB_N   $PB_FACS        $X_XPATH        
 c1     0       NULL    NULL    /TEI[1]/text[1]/c[1]
 c2     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[1]
 c3     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[2]
 c4     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[3]
 c5     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[4]
 c6     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[5]
 c7     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[6]
 c8     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/c[7]

*.wpx

Token+pagebreak index (CSV, TAB-separated). Used in generation of *.u.xml files. Format is same as *.cpx, but IDs are token-ids.

Example:

 %% <(^w$)>+<pb> index generated by ../scripts/dtatw-mkpx.perl
 %%======================================================================
 %%$X_ID        $PB_I   $PB_N   $PB_FACS        $X_XPATH        
 w1     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[1]
 w2     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[2]
 w3     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[3]
 w4     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[4]
 w5     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[5]
 w6     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[6]
 w7     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[7]
 w8     7       NULL    NULL    /TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]/w[8]
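
Since *.cpx and *.wpx share one layout, a single reader can serve both. The Python sketch below is illustrative only; parse_px is a hypothetical name. It assumes TAB-separated fields in the column order given by the header line, and it treats the literal string NULL as a missing value, which is an assumption based on the examples.

```python
def parse_px(lines):
    """Parse *.cpx or *.wpx lines into a dict keyed by character or token ID."""

    def nullable(value):
        # Assumption: the literal string NULL marks an absent value.
        return None if value == "NULL" else value

    index = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("%%"):
            continue                          # skip empty lines, comments and the header
        xid, pb_i, pb_n, pb_facs, xpath = line.split("\t")[:5]
        index[xid] = {
            "pb_i": int(pb_i),                # $PB_I column
            "pb_n": nullable(pb_n),           # $PB_N column
            "pb_facs": nullable(pb_facs),     # $PB_FACS column
            "xpath": xpath,                   # $X_XPATH column
        }
    return index
```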

Output File Formats

*.t.xml

Master XML-ified tokenizer output (XML). XPaths:

 /*/s        : sentence
 /*/s/w      : token: <w @xml:id b="TXTOFF TXTLEN" t="TEXT" c="C_IDS">...</w>
 //w/a       : token analysis: <a>ANALYSIS_TEXT</a>
 //w//*      : (additional analysis data, inserted e.g. by DTA::CAB utilities)
 //w/@xml:id : token id (unique within document, counted in serialized order)
 //w/@b      : byte-offset and length of token in tokenizer input *.txt
 //w/@t      : token text as output by tokenizer
 //w/@c      : space-separated list of //c/@xml:id for token characters

This format can also be passed directly to and from the DTA::CAB(3pm) analysis suite using the DTA::CAB::Format::XmlNative(3pm) formatter class.

Example:

 <?xml version="1.0" encoding="UTF-8"?>
 <sentences xml:base="ex1.xml">
  <s xml:id="s1">
    <w xml:id="w1" b="49 5" t="Ueber" c="c2 c3 c4 c5 c6"/>
    <w xml:id="w2" b="55 3" t="die" c="c8 c9 c10"/>
    <w xml:id="w3" b="59 13" t="Beeinflussung" c="c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24"/>
    <w xml:id="w4" b="73 9" t="einfacher" c="c25 c26 c27 c28 c29 c30 c31 c32 c33"/>
    <w xml:id="w5" b="83 11" t="psychischer" c="c35 c36 c37 c38 c39 c40 c41 c42 c43 c44 c45"/>
    <w xml:id="w6" b="95 9" t="Vorgänge" c="c47 c48 c49 c50 c51 c52 c53 c54"/>
    <w xml:id="w7" b="105 5" t="durch" c="c55 c56 c57 c58 c59"/>
    <w xml:id="w8" b="111 6" t="einige" c="c61 c62 c63 c64 c65 c66"/>
    <w xml:id="w9" b="118 12" t="Arzneimittel" c="c68 c69 c70 c71 c72 c73 c74 c75 c76 c77 c78 c79"/>
    <w xml:id="w10" b="130 1" t="." c="c80">
      <a>$.</a>
    </w>
  </s>
 <!-- ... -->
 </sentences>
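
For programs that do not use DTA::CAB, the XPaths listed above are simple enough to read directly. The following Python sketch is illustrative only; read_txml is a hypothetical name. It assumes that the s and w elements carry no namespace, as in the example, and that every w has a b attribute.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # expanded name of xml:id


def read_txml(txml_source):
    """Return one dict per token, in document order."""
    tokens = []
    root = ET.fromstring(txml_source)
    for s in root.iter("s"):
        for w in s.iter("w"):
            off, length = (int(v) for v in w.get("b").split())
            tokens.append({
                "sentence": s.get(XML_ID),
                "id": w.get(XML_ID),
                "text": w.get("t"),
                "off": off,                              # byte offset in *.txt
                "len": length,                           # byte length in *.txt
                "chars": w.get("c", "").split(),         # xml:ids of the <c> elements
                "analyses": [a.text for a in w.findall("a")],
            })
    return tokens
```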

*.s.xml

Sentence-level standoff XML. DEPRECATED in favor of *.t.xml, *.u.xml.

Example:

 <?xml version="1.0" encoding="UTF-8"?>
 <sentences xml:base="ex1.w.xml">
  <s xml:id="s1">
    <w ref="#w1"/>
    <w ref="#w2"/>
    <w ref="#w3"/>
    <w ref="#w4"/>
    <w ref="#w5"/>
    <w ref="#w6"/>
    <w ref="#w7"/>
    <w ref="#w8"/>
    <w ref="#w9"/>
    <w ref="#w10"/>
  </s>
  <!-- ... -->
 </sentences>

*.w.xml

Token-level standoff XML. DEPRECATED in favor of *.t.xml, *.u.xml.

Example:

 <?xml version="1.0" encoding="UTF-8"?>
 <tokens xml:base="ex1.xml">
  <w xml:id="w1" t="Ueber">
    <c ref="#c2"/>
    <c ref="#c3"/>
    <c ref="#c4"/>
    <c ref="#c5"/>
    <c ref="#c6"/>
  </w>
  <w xml:id="w2" t="die">
    <c ref="#c8"/>
    <c ref="#c9"/>
    <c ref="#c10"/>
  </w>
  <w xml:id="w3" t="Beeinflussung">
    <c ref="#c12"/>
    <c ref="#c13"/>
    <c ref="#c14"/>
    <c ref="#c15"/>
    <c ref="#c16"/>
    <c ref="#c17"/>
    <c ref="#c18"/>
    <c ref="#c19"/>
    <c ref="#c20"/>
    <c ref="#c21"/>
    <c ref="#c22"/>
    <c ref="#c23"/>
    <c ref="#c24"/>
  </w>
  <!-- ... -->
 </tokens>

*.a.xml

Token-analysis-level standoff XML. Currently contains only analyses supplied by the tokenizer. DEPRECATED in favor of *.t.xml, *.u.xml.

Example:

 <?xml version="1.0" encoding="UTF-8"?>
 <analyses xml:base="ex1.w.xml">

  <a ref="#w10">$.</a>
  <a ref="#w14">$ABBR</a>
  <a ref="#w17">$,</a>
  <a ref="#w23">$.</a>
  <a ref="#w27">$.</a>
  <a ref="#w29">$,</a>
  <a ref="#w34">$.</a>
  <a ref="#w35">$CARDPUNCT</a>
  <!-- ... -->
 </analyses>

*.u.xml

Extended serialized XML format, based on *.t.xml with additional XPaths:

 //s/@xp   : common source-XML XPath prefix for all sentence tokens
 //w/@xp   : XPath suffix (of ../@xp) for token
 //w/@t0   : tokenizer input text (including e.g. newlines) if different from @t
 //w/@u    : unicruft approximation of @t, if not equal to @t
 //w/@u0   : unicruft approximation of @t0, if not equal to @u
 //w/@pb   : index of last //pb before onset of //w
 //w/@cs   : character spans: "CID+LEN CID+LEN ... CID+LEN"; replaces @c

... and removed XPaths:

 //w/@c    : removed in favor of //w/@cs
 //w/@b    : removed in favor of //w/@cs, //w/@t0

Example:

 <?xml version="1.0" encoding="UTF-8"?>
 <sentences xml:base="ex1a.xml">
  <s xml:id="s1" xp="/TEI[1]/text[1]/front[1]/titlePage[1]/docTitle[1]/titlePart[1]">
    <w xml:id="w1" t="Ueber" pb="7" xp="-/c[1]" cs="c2+5"/>
    <w xml:id="w2" t="die" pb="7" xp="-/c[7]" cs="c8+3"/>
    <w xml:id="w3" t="Beeinflussung" pb="7" xp="-/c[11]" cs="c12+13"/>
    <w xml:id="w4" t="einfacher" pb="7" xp="-/c[24]" cs="c25+9"/>
    <w xml:id="w5" t="psychischer" pb="7" xp="-/c[34]" cs="c35+11"/>
    <w xml:id="w6" t="Vorgänge" pb="7" xp="-/c[46]" cs="c47+8"/>
    <w xml:id="w7" t="durch" pb="7" xp="-/c[54]" cs="c55+5"/>
    <w xml:id="w8" t="einige" pb="7" xp="-/c[60]" cs="c61+6"/>
    <w xml:id="w9" t="Arzneimittel" pb="7" xp="-/c[67]" cs="c68+12"/>
    <w xml:id="w10" t="." pb="7" xp="-/c[79]" cs="c80+1">
      <a>$.</a>
    </w>
  </s>
 </sentences>
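
Comparing this example with the *.t.xml example suggests that a span such as c2+5 abbreviates the five IDs c2 through c6. Under that reading, @cs could be expanded back into an explicit ID list as in the Python sketch below. The function name expand_cs is hypothetical, and the code rests on two assumptions: that IDs consist of a prefix followed by an integer counter, and that LEN counts consecutively numbered IDs. Neither assumption is confirmed by this manual.

```python
import re

# One span: an ID ending in digits, a plus sign, and a length.
SPAN_RE = re.compile(r"(.*?)(\d+)\+(\d+)")


def expand_cs(cs):
    """Expand a @cs value such as 'c2+5' into ['c2', 'c3', 'c4', 'c5', 'c6']."""
    ids = []
    for span in cs.split():
        m = SPAN_RE.fullmatch(span)
        if m is None:
            raise ValueError("unexpected span syntax: %r" % span)
        prefix, start, length = m.group(1), int(m.group(2)), int(m.group(3))
        # Assumption: LEN consecutive counters starting at the given ID.
        ids.extend("%s%d" % (prefix, start + i) for i in range(length))
    return ids
```

If the character IDs in a document are not consecutively numbered, the span would have to be resolved against the *.cx or *.cpx index instead.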

*.cw.xml

Base-format XML file with tokens encoded as w elements, as output by dtatw-add-w.perl.

Example:

 <?xml version="1.0"?>
 <TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:dta="http://www.deutsches-textarchiv.de/ns/1.0">
  <!-- ... -->
  <text>
    <!-- ... -->
          <titlePart type="main">
              <w xml:id="w1">
                <c xml:id="c2">U</c>
                <c xml:id="c3">e</c>
                <c xml:id="c4">b</c>
                <c xml:id="c5">e</c>
                <c xml:id="c6">r</c>
              </w>
              <c xml:id="c7"> </c>
              <w xml:id="w2">
                <c xml:id="c8">d</c>
                <c xml:id="c9">i</c>
                <c xml:id="c10">e</c>
              </w>
              <c xml:id="c11"> </c>
              <!-- ... -->
              <w xml:id="w10">
                <c xml:id="c80">.</c>
              </w>
          </titlePart>
    <!-- ... -->
  </text>
  <!-- ... -->
 </TEI>
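
Token text can be recovered from such a file by joining the characters below each w element. The Python sketch below is illustrative only; tokens_from_cw is a hypothetical name. It assumes that the w and c elements are in the TEI namespace, as in the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"                # assumed, as in the example
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # expanded name of xml:id


def tokens_from_cw(cw_source):
    """Return (token id, token text) pairs in document order."""
    w_tag = "{%s}w" % TEI_NS
    c_tag = "{%s}c" % TEI_NS
    tokens = []
    for w in ET.fromstring(cw_source).iter(w_tag):
        # Join the text of every <c> below this <w>.
        text = "".join(c.text or "" for c in w.iter(c_tag))
        tokens.append((w.get(XML_ID), text))
    return tokens
```

Characters that lie between tokens, such as the blank in c7, are not descendants of any w element and are therefore skipped.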

*.cws.xml

Base-format XML file with tokens and sentences encoded as w and s elements respectively, as output by dtatw-add-s.perl.

Example:

 <?xml version="1.0"?>
 <TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:dta="http://www.deutsches-textarchiv.de/ns/1.0">
  <!-- ... -->
  <text>
    <!-- ... -->
          <titlePart type="main">
            <s xml:id="s1">
              <w xml:id="w1">
                <c xml:id="c2">U</c>
                <c xml:id="c3">e</c>
                <c xml:id="c4">b</c>
                <c xml:id="c5">e</c>
                <c xml:id="c6">r</c>
              </w>
              <c xml:id="c7"> </c>
              <w xml:id="w2">
                <c xml:id="c8">d</c>
                <c xml:id="c9">i</c>
                <c xml:id="c10">e</c>
              </w>
              <c xml:id="c11"> </c>
              <!-- ... -->
              <w xml:id="w10">
                <c xml:id="c80">.</c>
              </w>
            </s>
          </titlePart>
    <!-- ... -->
  </text>
  <!-- ... -->
 </TEI>

SEE ALSO

dtatw-add-c.perl(1), dtatw-add-w.perl(1), dtatw-add-s.perl(1), dta-tokwrap.perl(1), dtatw-txml2uxml.perl(1), DTA::TokWrap::Intro(3pm), ...

AUTHOR

Bryan Jurish <moocow@cpan.org>
