NAME

dta-tokwrap.perl - top-level tokenizer wrapper for DTA XML documents

SYNOPSIS

 dta-tokwrap.perl [OPTIONS] XMLFILE(s)...
 
 General Options:
  -help                  # show this help message
  -man                   # show complete manpage
  -verbose LEVEL         # set verbosity level (0<=level<=7; default=1)
 
 Make Emulation Options:
  -list-targets          # just list known targets
  -targets TARGETS       # set build targets (default='all')
  -make , -nomake        # do/don't emulate make-style dependency tracking (default=don't)
  -remake                # force rebuilding of all targets (implies -make)
  -force-target TARGET   # for -make mode, force rebuilding of TARGET
  -force                 # alias for -force-target=all
  -noforce               # overrides all preceding -force and -force-target flags
 
 Subprocessor Options:
  -rcdir RCDIR           # resource directory (default=$ENV{TOKWRAP_RCDIR} or /usr/local/share/dta-resources)
  -inplace , -noinplace  # do/don't use locally built programs if available (default=do)
  -sb-xpath XPATH        # add sentence-break hints on XPATH (element) open and close
  -wb-xpath XPATH        # add word-break hints on XPATH (element) open and close
  -hints, -nohints       # do/don't generate "hints" for the tokenizer (default=do)
  -weak-hints            # use whitespace-only hints rather than defaults ($WB$,$SB$)
  -strong-hints          # opposite of -weak-hints
  -abbrev-lex=FILE       # abbreviation lexicon for dwds_tomasotath or waste tokenizer
  -mwe-lex=FILE          # multiword-expression lexicon for dwds_tomasotath tokenizer
  -stop-lex=FILE         # stopword lexicon for waste tokenizer
  -conj-lex=FILE         # conjunction lexicon for waste tokenizer
  -waste-model=FILE      # HMM file for waste tokenizer
  -waste-dir=DIR         # waste base directory (defaults for -abbrev-lex, -stop-lex, -conj-lex, -waste-model)
  -procopt OPT=VALUE     # set arbitrary subprocessor options
 
 I/O Options:
  -outdir OUTDIR         # set output directory (default=.)
  -tmpdir TMPDIR         # set temporary directory (default=$ENV{DTATW_TMP} or OUTDIR)
  -keep , -nokeep        # do/don't keep temporary files (default=don't)
  -format , -noformat    # do/don't pretty-print XML output (default=do)
  -docopt OPT=VALUE      # set arbitrary document options (e.g. filenames)
 
 Logging Options:
  -log-config RCFILE     # use Log::Log4perl configuration file RCFILE (default=internal)
  -log-level LEVEL       # set minimum log level
  -log-file LOGFILE      # log to file LOGFILE (default=none)
  -stderr  , -nostderr   # do/don't log to console (default=do)
  -profile , -noprofile  # do/don't log profiling information (default=do)
  -silent  , -quiet      # alias for -verbose=0 -log-level=FATAL -notrace
 
 Trace and Debugging Options:
  -dump-xsl PREFIX       # dump generated XSL stylesheets to PREFIX*.xsl and exit
  -dummy , -nodummy      # don't/do actually run any subprocessors (default=do)
  -tokenizer-class CLASS # specify tokenizer subclass (e.g. http, waste, dummy, tomasotath_04x, ...)
  -dummy-tokenizer       # alias for -tokenizer-class=dummy
  -http-tokenizer        # alias for -tokenizer-class=http
  -trace , -notrace      # do/don't log trace messages (default: depends on -verbose)
  -traceAll              # enable logging of all possible trace messages
  -notraceAll            # disable logging of all possible trace messages
  -traceLevel LEVEL      # set trace logging level (default='trace')
  -traceX, -notraceX     # do/don't trace "X" (X={Open|Load|Save|Make|...})
  -traceXLevel LEVEL     # set log level for "X" traces (X={Open|...})

OPTIONS

General Options

-help

Display a short help message and exit.

-man

Display the complete program manpage and exit.

-verbose LEVEL

Set verbosity level (0<=level<=7; default=0)

Make Emulation Options

-targets TARGETS

Set build targets (default="all"). Multiple TARGETS may be separated by whitespace, commas, or by passing multiple -targets options. See "Known Targets" for a list of currently defined targets.
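
The separator rules above can be illustrated with a short Python sketch. It only shows how comma- and whitespace-separated values from repeated -targets options might be flattened into one list; the helper name parse_targets is hypothetical and is not part of dta-tokwrap.

```python
import re


def parse_targets(option_values):
    """Flatten repeated -targets values into one list of target names.

    Hypothetical helper: each value is split on commas and/or whitespace,
    as the option description allows, and empty fields are dropped.
    """
    targets = []
    for value in option_values:
        targets.extend(t for t in re.split(r"[\s,]+", value) if t)
    # The documented default target is "all".
    return targets or ["all"]


# Corresponds to: -targets "mkindex,mkbx0" -targets "mkbx mktok"
print(parse_targets(["mkindex,mkbx0", "mkbx mktok"]))
```

Under this reading, one option value carrying several targets and several options carrying one target each yield the same list.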

-make , -nomake

Do/don't emulate experimental make-style dependency tracking (default=don't). Use of -make mode may be faster (because it requires less file I/O).

-remake

Force rebuilding of all targets (implies -make).

-force-target TARGET

For -make mode, force rebuilding of TARGET.

-force

Alias for -force-target=all

-noforce

Overrides all preceding -force and -force-target flags.

Subprocessor Options

-inplace , -noinplace

Do/don't use locally built programs if available (default=do). This is useful if you want to test a development version (-inplace) and an installed system version (-noinplace) of this package on the same machine.

-sb-xpath XPATH

Tells the mkbx0 subprocessor to add sentence-break hints at the opening and closing of each element matched by XPATH (which should resolve only to element nodes). XPATH is included in the generated hint.xsl XSL stylesheet as a match item, so it can include e.g. top-level unions, but no nested unions.

This option may be specified more than once.

-wb-xpath XPATH

Tells the mkbx0 subprocessor to add word-break hints at the opening and closing of each element matched by XPATH (which should resolve only to element nodes). Same caveats as for "-sb-xpath XPATH".

This option may be specified more than once.

-hints , -nohints

Do/don't generate explicit sentence- and/or token-break "hints" for the tokenizer in the temporary .txt file (default=do). Explicit hint strings can be set with -procopt wbStr=WORDBREAK_HINT_STRING and/or -procopt sbStr=SENTBREAK_HINT_STRING; see -procopt below for details.

-weak-hints

If generating tokenizer "hints", use whitespace-only hints rather than defaults "\n$WB$\n", "\n$SB$\n". This can be useful if your low-level tokenizer doesn't understand the explicit hints, but might be predisposed to break tokens and/or sentences on whitespace.

-strong-hints

Opposite of -weak-hints.

-abbrev-lex=FILE

Abbreviation lexicon for dwds_tomasotath or waste tokenizer. Default is (usually) /usr/local/share/dta-resources/dta_abbrevs.lex.

FILE may be specified as the empty string to avoid use of an abbreviation lexicon altogether, although this is likely to wreak havoc with dwds_tomasotath's sentence-boundary recognition.

-mwe-lex=FILE

Multiword-expression lexicon for dwds_tomasotath tokenizer. Default is (usually) /usr/local/share/dta-resources/dta_mwe.lex.

FILE may be specified as the empty string to avoid use of a multiword-expression lexicon altogether, although this might cause problems with dwds_tomasotath.

-procopt OPT=VALUE

Set a literal arbitrary subprocessor option OPT to VALUE. See subprocessor module documentation for available options.

I/O Options

-outdir OUTDIR

Set output directory (default=.)

-tmpdir TMPDIR

Set directory for storing temporary files. Default value is taken from the environment variable $DTATW_TMP if it is set, otherwise the default is the value of OUTDIR (see -outdir).
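
The fallback order described above can be written out as a small Python sketch. The function resolve_tmpdir and its parameters are hypothetical names chosen for illustration. Treating a set-but-empty $DTATW_TMP as "set" is an assumption, since the text only says "if it is set".

```python
import os


def resolve_tmpdir(cli_tmpdir=None, outdir=".", env=None):
    """Pick the temporary directory in the documented order of precedence.

    1. an explicit -tmpdir value,
    2. the DTATW_TMP environment variable, if set,
    3. the output directory (-outdir, default ".").
    """
    env = os.environ if env is None else env
    if cli_tmpdir:
        return cli_tmpdir
    if "DTATW_TMP" in env:  # assumption: "set" means present in the environment
        return env["DTATW_TMP"]
    return outdir


print(resolve_tmpdir(outdir="build", env={}))
```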

-keep , -nokeep

Do/don't keep temporary files, rather than deleting them when they are no longer needed (default=don't).

-format , -noformat

Do/don't pretty-print XML output when possible (default=do).

-docopt OPT=VALUE

Set arbitrary DTA::TokWrap::Document options (e.g. filenames). See DTA::TokWrap::Document(3pm) for details.

Logging Options

-log-config RCFILE

Use Log::Log4perl configuration file RCFILE, rather than the default internal configuration. See Log::Log4perl(3pm) for details on the syntax of RCFILE.

-log-level LEVEL

Set minimum log level. Only effective if the default (internal) log configuration is being used.

-log-file LOGFILE

Send log output to file LOGFILE (default=none). Only effective if the default (internal) log configuration is being used.

-stderr , -nostderr

Do/don't log to console (default=do). Only effective if the default (internal) log configuration is being used.

-profile , -noprofile

Do/don't log profiling information (default=do).

-silent , -quiet

Alias for -verbose=0 -log-level=FATAL -notrace.

Trace and Debugging Options

-dump-xsl PREFIX

Dumps generated XSL stylesheets to PREFIX*.xsl and exits. Useful for debugging. Causes the following files to be written:

 ${PREFIX}mkbx0_hint.xsl    # hint insertion
 ${PREFIX}mkbx0_sort.xsl    # serialization sort-key generation
 ${PREFIX}standoff_t2s.xsl  # master XML to sentence standoff
 ${PREFIX}standoff_t2w.xsl  # master XML to token standoff
 ${PREFIX}standoff_t2a.xsl  # master XML to analysis standoff

-dummy , -nodummy

Don't/do actually run any subprocessors (default=do)

-dummy-tokenizer , -nodummy-tokenizer

Do/don't use locally built dummy tokenizer instead of tomata2.

-trace , -notrace

Do/don't log trace messages (default: depends on the current -verbose level; see -verbose).

-traceAll

Enable logging of all possible trace messages. Warning: this generates a lot of log output.

-notraceAll

Disable logging of all possible trace messages.

-traceLevel LEVEL

Set log level to use for trace messages (default='trace'). LEVEL is one of the following: trace, debug, info, warn, error, fatal. Any other value for LEVEL causes trace messages not to be logged.

-traceX , -notraceX

Do/don't log trace messages for the trace flavor X, where X is one of the following:

 Open    # document object open() method
 Close   # document object close() method
 Proc    # document processing method calls
 Load    # load document data file
 Save    # save document data file
 Make    # document target (re-)making (including status-check)
 Gen     # document target (re-)generation
 Subproc # low-level subprocessor calls
 Run     # external system command

-traceXLevel LEVEL

Set log level for X-type traces to LEVEL. X is a trace message flavor as described in -traceX, and LEVEL is as described in -traceLevel.
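
Because the "X" in -traceX, -notraceX and -traceXLevel is a placeholder, it can help to see the pattern expanded. The Python sketch below substitutes each documented trace flavor for X; the helper name trace_option_names is hypothetical, and the resulting spellings follow only from the patterns given here.

```python
# Trace flavors as listed under -traceX.
TRACE_FLAVORS = ["Open", "Close", "Proc", "Load", "Save",
                 "Make", "Gen", "Subproc", "Run"]


def trace_option_names(flavor):
    """Return the three option names obtained by replacing X with flavor."""
    if flavor not in TRACE_FLAVORS:
        raise ValueError(f"unknown trace flavor: {flavor}")
    return [f"-trace{flavor}", f"-notrace{flavor}", f"-trace{flavor}Level"]


for flavor in TRACE_FLAVORS:
    print(" ".join(trace_option_names(flavor)))
```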

ARGUMENTS

All other command-line arguments are assumed to be filenames of DTA "base-format" XML files, which are simply (TEI-conformant) UTF-8 encoded XML files with one (optional as of dta-tokwrap v0.38) <c> element per character:

  • the document MUST be encoded in UTF-8,

  • all text nodes to be tokenized should be descendants of a <text> element, and may optionally be immediate daughters of a <c> element (XPath //text//text()|//text//c/text()). <c> elements may not be nested.

    Prior to dta-tokwrap v0.38, <c> elements were required.
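
The requirements above can be checked roughly with Python's standard library. The sketch below is an approximation under stated assumptions: the sample document is invented and namespace-free (real TEI files usually declare a namespace, which this code does not handle), and tokenizable_text is a hypothetical helper, not part of dta-tokwrap.

```python
import xml.etree.ElementTree as ET

# Invented sample: UTF-8, text below <text>, some characters wrapped in <c>.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<TEI>
  <teiHeader><title>ignored</title></teiHeader>
  <text><p><c>A</c><c>b</c> c</p></text>
</TEI>
"""


def tokenizable_text(xml_bytes):
    """Return all text below the first <text> element; reject nested <c>."""
    root = ET.fromstring(xml_bytes)
    text_el = root.find(".//text")
    if text_el is None:
        raise ValueError("no <text> element found")
    for c in text_el.iter("c"):
        # iter() yields the element itself first, so more than one
        # hit means a <c> element contains another <c> element.
        if sum(1 for _ in c.iter("c")) > 1:
            raise ValueError("nested <c> elements are not allowed")
    return "".join(text_el.itertext())


print(repr(tokenizable_text(SAMPLE)))
```

Text inside the header is skipped because only descendants of the text element are collected.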

DESCRIPTION

This program is intended to provide a flexible high-level command-line interface to the tokenization of DTA "base-format" XML documents, generating e.g. sentence-, token-, and analysis-level standoff XML annotations for each input document.

The program can be run in one of two main modes; see "Modes of Operation" for details on these. In either mode, it can be used either as a standalone batch-processor for one or more input documents, or called by a superordinate build system, e.g. GNU make (see make(1)). Program operation is controlled primarily by the specification of one or more "targets" to build for each input document; see "Known Targets" for details.

Modes of Operation

The program can be run in one of two modes of operation, "-make Mode" and "-nomake Mode".

-make Mode

(DEPRECATED)

In this (deprecated) mode, the program attempts to emulate the dependency tracking features of make by (re-)building only those targets which either do not yet exist, or which are older than one or more of their dependencies. Since some dependencies are ephemeral, existing only in RAM during a single program run, this can mean a lot of pain for comparatively little gain.

-make mode is enabled by specifying the -make option on the command-line.
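
The make-style rule described above (rebuild a target if it does not exist or is older than a dependency) can be sketched in Python as follows. This illustrates the general technique using file modification times; it is not the program's own implementation, it does not model the RAM-only ephemeral dependencies mentioned above, and needs_rebuild is a hypothetical name.

```python
import os


def needs_rebuild(target, dependencies):
    """True if target is missing or older than at least one dependency.

    All dependencies are assumed to exist on disk; a missing
    dependency makes os.path.getmtime() raise an OSError.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


# A target that does not exist always needs to be built.
print(needs_rebuild("no-such-file.t.xml", []))
```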

-nomake Mode

In this (experimental) mode, no implicit dependency tracking is attempted, and all required data files (input, "temporary", and/or output) must exist when the requested target is built; otherwise an error results. -nomake mode can be somewhat slower than -make mode, since "temporary" data (which in -make mode are RAM-only ephemera) may need to be bounced off the filesystem.

-nomake mode is the default mode, and may be (re-)enabled (overriding any preceding -make option) by specifying the -nomake option on the command-line.
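
Since -nomake mode expects every input file of a target to exist already, it can be useful to check a planned target sequence in advance. The Python sketch below simulates that check for the first few targets, with inputs and outputs transcribed from "-nomake Targets" as file name suffixes. The table covers only part of the target list, and first_missing_input is a hypothetical helper.

```python
# target: (input suffixes, output suffixes), per "-nomake Targets"
NOMAKE_TARGETS = {
    "mkindex": ([".xml"], [".cx", ".sx", ".tx"]),
    "mkbx0": ([".sx"], [".bx0"]),
    "mkbx": ([".bx0", ".tx"], [".bx", ".txt"]),
    "mktok0": ([".txt"], [".t0"]),
    "mktok1": ([".t0"], [".t1"]),
}


def first_missing_input(targets, available):
    """Simulate building targets in order.

    Returns (target, missing_suffixes) for the first target whose inputs
    are not all available, or None if the whole sequence can run.
    """
    have = set(available)
    for name in targets:
        inputs, outputs = NOMAKE_TARGETS[name]
        missing = [suffix for suffix in inputs if suffix not in have]
        if missing:
            return name, missing
        have.update(outputs)
    return None


# Requesting mkbx alone with only FILE.xml present would fail:
print(first_missing_input(["mkbx"], [".xml"]))
```

Listing the prerequisite targets first, in order, lets the same sequence run to completion.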

Known Targets

-make Targets

The following targets are known values for the -targets option in "-make Mode":

all
(not yet documented)

-nomake Targets

The following targets are known values for the -targets option in "-nomake Mode":

mkindex

Alias(es): cx sx tx xx

Input(s): FILE.xml

Output(s): FILE.cx, FILE.sx, FILE.tx

Creates temporary "character index" FILE.cx (CSV), "structure index" FILE.sx (XML without <c> elements), and "text index" FILE.tx (raw text, unserialized) for each input document FILE.xml.

mkbx0

Alias(es): bx0

Input(s): FILE.sx

Output(s): FILE.bx0

Creates temporary hint- and serialization index FILE.bx0 for each input document FILE.xml.

mkbx

Alias(es): mktxt bx txt

Input(s): FILE.bx0, FILE.tx

Output(s): FILE.bx, FILE.txt

Creates temporary serialized block-index file FILE.bx and serialized text file FILE.txt for each input document FILE.xml.

mktok0

Alias(es): tokenize0 tok0 t0 tt0

Input(s): FILE.txt

Output(s): FILE.t0

Creates temporary CSV-format raw tokenizer output file FILE.t0 for each input document FILE.xml.

mktok1

Alias(es): tokenize1 tok1 t1 tt1

Input(s): FILE.t0

Output(s): FILE.t1

Creates temporary CSV-format post-processed tokenizer output file FILE.t1 for each input document FILE.xml.

mktok

Alias(es): tokenize tok t tt

Input(s): FILE.txt

Output(s): FILE.t0 FILE.t1

Wrapper for "mktok0 mktok1".

mktxml

Alias(es): tok2xml xtok txml ttxml tokxml

Input(s): FILE.t, FILE.bx, FILE.cx

Output(s): FILE.t.xml

Creates master tokenized XML output file FILE.t.xml for each input document FILE.xml.

addws

Alias(es): mkcws cwsxml cws

Input(s): FILE.xml FILE.t.xml

Output(s): FILE.cws.xml

Creates "spliced" XML output "Frankenfile" FILE.cws.xml for each input document FILE.xml; see also dtatw-splice.perl(1).

mksxml

Alias(es): mksos sosxml sosfile sxml

Input(s): FILE.t.xml

Output(s): FILE.s.xml

DEPRECATED

Creates sentence-level stand-off XML file FILE.s.xml for each input document FILE.xml.

mkwxml

Alias(es): mksow sowxml sowfile wxml

Input(s): FILE.t.xml

Output(s): FILE.w.xml

DEPRECATED

Creates token-level stand-off XML file FILE.w.xml for each input document FILE.xml.

mkaxml

Alias(es): mksoa soaxml soafile axml

Input(s): FILE.t.xml

Output(s): FILE.a.xml

DEPRECATED

Creates token-analysis-level stand-off XML file FILE.a.xml for each input document FILE.xml.

mkstandoff

Alias(es): standoff so mkso

DEPRECATED

Alias for mksxml, mkwxml, mkaxml.

all

Alias(es): (none)

Input(s): FILE.xml

Output(s): FILE.t.xml, FILE.cws.xml

Alias for all targets required to generate the target's output files (master tokenized file and spliced output) from the input document, run in the proper order.

tei2t

Alias(es): (none)

Input(s): FILE.xml

Output(s): FILE.t

Alias for all targets required to generate fixed tokenizer output FILE.t from a TEI-XML file FILE.xml, run in the proper order.

tei2txml

Alias(es): (none)

Input(s): FILE.xml

Output(s): FILE.t.xml

Alias for all targets required to generate a flat tokenized XML file FILE.t.xml from a TEI-XML file FILE.xml, run in the proper order.

SEE ALSO

DTA::TokWrap::Intro(3pm), dtatw-add-c.perl(1), dtatw-add-w.perl(1), dtatw-add-s.perl(1), dtatw-rm-c.perl(1), dtatw-splice.perl(1), ...

AUTHOR

Bryan Jurish <moocow@cpan.org>