
NAME

cwb-align-tmx2beads - Export existing aligned data from TMX for use with CWB.

SYNOPSIS

  cwb-align-tmx2beads [options]

Options:

  -i <file>, --tmx-input=<file>    specify input TMX file (can be used multiple times; required at least once)
  -o <file>, --bead-output=<file>  specify target file for alignment beads; if absent, data written to STDOUT
  -H, --dummy-header               include a "dummy" header line at the start of the output, before alignment beads
  -s <ISO>, --source-lang=<ISO>    ISO 639-1 or 639-2 code for the source language (if unspecified, this will be guessed)
  -w, --write-text                 write monolingual text files with auto-generated filenames, as well as bead data
  -t <att>, --text-attribute=<att> name for the s-attribute to use for each text (defaults to just <text>)
  -g <att>, --grid-attribute=<att> name for the s-attribute to use as alignment grid (defaults to <seg>, as in TMX)
  -v, --verbose                    show progress messages during processing
  -h, --help                       display short help page

DESCRIPTION

This is a CWB support tool which can be used to export parallel corpus data in the TMX format to files suitable for use in CWB.

TMX (Translation Memory eXchange) is an XML-based standard for storing aligned bilingual or multilingual data; this tool currently handles only the bilingual case. At the time of writing, the TMX standard is maintained at https://www.gala-global.org/lisa-oscar-standards and documentation with examples can be found at https://www.gala-global.org/tmx-14b. Several proprietary software packages for aligning parallel corpus data store their results as TMX.

However, TMX data operates on a very different principle to CWB-indexed corpora. In CWB, the source-language and target-language texts are indexed as two separate corpora. A structural attribute then identifies the stretches of text that are to be linked between the two corpora, and these links are stored in an alignment attribute. In TMX, by contrast, each text in the parallel corpus is stored in a single bilingual XML file, in which corresponding regions of text are placed adjacent to one another within a grouping XML element. In other words, a TMX file consists of alternating chunks in the two languages.

cwb-align-tmx2beads takes a set of TMX files and generates a file that re-expresses the alignment information as what CWB calls alignment beads. This is the format that cwb-align-import reads in order to generate the necessary a-attribute(s). Optionally (though normally you would want this), cwb-align-tmx2beads can also create a pair of text files for each original TMX file, one containing the source-language text and the other the target-language text. These separated files can then be tokenised or tagged and ultimately indexed as CWB corpora.

The output, printed to STDOUT by default, is designed to serve as an input file for cwb-align-import: it contains one alignment bead per row, specifying first the ID(s) of the source language region(s) and then the ID(s) of the target language region(s). A full header row is not included in the output; a "dummy" header can be included with the -H option, but this contains placeholders which the user must replace manually before using the data. See "OUTPUT FILE FORMAT" below.

An example illustrating a typical use case can be found in the CWB Corpus Encoding Tutorial. (NB - this is a TODO!)

OPTIONS

--help, -h

Show usage and options summary.

--verbose, -v

Verbose mode (shows progress messages during processing).

--tmx-input=file, -i file

Input filename: path to a TMX data file to process. This must be specified at least once, but can be specified any number of times.

--bead-output=file, -o file

Output filename: if specified, the alignment data will be written to the file in question; otherwise it will be written to STDOUT. See "OUTPUT FILE FORMAT" for its format. The conventional file extension is .align.

--dummy-header, -H

Cause output to begin with a "dummy" header line; see "OUTPUT FILE FORMAT" for more details. By default, the output contains no header line - just the sequence of lines representing alignment beads.

--source-lang=ISO, -s ISO

Specify the source language. Languages must be declared using ISO 639 codes, in either the two-letter or the three-letter version; whichever you use must match what is in the TMX file(s). It is best always to specify this, but if you don't, cwb-align-tmx2beads will try to guess from the content of your TMX files.

The language codes are case-insensitive; EN and en would have exactly the same effect.

--write-text, -w

Write monolingual files containing the original text.

If this flag is present, then for each input file, a pair of text files will be generated in the current working directory; their names are distinguished by the ISO-639 language codes. The files are text/XML as produced, and they can be tokenised and optionally tagged separately in order to produce indexable .vrt files for CWB corpus creation. They contain XML seg elements with identifier attributes (id) that correspond to those in the alignment bead file. The root element is text (or specify otherwise with -t), and has an id attribute which is derived from the TMX filename.

If you intend to retain the auto-generated text ID attribute as the eventual CWB IDs for the texts, be aware that to be compatible with CQPweb, the part of the filenames that is incorporated into the text IDs should include only unaccented Latin letters, the digits 0 to 9, and the underscore character.
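The character restriction above can be checked before any files are generated. The following is a minimal Python sketch, not part of cwb-align-tmx2beads; the function name is_cqpweb_safe_stem is hypothetical, and it assumes that the part of the filename incorporated into the text ID is the filename without its directory and extension.

```python
import re
from pathlib import Path

# Only unaccented Latin letters, the digits 0 to 9, and the underscore.
_SAFE = re.compile(r"[A-Za-z0-9_]+")


def is_cqpweb_safe_stem(tmx_path):
    """Return True if the filename stem uses only CQPweb-compatible characters.

    Assumption: the text ID is built from the filename without its
    directory and extension (the "stem").
    """
    stem = Path(tmx_path).stem
    return _SAFE.fullmatch(stem) is not None


if __name__ == "__main__":
    for name in ["myfile.tmx", "my-file.tmx", "données.tmx"]:
        print(name, is_cqpweb_safe_stem(name))
```

Renaming any file that fails this check before running the tool avoids having to repair text IDs later.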

--text-attribute=attribute, -t attribute

Specify name for the text-level XML element (CWB structural attribute).

The files created with -w or --write-text are, by default, of the following overall form:

   <text id="myfile_en">
      ... segments representing translation units here ...
   </text>

If you want a different root element, you can use this option to set it. It must be a valid CWB identifier (and you should make sure it is preserved, along with its ID, when you tokenise and/or tag and/or index the corpus files).

--grid-attribute=attribute, -g attribute

Specify the grid attribute to use in text files.

Alignment data in CWB relies on a single structural attribute (represented in input text as a set of nonempty instances of the attribute's namesake XML element) used as the alignment grid. This s-attribute must have the same name in both the source and target language corpora; each alignment bead links n consecutive grid regions in the source language to m consecutive grid regions in the target language.

Traditionally, the typical grid attribute is s (for sentence alignment). In TMX, however, aligned elements are contained within <seg>...</seg>. By default, text files produced using the -w / --write-text flag preserve the <seg> tags from the TMX data. However, you can use the -g option to specify an alternative name for the grid attribute. For instance, if you specify --grid-attribute=alignedChunk then the output files will contain lines like this:

   <alignedChunk id="myfile_en34">I visit London. Then I visit Paris.</alignedChunk>

      ... and in the corresponding target-language file ...

   <alignedChunk id="myfile_fr32">Je visite Londres et puis je visite Paris.</alignedChunk>

instead of forms like this:

   <seg id="myfile_en34">I visit London. Then I visit Paris.</seg>

   <seg id="myfile_fr32">Je visite Londres et puis je visite Paris.</seg>

These XML elements (whether <seg> ... </seg> or anything else) must be preserved, along with their IDs, when you tokenise and/or tag and/or index the corpus files; if not, then CWB will not be able to anchor the alignment data to the actual text.
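To make the shape of the monolingual files concrete, here is a minimal Python sketch that renders one such file from a list of segments. It is an illustration only, not the tool's own code: the function name render_text_file is hypothetical, and the ID scheme (filename stem, underscore, language code, then a counter starting at 1) is an assumption inferred from the examples in this manual.

```python
from xml.sax.saxutils import escape, quoteattr


def render_text_file(stem, lang, segments, text_att="text", grid_att="seg"):
    """Render one monolingual file as a string.

    Assumption: IDs have the form <stem>_<lang><n>, counting from 1, as
    suggested by the examples in this manual; the real numbering may differ.
    """
    text_id = f"{stem}_{lang}"
    lines = [f"<{text_att} id={quoteattr(text_id)}>"]
    for n, seg in enumerate(segments, start=1):
        seg_id = f"{text_id}{n}"
        # escape() protects &, < and > in the segment text.
        lines.append(
            f"<{grid_att} id={quoteattr(seg_id)}>{escape(seg)}</{grid_att}>"
        )
    lines.append(f"</{text_att}>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_text_file("myfile", "en",
                           ["I visit London.", "Fish & chips."],
                           grid_att="alignedChunk"), end="")
```

Whatever tokeniser or tagger is applied afterwards must pass these element lines and their id attributes through unchanged.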

INPUT FILE FORMAT

This program deals with one frequently-seen variant of the TMX format ("Translation Memory eXchange"). TMX is an XML-based language; it contains a sequence of XML elements that represent correspondence units. Every such unit contains one segment of text in the source language and one in the target language, where the target-language segment has been identified as the translation of the source-language segment. That is, it aligns regions of the source and target texts by placing them adjacently in the structure of the TMX XML tree.
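The structure just described can be illustrated with a short Python sketch that reads the translation units from a TMX string. This is not how cwb-align-tmx2beads itself is implemented; the sample data is a trimmed fragment rather than a complete, valid TMX file, and the sketch assumes that languages are marked with the xml:lang attribute on each <tuv> element.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed illustration only: a real TMX header carries more attributes.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>I visit London.</seg></tuv>
      <tuv xml:lang="fr"><seg>Je visite Londres.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Then I visit Paris.</seg></tuv>
      <tuv xml:lang="fr"><seg>Puis je visite Paris.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def translation_units(tmx_string):
    """Yield one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_string)
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Lower-case the code so that EN and en are treated alike.
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            unit[lang] = "".join(seg.itertext()) if seg is not None else ""
        yield unit


if __name__ == "__main__":
    for unit in translation_units(SAMPLE_TMX):
        print(unit)
```

Each dictionary holds the two sides of one correspondence unit, which is the information that ends up split between the bead file and the two monolingual text files.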

The TMX format will not be explained further here. The TMX standard is, as of this writing, online at https://www.gala-global.org/lisa-oscar-standards and documentation with examples can be found at https://www.gala-global.org/tmx-14b. Several versions of the TMX DTD exist, of which the most recent seems to be 1.5 (circa 2011), downloadable from https://sourceforge.net/projects/tmx/files/.

OUTPUT FILE FORMAT

The output from this tool is designed to be used as input for cwb-align-import. It therefore follows the format described in the manual page for cwb-align-import, but without using the more advanced features described there. In particular, cwb-align-tmx2beads cannot generate a full header row for its output - because the header must specify the source and target corpora, which is information that cwb-align-tmx2beads does not have. There are two ways that cwb-align-tmx2beads can deal with this. By default, the header row is omitted completely. In this case, it is necessary to use cwb-align-import options to specify the information that would otherwise have been in the header line.

Alternatively, if the -H / --dummy-header flag is set, a "dummy" header line will be written before the rest of the output. This line has placeholders (of the form DUMMY_SourceLangCorpusCwbName) for the CWB names of the two corpora; you then need only replace the placeholders with the correct corpus names and the file will be ready to use with cwb-align-import.

Without -H, you can manually add a header line later, in the following form:

   SOURCE_CORPUS     TARGET_CORPUS      s      {s_id}

... with the appropriate grid attribute in place of s.

Each of the remaining lines in the output corresponds to a single alignment bead (that is, a single <tu> translation unit in the TMX). It consists of the ID(s) of one or more grid regions in the source corpus, followed by a TAB character, and then the ID(s) of the aligned grid region(s) in the target corpus. For example, an output line might look like this:

    mytext_en1 mytext_en2  <TAB>  mytext_fr4 

This indicates that the source corpus translation segments with IDs mytext_en1 and mytext_en2 are aligned to the target corpus translation segment with ID mytext_fr4 (where all of them originate from the file mytext.tmx). The IDs are created automatically, and appear in the text files created when the -w / --write-text flag is set.
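A bead line of this kind is simple to read back programmatically, for instance to check the output before passing it to cwb-align-import. The following Python sketch is illustrative only; the function name parse_bead is hypothetical, and it assumes that the two sides are separated by exactly one TAB and that multiple IDs on one side are separated by spaces, as in the example above.

```python
from collections import Counter


def parse_bead(line):
    """Split one bead line into (source_ids, target_ids).

    Assumption: one TAB separates the two sides, and multiple IDs on the
    same side are separated by spaces.
    """
    source, target = line.rstrip("\n").split("\t")
    return source.split(), target.split()


def bead_types(lines):
    """Count beads by shape, e.g. '2:1' for two source regions to one target."""
    counts = Counter()
    for line in lines:
        src, tgt = parse_bead(line)
        counts[f"{len(src)}:{len(tgt)}"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "mytext_en1 mytext_en2\tmytext_fr4\n",
        "mytext_en3\tmytext_fr5\n",
    ]
    print(bead_types(sample))
```

If the file was produced with -H, or a header line has been added by hand, skip the first line before applying parse_bead.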

SEE ALSO

The manual for cwb-align-import goes into much more detail on the format of the output file (man cwb-align-import).

COPYRIGHT

Copyright (C) 2018-2022 Corpus Workbench contributors (see file AUTHORS)

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.