NAME

cwb-align-export - Export existing sentence alignment from a CWB corpus

SYNOPSIS

  cwb-align-export [options] <SOURCE> <TARGET> <grid> <keyspec>

  <SOURCE>    CWB name of source corpus
  <TARGET>    CWB name of target corpus
  <grid>      s-attribute containing the alignment grid (usually "s")
  <keyspec>   pattern used to construct unique IDs for grid regions

Options:

  -r <dir>, --registry=<dir>    use registry directory <dir>
  -o <file>, --output=<file>    write output to <file>
  -nh, --no-header              write alignment file without header
  -f, --force                   skip alignment beads with errors rather than stopping
  -v, --verbose                 show progress messages during processing
  -h, --help                    display short help page

DESCRIPTION

This script exports an encoded sentence-level alignment between two CWB corpora (SOURCE and TARGET) as a text file compatible with cwb-align-import. In the output, alignment beads are specified by (sets of) unique sentence IDs in the source and target corpus. Unique IDs are computed from one or more s-attributes according to the pattern keyspec. Alignments at other granularities (such as paragraph or clause) are also supported; the corresponding s-attribute is specified by the command-line argument grid.

It is recommended to read the cwb-align-import manpage first, in order to get a better understanding of the export file format and its correspondence to a CWB alignment attribute. An illustrative example can be found in the CWB Corpus Encoding Tutorial.

ARGUMENTS

SOURCE

CWB corpus ID of the source language corpus.

TARGET

CWB corpus ID of the target language corpus.

grid

CWB attribute representing the alignment grid, i.e. each alignment bead links n consecutive grid regions in the source language to m consecutive grid regions in the target language. It is an error if the start or end of an alignment bead region doesn't match a corresponding grid boundary.

For the most common case of sentence alignment, grid will usually be set to s. Note that the same grid attribute is used for both source and target language corpus.
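The grid boundary rule can be illustrated with a minimal Python sketch. It is not the script's actual implementation: the function name grid_regions_for_bead and the representation of the grid as a sorted list of (start, end) corpus positions are assumptions made for this example only.

```python
def grid_regions_for_bead(grid, start, end):
    """Return the indices of the consecutive grid regions spanned by a bead.

    grid  -- list of (start, end) corpus positions, sorted by position
             (hypothetical in-memory form of the grid s-attribute)
    start -- first corpus position of the bead on one side of the alignment
    end   -- last corpus position of the bead on the same side

    Raises ValueError if start or end does not fall on a grid boundary.
    """
    starts = {s: i for i, (s, _) in enumerate(grid)}
    ends = {e: i for i, (_, e) in enumerate(grid)}
    if start not in starts or end not in ends:
        raise ValueError(f"bead {start}-{end} does not match grid boundaries")
    first, last = starts[start], ends[end]
    if last < first:
        raise ValueError(f"bead {start}-{end} ends before it starts")
    return list(range(first, last + 1))
```

With a grid of three sentences at positions 0-4, 5-9 and 10-14, a bead spanning 5-14 maps to the second and third grid region, whereas a bead spanning 5-12 is rejected because position 12 is not the end of any grid region.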

keyspec

Pattern used to construct unique sentence IDs (both in the source and target corpus). If sentences are directly annotated with IDs, say in the s-attribute s_id, you can simply specify {s_id} or {id} for short (the name of the grid attribute is automatically prepended). Note the curly braces around the attribute name!

In more complex situations, keyspec is an arbitrary character string that interpolates s-attributes enclosed in curly braces. For example, if paragraphs and sentences are numbered (s-attributes p_num and s_num), you can construct IDs of the form id_p4_s2 (second sentence in fourth paragraph) with the pattern id_p{p_num}_s{s_num}.
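The interpolation described above can be sketched in Python as follows. This only illustrates the idea and is not the code used by the script: the function name expand_keyspec and the values dictionary are hypothetical, and the rule that a name which is not found as given is retried with the grid attribute prepended is an assumption about how the short form {id} is resolved.

```python
import re

def expand_keyspec(keyspec, grid, values):
    """Expand {attr} placeholders in keyspec for one grid region.

    keyspec -- pattern such as "{id}" or "id_p{p_num}_s{s_num}"
    grid    -- name of the grid attribute, e.g. "s"
    values  -- dict mapping full s-attribute names (e.g. "s_id") to the
               annotation string of the current region (hypothetical input)

    Assumption: a placeholder name that is not found as given is looked up
    again with the grid attribute prepended ("id" becomes "s_id").
    """
    def lookup(match):
        name = match.group(1)
        for candidate in (name, f"{grid}_{name}"):
            if candidate in values:
                return values[candidate]
        raise KeyError(f"no s-attribute for placeholder {{{name}}}")

    return re.sub(r"\{([^{}]+)\}", lookup, keyspec)
```

Under these assumptions, the pattern id_p{p_num}_s{s_num} applied to a region with p_num = 4 and s_num = 2 produces id_p4_s2, and {id} with grid attribute s reads the value of s_id.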

OPTIONS

--registry=dir, -r dir

Locate corpora in CWB registry directory dir, overriding the default directory and the environment variable CORPUS_REGISTRY.

--output=file, -o file

Write export data to file rather than standard output. Files with extension .gz or .bz2 are compressed automatically.

--no-header, -nh

Write alignment file without the optional header line (see "EXPORT FILE FORMAT" below).

--force, -f

Ignore errors and continue exporting. If the start or end point of an alignment bead doesn't match grid boundaries, the bead will be skipped with an error message.

--verbose, -v

Verbose mode (shows progress messages during processing).

--help, -h

Show usage and options summary.

EXPORT FILE FORMAT

The exported alignment file starts with an optional header line specifying the CWB names of the source and target corpus, the s-attribute containing sentence regions (or other regions used as an alignment grid), and the key pattern for constructing unique sentence IDs. The four items are separated by TAB characters. Specify --no-header to omit the header line from the export file.

Each of the remaining lines in the export file corresponds to a single alignment bead. It consists of the ID of a sentence in the source corpus (or multiple IDs separated by blanks), followed by a TAB character and the ID of the aligned sentence in the target corpus (or multiple IDs separated by blanks).

See the cwb-align-import manpage for a more detailed description of the file format and the specification of unique IDs.
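As a rough illustration of the format described in this section, the following Python sketch reads an export file into a header and a list of beads. It is based only on the description above, not on the cwb-align-import source: the function name parse_alignment_export is hypothetical, and both the detection of the header (a first line with exactly four TAB-separated fields) and the skipping of empty lines are assumptions.

```python
def parse_alignment_export(lines):
    """Parse the lines of an exported alignment file.

    Returns (header, beads), where header is None or a dict with the keys
    "source", "target", "grid" and "keyspec", and beads is a list of
    (source_ids, target_ids) pairs, each a list of ID strings.
    """
    header = None
    beads = []
    for number, raw in enumerate(lines):
        line = raw.rstrip("\n")
        if not line:
            continue  # assumption: empty lines carry no information
        fields = line.split("\t")
        # Assumption: a first line with four fields is the optional header.
        if number == 0 and len(fields) == 4:
            header = dict(zip(("source", "target", "grid", "keyspec"), fields))
            continue
        if len(fields) != 2:
            raise ValueError(f"line {number + 1}: expected 2 TAB-separated fields")
        source_ids, target_ids = fields
        # Multiple IDs on one side of a bead are separated by blanks.
        beads.append((source_ids.split(), target_ids.split()))
    return header, beads
```

Because the function accepts any iterable of lines, it can be applied to an open file handle as well as to a list of strings, and it handles files written with --no-header by returning None for the header.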

COPYRIGHT

Copyright (C) 2007-2022 Stephanie Evert [https://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.