NAME

cwb-align-import - Import existing sentence alignment into a CWB corpus

SYNOPSIS

  cwb-align-import [options] <alignment_beads.txt>

Options:

  -r <dir>, --registry=<dir>    use registry directory <dir>
  -i, --inverse                 encode inverse alignment (target -> source)
  -p, --prune                   ignore alignment beads with ID errors
  -e, --empty                   allow 1:0 and 0:1 alignments (not encoded)
  -v, --verbose                 show progress messages during processing
  -h, --help                    display short help page

  -nh, --no-header              alignment file without header; must specify:
  -l1 <name>, --source=<name>   CWB name of source corpus
  -l2 <name>, --target=<name>   CWB name of target corpus
  -s <att>,   --grid=<att>      alignment grid (s-attribute, usually sentences)
  -k <spec>,  --key=<spec>      pattern for constructing unique sentence IDs

DESCRIPTION

This script imports an existing sentence-level alignment between two corpora into the IMS Open Corpus Workbench. Alignments at other granularities (such as paragraph or clause) are also supported, but note that discontinuous regions are not allowed in alignment beads and that there can only be one alignment between any two corpora. For simplicity, this manpage talks about sentence alignment only; adjust the instructions accordingly to import paragraph alignment etc.

First, the two corpora to be aligned must be CWB-encoded, and all sentence regions (usually marked by an s-attribute s) must carry unique IDs. For efficiency and convenience, IDs can be constructed from multiple annotations at different levels, e.g. document ID, paragraph number (within document) and sentence number (within paragraph).

The input file for cwb-align-import contains one alignment bead per row, specifying first the ID(s) of the source language sentence(s) and then the ID(s) of the target language sentence(s). The CWB names of the source and target corpus, the s-attribute containing sentence regions, and a pattern for constructing unique IDs are either listed in the header of the input file or specified with command-line arguments. See "INPUT FILE FORMAT" below for details on the alignment file format.

An example illustrating a typical use case can be found in the CWB Corpus Encoding Tutorial.

OPTIONS

--help, -h

Show usage and options summary.

--verbose, -v

Verbose mode (shows progress messages during processing).

--registry=dir, -r dir

Locate corpora in CWB registry directory dir, overriding the default directory and the environment variable CORPUS_REGISTRY.

--inverse, -i

Encode inverse alignment (from target language to source language).

--prune, -p

Automatically ignore alignment beads whose sentence IDs are not found in the source or the target corpus. Without -p, cwb-align-import will abort with an error message in this case. Note that the -p option implies -e (see below).

--empty, -e

Allow 1:0 and 0:1 alignment beads, which will be silently ignored (without -e, they cause a fatal error).

--no-header, -nh

Alignment file does not contain a header line. In this case, the header information must be provided on the command line with the -l1, -l2, -s and -k flags (documented below).

--source=ID, -l1 ID

CWB corpus ID of the source language corpus. Overrides information in alignment file header, if present.

--target=ID, -l2 ID

CWB corpus ID of the target language corpus. Overrides information in alignment file header, if present.

--grid=attribute, -s attribute

CWB attribute used as alignment grid (each alignment bead links n consecutive grid regions in the source language to m consecutive grid regions in the target language). For the most common case of sentence alignment, the grid attribute will usually be s. Note that the same attribute name is used for both source and target language corpus.

--key=pattern, -k pattern

Pattern used to construct unique sentence IDs (must be the same in source and target language). If sentences are directly annotated with IDs, say in the s-attribute s_id, you can simply specify -k '{s_id}' or -k '{id}' for short (the name of the grid attribute is automatically prepended). Note the curly braces around the attribute name!

In more complex situations, pattern is an arbitrary character string that interpolates s-attributes enclosed in curly braces. For example, if paragraphs and sentences are numbered (s-attributes p_num and s_num), you can construct IDs of the form id_p4_s2 (second sentence in fourth paragraph) with the key -k 'id_p{p_num}_s{s_num}'.
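The interpolation of key patterns can be illustrated with a minimal Python sketch. It is not the code used by cwb-align-import: the function name expand_key and the annotations dictionary are hypothetical, and the rule for resolving short names (retry with the grid name prepended when the plain name is unknown) is an assumption based on the description above.

```python
import re

def expand_key(pattern, grid, annotations):
    """Build a sentence ID by interpolating {attribute} placeholders.

    `annotations` maps full s-attribute names (e.g. "s_num") to the value
    annotated on the current region.
    ASSUMPTION: a placeholder that is not a known attribute is retried with
    the grid name prepended ("num" -> "s_num"); the real script may resolve
    short names differently.
    """
    def lookup(match):
        name = match.group(1)
        if name in annotations:
            return str(annotations[name])
        full = f"{grid}_{name}"
        if full in annotations:
            return str(annotations[full])
        raise KeyError(f"unknown s-attribute in key pattern: {name}")

    # Replace every {name} in the pattern; all other characters are literal.
    return re.sub(r"\{([^{}]+)\}", lookup, pattern)

# Hypothetical annotations of one sentence region
region = {"p_num": 4, "s_num": 2, "s_id": "doc1.17"}
print(expand_key("id_p{p_num}_s{s_num}", "s", region))  # id_p4_s2
print(expand_key("{id}", "s", region))                   # doc1.17
```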

INPUT FILE FORMAT

The alignment file starts with an optional header line. Use the -nh (--no-header) flag if your file does not include a header. In this case you need to specify the necessary information on the command line, using the -l1, -l2, -s and -k flags.

The header specifies the corpora to be aligned, which must already have been CWB-encoded; the name of the s-attribute containing sentence regions (or other regions used as an alignment grid, such as paragraphs); and finally a key pattern to be used for constructing unique sentence IDs (see below). The four items must be separated by TAB characters.

In the simplest case, the ID key pattern is a string that includes the name of the s-attribute containing sentence IDs enclosed in curly braces. For instance, if your corpus includes unique sentence numbers

  <s num="1">

encoded in the standard way (i.e. with -S s+num), and if sentence IDs are given in the form id_1 in the alignment beads file, you have to specify id_{s_num} (or id_{num} for short) as key pattern. The entire header line might thus look like this:

   SOURCE_CORPUS  <TAB>  TARGET_CORPUS  <TAB>  s  <TAB>  id_{num}
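As a rough illustration of the header layout, the following Python sketch splits such a line into its four TAB-separated items. The names Header and parse_header are hypothetical, and stripping blanks around each item is an assumption made for leniency, not documented behaviour of the script.

```python
from typing import NamedTuple

class Header(NamedTuple):
    source: str  # CWB name of the source corpus
    target: str  # CWB name of the target corpus
    grid: str    # s-attribute used as alignment grid
    key: str     # pattern for constructing sentence IDs

def parse_header(line):
    """Split a header line into its four TAB-separated items."""
    fields = [f.strip() for f in line.rstrip("\r\n").split("\t")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 TAB-separated items, got {len(fields)}")
    return Header(*fields)

header = parse_header("SOURCE_CORPUS\tTARGET_CORPUS\ts\tid_{num}\n")
print(header.grid, header.key)  # s id_{num}
```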

Each of the remaining lines in the input file corresponds to a single alignment bead. It consists of the ID of a sentence in the source corpus (or multiple IDs separated by blanks), followed by a TAB character and the ID of the aligned sentence in the target corpus (or multiple IDs separated by blanks).

As an example, the line

    id_1 id_2 id_3  <TAB>  id_2 id_3  [ <TAB> ... ]

indicates that the first three sentences of the source corpus are aligned to the second and third sentences of the target corpus. Further TAB-delimited fields (e.g. specifying the confidence in an alignment bead) are silently ignored.
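The bead format, together with the documented effect of the -e and -p options, might be sketched in Python as follows. The function names parse_bead and keep_bead are hypothetical. Writing a 1:0 or 0:1 bead as an empty field is an assumption, and the sketch only mirrors the behaviour described in this manpage rather than the actual implementation.

```python
def parse_bead(line):
    """Return (source_ids, target_ids) for one alignment bead line.

    Fields are TAB-separated; IDs within a field are separated by blanks.
    Any fields after the second one are ignored.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("bead needs a source and a target field")
    return fields[0].split(), fields[1].split()

def keep_bead(source_ids, target_ids, known_source, known_target,
              prune=False, allow_empty=False):
    """Decide whether a bead should be encoded.

    Returns False for beads that are skipped and raises ValueError where
    the manpage describes a fatal error.
    """
    allow_empty = allow_empty or prune  # -p implies -e
    if not source_ids or not target_ids:
        if allow_empty:
            return False  # 1:0 and 0:1 beads are never encoded
        raise ValueError("1:0 or 0:1 bead (allowed only with -e)")
    missing = [i for i in source_ids if i not in known_source]
    missing += [i for i in target_ids if i not in known_target]
    if missing:
        if prune:
            return False
        raise ValueError(f"unknown sentence IDs: {missing}")
    return True

src, tgt = parse_bead("id_1 id_2 id_3\tid_2 id_3\t0.87\n")
print(src, tgt)  # ['id_1', 'id_2', 'id_3'] ['id_2', 'id_3']
print(keep_bead(src, tgt, {"id_1", "id_2", "id_3"}, {"id_2", "id_3"}))  # True
```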

Unique sentence IDs can also be constructed from multiple attributes, which is more efficient than storing the unique IDs in a single CWB attribute. The individual components are usually XML attributes of the <s> tags, but may also include XML attributes of different regions such as text IDs or paragraph numbers. In the latter case, the full name of the corresponding s-attribute has to be specified. For instance, {text_id}:{num} would construct unique IDs from a text ID and sentence number, separated by a : character.
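To make the uniqueness requirement concrete, here is a small Python sketch that builds IDs of the form {text_id}:{num}, rejects duplicates, and maps the IDs of one side of a bead to a range of consecutive regions. The function names and the sample data are hypothetical; sorting the regions before the contiguity check is an assumption, not documented behaviour.

```python
def build_id_index(ids):
    """Map each sentence ID to its region number; reject duplicate IDs."""
    index = {}
    for region, sid in enumerate(ids):
        if sid in index:
            raise ValueError(
                f"duplicate sentence ID {sid!r} "
                f"(regions {index[sid]} and {region})")
        index[sid] = region
    return index

def bead_span(bead_ids, index):
    """Return (first, last) region numbers for one side of a bead.

    The regions must be consecutive, since discontinuous regions are not
    allowed in alignment beads.
    """
    if not bead_ids:
        raise ValueError("empty side of alignment bead")
    regions = sorted(index[sid] for sid in bead_ids)
    for prev, cur in zip(regions, regions[1:]):
        if cur != prev + 1:
            raise ValueError(f"discontinuous regions: {regions}")
    return regions[0], regions[-1]

# Hypothetical (text ID, sentence number) annotations in corpus order
sentences = [("t1", 1), ("t1", 2), ("t2", 1), ("t2", 2)]
ids = [f"{text_id}:{num}" for text_id, num in sentences]
index = build_id_index(ids)
print(bead_span(["t1:2", "t2:1"], index))  # (1, 2)
```

Sentence numbers alone would not be unique in this example (both texts contain a sentence 1), which is why the text ID is part of the key.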

COPYRIGHT

Copyright (C) 2007-2022 Stephanie Evert [https://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.