NAME

CWB::CL - Perl interface to the low-level C API of the IMS Open Corpus Workbench

SYNOPSIS

  use CWB::CL;

  print "Registry path = ", $CWB::CL::Registry, "\n";
  $CWB::CL::Registry .= ":/home/my_registry";    # add your own registry directory

  # "strict" mode aborts if any error occurs (convenient in one-off scripts)
  CWB::CL::strict(1);                            # or simply load CWB::CL::Strict module
  CWB::CL::set_debug_level('some');              # 'some', 'all' or 'none' (default)
  CWB::CL::set_optimize(1);                      # enable experimental optimizations in CL (if any)

  CWB::CL::error_message();                      # error message for last method call (or "")

  # CWB::CL::Corpus objects
  $corpus = new CWB::CL::Corpus "EUROPARL-EN";   # name of corpus can be upper or lower case
  die "Error: can't access corpus EUROPARL-EN"   # all error conditions return undef
    unless defined $corpus;                      #   (checks are not needed in "strict" mode)
  undef $corpus;                                 # currently a no-op (CL implementation is buggy)

  $charset = $corpus->charset;                   # declared character encoding of the corpus
  $folded = $corpus->normalize("cd", $string);   # CWB-compatible case- and diacritic-folding
                                                 # (use "n" for UTF-8 strings from external sources)

  # CWB::CL::PosAttrib objects (positional attributes)
  $lemma = $corpus->attribute("lemma", 'p');     # returns CWB::CL::PosAttrib object
  $corpus_length = $lemma->max_cpos;             # valid cpos values are 0 .. $corpus_length-1
  $lexicon_size = $lemma->max_id;                # valid id values are 0 .. $lexicon_size-1

  $id  = $lemma->str2id($string);                # lookup lexicon ID of type $string
  @idlist = $lemma->str2id(@strlist);            # (all scalar functions map to lists in list context)
  $str = $lemma->id2str($id);                    # type with lexicon ID $id
  $len = $lemma->id2strlen($id);                 # string length of type with ID $id
  $f   = $lemma->id2freq($id);                   # corpus frequency of type with ID $id
  $id  = $lemma->cpos2id($cpos);                 # lexicon ID of value at corpus position $cpos
  $str = $lemma->cpos2str($cpos);                # type annotated at corpus position $cpos

  @idlist = $lemma->regex2id($re);               # find all lexicon IDs matching regular expression
  @idlist = $lemma->regex2id($re, 'cd');         #   with optional flags 'n', 'c', 'd'
  @cpos = $lemma->idlist2cpos(@idlist);          # occurrences of all types in @idlist
  $total_freq = $lemma->idlist2freq(@idlist);    # total corpus frequency of @idlist (w/o decoding index)


  # CWB::CL::StrucAttrib objects (structural attributes)
  $chapter = $corpus->attribute("chapter", 's'); # returns CWB::CL::StrucAttrib object
  $number_of_regions = $chapter->max_struc;      # valid region numbers are 0 .. $number_of_regions-1
  $has_values = $chapter->struc_values;          # are regions annotated with strings?

  $struc = $chapter->cpos2struc($cpos);          # number of <chapter> region containing $cpos (or undef)
  ($start, $end) = $chapter->struc2cpos($struc); # start and end of region number $struc
  @pairs = $chapter->struc2cpos(@struc_list);    # returns flat list ($s1, $e1, $s2, $e2, ...)
  $str  = $chapter->struc2str($struc);           # annotation string for region number $struc (or undef)
  $str  = $chapter->cpos2str($cpos);             # annotation string for region around $cpos (or undef)

  ($s, $e) = $chapter->cpos2struc2cpos($cpos);   # start/end of <chapter> region around $cpos
  @pairs = $chapter->cpos2struc2cpos(@cpos_list);# returns 2 * N values for N arguments (cf. above)

  # check whether corpus position is at boundary (l, r, lr) or inside/outside (i/o) of region
  if ($chapter->cpos2boundary($cpos) & $CWB::CL::Boundary{'l'}) { ... }
  if ($chapter->cpos2is_boundary('l', $cpos)) { ... }


  # CWB::CL::AlignAttrib objects (alignment attributes)
  $ger = $corpus->attribute("europarl-de", 'a'); # returns CWB::CL::AlignAttrib object
  $nr_of_beads = $ger->max_alg;                  # alignment bead numbers are 0 .. $nr_of_beads-1
  if ($ger->has_extended_alignment) { ... }      # extended alignment allows gaps & crossing alignments
  
  $bead = $ger->cpos2alg($cpos);                 # alignment bead containing $cpos (or undef)
  ($src_start, $src_end, $tgt_start, $tgt_end)   # aligned spans in source and target corpus
      = $ger->alg2cpos($bead);
  @quads = $ger->alg2cpos(@bead_list);           # flat list of quadruplets (one for each alignment bead)
  @quads = $ger->cpos2alg2cpos(@cpos_list);      # find alignments (source/target spans) for corpus position(s)


  # Feature sets (can be used as values of positional and structural attributes)
  $np_f = $corpus->attribute("np_feat", 's');    # p- and s-attributes can store feature sets
  $fs_string = $np_f->cpos2str($cpos);           # feature sets are encoded as strings
  $fs  = CWB::CL::set2hash($fs_string);          # expand feature set into hash (returns hashref)
  if (exists $fs->{"paren"}) { ... }
  $fs1 = CWB::CL::make_set("|proper|nogen|");    # validate feature set (reorders values)
  $fs2 = CWB::CL::make_set("paren nogen proper", 'split'); # or construct from blank-delimited string
  $fs3 = CWB::CL::make_set($fs);                           # or from hash reference
  $fs  = CWB::CL::set_intersection($fs1, $fs2);  # intersection of feature set values
  $n   = CWB::CL::set_size($fs);                 # size of feature set

DESCRIPTION

This module provides an interface to the low-level Corpus Library for accessing CWB-indexed corpora. It follows the Corpus Library API closely, except for an object-oriented design with simplified method names and the addition of a few convenience methods.

All scalar access methods - usually named xxx2yyy - are vectorized: they automatically map to multiple input arguments and return a flat list of results. Vectorization is implemented in C code, ensuring high performance.

All errors and out-of-bounds accesses are turned into undefined values (undef) unless strict mode is enabled (e.g. with use CWB::CL::Strict;). If an item is not found - e.g. a given type string is not in the lexicon of a p-attribute, or a given corpus position is not within an s-attribute region - the method will also return undef. Vectorized method calls may return a mixture of defined and undefined values.
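
Since a vectorized call can mix defined and undefined results, it is often useful to pair each input with its result. The following is a minimal pure-Perl sketch, assuming that a scalar method returns exactly one result per input; split_results is a hypothetical helper and not part of CWB::CL.

```perl
use strict;
use warnings;

# Pair each input with its result from a vectorized call and separate
# the inputs that produced a value from those that returned undef.
# split_results is a hypothetical helper, not part of CWB::CL.
sub split_results {
    my ($inputs, $results) = @_;    # two array refs of equal length
    die "length mismatch\n" unless @$inputs == @$results;
    my (%found, @missing);
    for my $i (0 .. $#$inputs) {
        if (defined $results->[$i]) {
            $found{ $inputs->[$i] } = $results->[$i];
        }
        else {
            push @missing, $inputs->[$i];
        }
    }
    return (\%found, \@missing);
}

# Typical use (requires a real p-attribute handle $lemma):
#   my @words = qw(dog cat unicorn);
#   my @ids   = $lemma->str2id(@words);
#   my ($found, $missing) = split_results(\@words, \@ids);
```

Types that are not in the lexicon end up in the second list, so they can be reported without aborting the whole lookup.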

CWB3 DATA MODEL

CWB is based on a tabular data model, which represents a corpus as a sequence of tokens annotated with one or more string values in the form of an annotation table.

      word    pos     lemma
      ----    ---     -----
  0   Dogs    NNS     dog
  1   like    VBP     like
  2   cats    NNS     cat
  3   .       SENT    .
  4   Cats    NNS     cat
  5   do      VBP     do
  6   n't     RB      not
  7   like    VB      like
  8   dogs    NNS     dog
  9   .       SENT    .

Tokens are identified by their row number starting from 0, which is known as corpus position (or cpos) for short. The first column of the table, always labelled word, contains the surface forms of the tokens. Further columns, which can be labelled with arbitrary ASCII identifiers, contain token-level annotations (in this case part-of-speech tags (pos) and lemmatization (lemma)). Each table column forms a separate positional attribute (or p-attribute for short) in the CWB data model. The token sequence itself is thus a regular p-attribute with the special name word.

For the sake of efficiency and data compression, p-attributes use a numeric indexing scheme based on a lexicon of all distinct annotation strings (types), which are assigned numeric IDs starting from 0. Each p-attribute has its own lexicon, e.g. for pos with the types 0 = NNS, 1 = VBP, 2 = SENT, 3 = RB and 4 = VB.
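
The indexing scheme can be illustrated with a toy model in plain Perl. This sketch assumes that IDs are assigned in order of first occurrence, which matches the pos example above; it is not the actual CWB data format, and index_column is a hypothetical name.

```perl
use strict;
use warnings;

# Toy model of a p-attribute: a lexicon of types, their frequencies,
# and the sequence of type IDs at each corpus position.
# Assumption: IDs are assigned in order of first occurrence.
sub index_column {
    my @column = @_;
    my (%id_of, @lexicon, @freq, @id_at);
    for my $type (@column) {
        if (!exists $id_of{$type}) {
            $id_of{$type} = scalar @lexicon;    # next free ID
            push @lexicon, $type;
            push @freq, 0;
        }
        my $id = $id_of{$type};
        $freq[$id]++;
        push @id_at, $id;                       # ID at this corpus position
    }
    return { lexicon => \@lexicon, freq => \@freq, ids => \@id_at };
}

my @pos   = qw(NNS VBP NNS SENT NNS VBP RB VB NNS SENT);
my $index = index_column(@pos);
my @types = map { sprintf "%d=%s", $_, $index->{lexicon}[$_] }
            0 .. $#{ $index->{lexicon} };
print "@types\n";                  # 0=NNS 1=VBP 2=SENT 3=RB 4=VB
print "ids: @{ $index->{ids} }\n"; # ids: 0 1 0 2 0 1 3 4 0 2
```

In this model, the three arrays correspond roughly to what id2str, id2freq and cpos2id return for a real attribute.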

CWB::CL provides methods for mapping between corpus positions, lexicon IDs, type strings and type frequencies.

XML tags in CWB input files are stored as structural attributes (or s-attributes for short). Each s-attribute indexes a sequence of non-overlapping, non-nested regions corresponding to XML elements of the same name. Consider this example:

      <text title="All about dogs">
      <s n="1" words="3">
  0   Dogs    NNS     dog
  1   like    VBP     like
  2   cats    NNS     cat
  3   .       SENT    .
      </s>
      <s n="2" words="4">
  4   Cats    NNS     cat
  5   do      VBP     do
  6   n't     RB      not
  7   like    VB      like
  8   dogs    NNS     dog
  9   .       SENT    .
      </s>
      </text>

Note that there are no separate corpus positions assigned to XML tags, which are positioned at boundaries between tokens. The single <text> region is stored in an s-attribute named text; the two <s> regions are stored in an s-attribute named s. Attribute-value pairs in XML start tags are converted to additional s-attributes text_title, s_n and s_words.

Each s-attribute region is represented by its start and end corpus position, e.g. (0, 3) for the first sentence and (4, 9) for the second sentence above. The regions are numbered starting from 0; such region numbers are referred to as struc in method names.

If an s-attribute represents annotation in XML start tags, its regions are annotated with string values (e.g. "3" and "4" for the two regions of s-attribute s_words). These strings are not indexed with the help of a lexicon, so access is much less efficient than for p-attributes.

CWB::CL provides methods to access the span and annotation of an s-attribute region, to find the region number containing a given cpos and to test for the start or end of a region.

Sentence-level alignment between two different corpora is represented by alignment attributes (or a-attributes for short). The name of an alignment attribute corresponds to the CWB ID of the target corpus in lowercase; as a consequence, there can only be a single alignment for each pair of source and target corpus. An a-attribute indexes a sequence of alignment beads that connect a token span (src_start, src_end) in the source corpus with a span (tgt_start, tgt_end) in the target corpus. These spans need not correspond to sentence regions.

Alignment beads are numbered starting from 0, in the order of their positions in the source corpus. Both the source spans and the target spans must be non-overlapping and must not be nested. Most alignment attributes will use this new-style "extended" format. Only some legacy corpora may contain old-style a-attributes, which do not allow for crossing alignments or gaps between beads.

CWB::CL provides methods to access the source and target spans of an alignment bead, and to find the bead number containing a given corpus position in the source corpus.

Global Configuration and Utilities

$CWB::CL::Registry

Path to CWB registry directory, or multiple paths separated by colons (:). This variable can be modified to change the registry in which corpora will be searched. It does not affect CWB::CL::Corpus objects that have already been created.

$error = CWB::CL::error_message();

Human-readable error message for an error encountered during the last method call. If the call was successful, an empty string is returned.

CWB::CL::strict(1);

Enable strict mode, so that the Perl script will immediately be terminated if there is any error or invalid access (instead of returning undef values). Strict mode can also be enabled by importing the module as CWB::CL::Strict (use CWB::CL::Strict;).

Strict mode is a convenience feature for one-off scripts and command-line tools run by end users. Production software should keep strict mode disabled and check all return values instead.
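
With strict mode disabled, the explicit checks might look as follows. This is a minimal sketch: open_attribute is a hypothetical helper, it assumes that the caller has already loaded CWB::CL, and it assumes that error_message() is also informative after a failed constructor call.

```perl
use strict;
use warnings;

# Open a corpus and one of its attributes, dying with a readable message
# if either step returns undef (strict mode is assumed to be off).
# open_attribute is a hypothetical helper, not part of CWB::CL.
sub open_attribute {
    my ($corpus_id, $att_name, $att_type) = @_;

    my $corpus = CWB::CL::Corpus->new($corpus_id);
    die "Cannot access corpus $corpus_id: " . CWB::CL::error_message() . "\n"
        unless defined $corpus;

    my $att = $corpus->attribute($att_name, $att_type);
    die "No $att_type-attribute '$att_name' in $corpus_id: "
        . CWB::CL::error_message() . "\n"
        unless defined $att;

    return ($corpus, $att);
}

# Typical use, after "use CWB::CL;":
#   my ($corpus, $lemma) = open_attribute("EUROPARL-EN", "lemma", "p");
```

A long-running application would usually catch these exceptions with eval and report them to its own caller instead of terminating.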

CWB::CL::set_debug_level($lvl)

Set the amount of debugging information printed on stderr by the Corpus Library. Admissible values for $lvl are 0 or none (no output), 1 or some (some messages), 2 or all (all messages).

CWB::CL::set_optimize(1);

Enable experimental optimizations in the Corpus Library.

Stable releases (including v3.5) do not contain any experimental optimizations, so this option has no effect at present.

Feature Sets

Feature set annotation uses a special string notation for sets of feature values. The individual values in the set are sorted in CWB order, separated by pipe characters (|) and enclosed in pipe characters. For example, the set {small, medium, big} is represented by the string

  |big|medium|small|

and the empty set by

  |

Keep in mind that there must not be any duplicate values in a set. Feature sets can be used as annotation values for p-attributes and s-attributes. The Corpus Query Processor (CQP) provides special operators contains and matches for searching feature sets with regular expressions, as well as functions for computing set size (ambiguity()) and set intersection (unify()).

CWB::CL offers some convenience functions for creating and manipulating feature sets. These functions are implemented in C code for efficiency.

$fs = CWB::CL::make_set($values [, 's']);

Create a feature set from $values, which is either a string in feature set notation or a hashref. In the first case, correct notation is checked and the values are sorted if necessary (CWB will automatically add the surrounding delimiters if need be). In the second case, a feature set is constructed from the keys of the hash %$values.

If a second argument s (or split) is passed, the string $values is split on whitespace.

$fs = CWB::CL::set_intersection($fs1, $fs2);

Compute the intersection of two feature sets $fs1 and $fs2, i.e. a feature set containing all shared values. This function only works correctly if both arguments are sorted and use valid feature set notation. It corresponds to the unify() function in CQP.

$n = CWB::CL::set_size($fs);

Return the cardinality of a feature set $fs, i.e. the number of elements. This function only works correctly if $fs uses valid feature set notation. It corresponds to the ambiguity() function in CQP.

$values = CWB::CL::set2hash($fs);

Expand feature set $fs in CWB notation into a hash, with elements as keys and values set to 1. Returns a hashref $values.
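
To make the notation concrete, here is a pure-Perl counterpart of these functions. It is only an illustration and real scripts should prefer the C implementations. The sketch assumes that "CWB order" can be approximated by Perl's default string comparison, and the names fs_from_list, fs_to_hash and fs_intersection are hypothetical.

```perl
use strict;
use warnings;

# Build feature set notation from a list of values.
# Assumption: "CWB order" is approximated by Perl's string comparison.
sub fs_from_list {
    my %seen;
    my @values = sort { $a cmp $b } grep { !$seen{$_}++ } @_;   # no duplicates
    return @values ? "|" . join("|", @values) . "|" : "|";
}

# Expand feature set notation into a hash (values as keys, set to 1).
sub fs_to_hash {
    my ($fs) = @_;
    die "invalid feature set '$fs'\n" unless $fs =~ /^\|(?:[^|]+\|)*$/;
    my %set = map { $_ => 1 } grep { length } split /\|/, $fs;
    return \%set;
}

# Intersection of two feature sets given in feature set notation.
sub fs_intersection {
    my ($fs1, $fs2) = @_;
    my $in2 = fs_to_hash($fs2);
    return fs_from_list(grep { $in2->{$_} } keys %{ fs_to_hash($fs1) });
}

print fs_from_list(qw(small medium big)), "\n";   # |big|medium|small|
print fs_from_list(), "\n";                       # |
```

The validation pattern in fs_to_hash rejects strings without the enclosing pipe characters and strings containing empty values.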

Corpora (CWB::CL::Corpus)

Each CWB corpus is represented by a CWB::CL::Corpus object. The object constructor locates a suitable registry file and accesses the corpus. Attribute handles are then obtained with the attribute method.

$corpus = new CWB::CL::Corpus $ID;

Access corpus with CWB ID $ID, usually specified in uppercase letters. The constructor looks for a registry file in the path(s) specified by $CWB::CL::Registry. Returns a corpus handle, i.e. an object of class CWB::CL::Corpus, or undef if the corpus cannot be found (unless strict mode is enabled).

$att = $corpus->attribute($name, $type);

Obtain attribute handle for the attribute with name $name and type $type (p = positional, s = structural, a = alignment). Note that legacy corpora may contain attributes of different types with the same name, even though this has been deprecated. Returns undef if the attribute does not exist (unless strict mode is enabled).

Classes for handles of different attribute types and their access methods are described below.

@names = $corpus->list_attributes([$type]);

Returns the names of all attributes defined for $corpus. Attribute names will be listed in the same order as in the registry file.

If $type is specified, only list attributes of the selected type (p, s or a).

$folded = $corpus->normalize($flags, $string);
@folded = $corpus->normalize($flags, @strings);

Normalize one or more strings according to $flags, which is any combination of the flags below in the specified order.

  n   normalize UTF-8 strings to CWB canonical form (NFC)
  c   fold strings to lowercase
  d   remove all diacritics (combining marks)

Admissible values for $flags are thus c, d, cd, n, nc, nd and ncd. Note that normalize is a method because it depends on the character encoding of the corpus.

$charset = $corpus->charset;

Character encoding of $corpus (using CWB notation, same as in registry files). Typical values are utf8, latin1 and ascii.

Positional Attributes (CWB::CL::PosAttrib)

Handles for p-attributes are represented by objects of class CWB::CL::PosAttrib. They should never be constructed directly, but rather obtained from the attribute method of a corpus handle.

$N = $att->max_cpos;

Returns the number of tokens in the corpus (which is technically a property of each p-attribute). Note that the name of the function is misleading: valid corpus positions range from 0 to $N-1.

$V = $att->max_id;

Returns the number of distinct types in the lexicon of the p-attribute. Note that the name of the function is misleading: valid type IDs range from 0 to $V-1.

$type = $att->id2str($id);
@types = $att->id2str(@ids);

Find type (string) corresponding to numerical lexicon $id. Returns undef for lexicon IDs that are out of range and all other errors.

$len = $att->id2strlen($id);
@lens = $att->id2strlen(@ids);

Returns length of type string corresponding to numerical lexicon $id, measured in bytes. This method is provided for consistency with the Corpus Library API, where it determines string length efficiently without having to scan the string. The Perl method offers no such speed benefit, so id2str should be preferred.

$f = $att->id2freq($id);
@fs = $att->id2freq(@ids);

Returns corpus frequency of the type with numerical lexicon ID $id (undef for lexicon IDs that are out of range and all other errors).

$id = $att->str2id($type);
@ids = $att->str2id(@types);

Search $type (string) in lexicon and return its ID if successful. Returns undef for all types not found in the lexicon and for all errors. An out-of-vocabulary $type is not an error and will return undef even in strict mode.

$id = $att->cpos2id($cpos);
@ids = $att->cpos2id(@cpos);

Returns the lexicon ID of the annotation at corpus position $cpos (undef if $cpos is out of range and all other errors).

$type = $att->cpos2str($cpos);
@types = $att->cpos2str(@cpos);

Returns the type string annotated at corpus position $cpos (undef if $cpos is out of range and all other errors).

This method is equivalent to

  @types = $att->id2str($att->cpos2id(@cpos));

but faster and it does not have to allocate memory for the intermediate result. It is very convenient for displaying parts of the corpus text.

@ids = $att->regex2id($rx[, $flags]);

Scan lexicon of $att with regular expression $rx and return the lexicon IDs of all matching types. $rx always has to match the full type string; start and end anchors are not required. The Corpus Library uses PCRE regular expressions, so the two lines below are mostly equivalent:

  @types = $att->id2str($att->regex2id($rx));

  @types = grep { /^($rx)$/ } $att->id2str(0 .. ($att->max_id - 1));

However, there will be differences in some corner cases, e.g. case-insensitive matching for non-ASCII characters.

The optional argument $flags consists of any combination of the flags below in the specified order.

  n   normalize $rx to CWB canonical form (NFC)
  c   case-insensitive
  d   ignore diacritics (combining marks)

Admissible values for $flags are thus c, d, cd, n, nc, nd and ncd. The n flag is highly recommended for regular expressions provided by users.

regex2id returns an empty list if $rx does not match any types or if there are any errors, in particular in case of an invalid regular expression. Unless strict mode is enabled, Perl scripts need to check CWB::CL::error_message() in order to catch syntax errors in $rx.
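
One way to tell "no match" from "invalid expression" is a small wrapper along the following lines. This is a sketch only: regex2id_checked is a hypothetical helper, and it assumes that the caller has loaded CWB::CL with strict mode disabled.

```perl
use strict;
use warnings;

# Call regex2id and turn a regular expression error into an exception,
# while an empty result without error simply means "no matching types".
# regex2id_checked is a hypothetical helper, not part of CWB::CL.
sub regex2id_checked {
    my ($att, $rx, $flags) = @_;
    my @ids = defined $flags ? $att->regex2id($rx, $flags)
                             : $att->regex2id($rx);
    if (!@ids) {
        my $err = CWB::CL::error_message();     # "" after a successful call
        die "Invalid regular expression '$rx': $err\n"
            if defined $err && $err ne "";
    }
    return @ids;
}

# Typical use with a p-attribute handle $lemma and user input $rx:
#   my @ids = eval { regex2id_checked($lemma, $rx, 'n') };
#   warn $@ if $@;
```

The error message is read immediately after the method call, because any later call would overwrite it.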

$f = $att->idlist2freq(@ids);

Returns the total corpus frequency of all type IDs in the list @ids (undef if any of the lexicon IDs is out of range or another error occurs). Equivalent to

  use List::Util qw(sum);
  $f = sum($att->id2freq(@ids));

but much faster because the summation is carried out in C code.

@cpos = $att->idlist2cpos(@ids);

Look up all corpus positions annotated with one of the type IDs in @ids, merged into a single numerically sorted list. Returns an empty list if there is any error.

There is no separate method for the occurrences of a single type $id, but idlist2cpos recognises this special case and uses more efficient code (because the occurrences can be looked up directly in the inverted index). The undocumented method id2cpos is simply an alias for idlist2cpos.

Structural Attributes (CWB::CL::StrucAttrib)

Handles for s-attributes are represented by objects of class CWB::CL::StrucAttrib. They should never be constructed directly, but rather obtained from the attribute method of a corpus handle.

$n = $att->max_struc;

Returns the total number of regions for the s-attribute. Note that the name of the function is misleading: valid region numbers range from 0 to $n-1.

$has_values = $att->struc_values;

Returns TRUE if regions of this s-attribute are annotated with string values.

$struc = $att->cpos2struc($cpos);
@strucs = $att->cpos2struc(@cpos);

Returns the number of the region containing corpus position $cpos, or undef if $cpos is not inside a region of this s-attribute (and in case of any errors, including out-of-bounds $cpos).

It is not an error for $cpos to be outside a region, so undef will be returned even in strict mode.

$value = $att->struc2str($struc);
@values = $att->struc2str(@strucs);

Obtain the string value that region number $struc is annotated with. Returns undef in case of any error, in particular if $att->struc_values is FALSE.

Note that there is no method to search regions for a particular annotation string or regular expression. Scripts will have to loop over all regions in the s-attribute and carry out such tests in Perl code.
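
Such a loop can be written compactly with one vectorized call. The following sketch uses a hypothetical helper find_regions and assumes that $att is a CWB::CL::StrucAttrib handle obtained elsewhere; for very large s-attributes it may be preferable to process the region numbers in batches rather than all at once.

```perl
use strict;
use warnings;

# Return the numbers of all regions of s-attribute $att whose annotation
# string matches the Perl regular expression $pattern.
# find_regions is a hypothetical helper, not part of CWB::CL.
sub find_regions {
    my ($att, $pattern) = @_;
    return () unless $att->struc_values;          # regions have no strings
    my $n = $att->max_struc;
    return () unless $n;                          # error or no regions
    my @values = $att->struc2str(0 .. $n - 1);    # one vectorized call
    return grep { defined $values[$_] && $values[$_] =~ $pattern }
           0 .. $#values;
}

# Typical use with an s-attribute handle $title (e.g. text_title):
#   my @hits  = find_regions($title, qr/dogs/i);
#   my @spans = $title->struc2cpos(@hits);
```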

$value = $att->cpos2str($cpos);
@values = $att->cpos2str(@cpos);

Obtain the string value annotation of the region containing corpus position $cpos. Returns undef if $cpos is not inside any region of the s-attribute (and in case of any errors, in particular if $att->struc_values is FALSE). It is not an error for $cpos to be outside a region, so undef will be returned even in strict mode.

This method is fully equivalent to

  @values = $att->struc2str($att->cpos2struc(@cpos));

but is faster and more convenient if the region numbers are not needed otherwise. An alias cpos2struc2str is provided for consistency with the Corpus Library API, but cpos2str is the preferred form.

($start, $end) = $att->struc2cpos($struc);
@pairs = $att->struc2cpos(@strucs);

Returns start and end corpus position of region number $struc, or (undef, undef) if there is any error.

If multiple region numbers are supplied, a flat list of start/end pairs is returned (possibly containing pairs of undefs). For example, the call @pairs = $att->struc2cpos($n1, $n2, $n3); returns

  @pairs = ($s1, $e1, $s2, $e2, $s3, $e3);

($start, $end) = $att->cpos2struc2cpos($cpos);
@pairs = $att->cpos2struc2cpos(@cpos);

Returns start and end corpus position of the region containing corpus position $cpos, or (undef, undef) if $cpos is not within a region of the s-attribute (and for any error). For multiple @cpos, the method returns a flat list of start/end pairs like struc2cpos.

It is not an error for $cpos to be outside a region, so (undef, undef) pairs will be returned even in strict mode.
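
Flat result lists are often easier to process once they have been regrouped. A minimal pure-Perl sketch, with tuples as a hypothetical helper name; the same function would also regroup lists of four values per item.

```perl
use strict;
use warnings;

# Split a flat result list into tuples of $size elements (2 for start/end
# pairs), dropping tuples that contain undefined values.
# tuples is a hypothetical helper, not part of CWB::CL.
sub tuples {
    my ($size, @flat) = @_;
    die "tuple size must be positive\n" unless $size > 0;
    die "list length is not a multiple of $size\n" if @flat % $size;
    my @result;
    while (my @tuple = splice(@flat, 0, $size)) {
        next if grep { !defined $_ } @tuple;    # skip (undef, undef, ...)
        push @result, [@tuple];
    }
    return @result;
}

# Typical use with an s-attribute handle $chapter:
#   for my $region (tuples(2, $chapter->cpos2struc2cpos(@cpos_list))) {
#       my ($start, $end) = @$region;
#       ...
#   }
```

Note that dropping undefined tuples loses the one-to-one correspondence with the input positions, so this variant only fits cases where that correspondence is not needed.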

if ($att->cpos2is_boundary($which, $cpos)) { ... }
@yesno = $att->cpos2is_boundary($which, @cpos);

Test whether corpus position $cpos is at the boundary of, inside or outside a region of s-attribute $att. Returns TRUE if the test succeeds, FALSE otherwise, and undef in case of an error.

The parameter $which determines which test is carried out. The following short and long codes are supported:

  i   inside      cpos is anywhere inside a region
  o   outside     cpos is not inside a region
  l   left        cpos is the first token in a region
  r   right       cpos is the last token in a region
  lr  leftright   cpos is a single-token region (first AND last)
  rl  rightleft   (same)

There is no single test for whether $cpos is either the start or the end of a region. For this and other complex tests, the method cpos2boundary can be used.

$flags = $att->cpos2boundary($cpos);
@flags = $att->cpos2boundary(@cpos);

Returns an integer $flags where several flag bits can be set indicating whether $cpos is at the left/right boundary of, and/or inside a region. Currently three bits are in use:

  $CWB::CL::Boundary{"inside"}  set if $cpos is inside region
  $CWB::CL::Boundary{"left"}    set if $cpos is the first token of a region
  $CWB::CL::Boundary{"right"}   set if $cpos is the last token of a region

Use bitwise operators to test for individual flags or combinations of these flags. For example, at the start of a region both inside and left bits will be set. A $cpos outside a region returns $flags = 0. An "inner" token inside a region (which is neither the first nor the last token) has only the inside bit set. (Note: The leftright test in cpos2is_boundary checks whether all three bits are set.)
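
The combined test for "start or end of a region", which cpos2is_boundary does not offer, can be sketched with a bit mask. at_region_edge is a hypothetical helper; it assumes that CWB::CL has been loaded, so that %CWB::CL::Boundary contains the flag bits.

```perl
use strict;
use warnings;

# True (1) if $cpos is the first or the last token of a region of $att,
# false (0) otherwise, and undef if cpos2boundary reports an error.
# at_region_edge is a hypothetical helper, not part of CWB::CL.
sub at_region_edge {
    my ($att, $cpos) = @_;
    my $flags = $att->cpos2boundary($cpos);
    return undef unless defined $flags;
    my $edge = $CWB::CL::Boundary{left} | $CWB::CL::Boundary{right};
    return ($flags & $edge) ? 1 : 0;
}

# Typical use with an s-attribute handle $s:
#   print "sentence boundary at $cpos\n" if at_region_edge($s, $cpos);
```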

Alignment Attributes (CWB::CL::AlignAttrib)

Handles for a-attributes are represented by objects of class CWB::CL::AlignAttrib. They should never be constructed directly, but rather obtained from the attribute method of a corpus handle.

$n = $att->max_alg;

Returns the total number of alignment beads for the a-attribute. Note that the name of the function is misleading: valid bead numbers range from 0 to $n-1.

$ok = $att->has_extended_alignment;

Returns TRUE if the a-attribute uses "extended" format. There is no difference in access patterns, but the script has to expect crossing alignments and gaps between beads if $ok is TRUE. Most aligned corpora will be in extended format.

$bead = $att->cpos2alg($cpos);
@beads = $att->cpos2alg(@cpos);

Returns the number of the alignment bead containing corpus position $cpos, or undef if $cpos is not inside a bead of this a-attribute (and in case of any errors, including out-of-bounds $cpos).

It is not an error for $cpos to be outside a bead (provided that the a-attribute uses "extended" format), so undef will be returned even in strict mode.

($src_start, $src_end, $tgt_start, $tgt_end) = $att->alg2cpos($bead);
@quads = $att->alg2cpos(@beads);

Returns the aligned spans in source and target corpus for alignment bead number $bead, or (undef, undef, undef, undef) if there is any error.

If multiple bead numbers are supplied, a flat list of quadruplets is returned (possibly containing quadruplets of undefs). For example, the call @quads = $att->alg2cpos($A, $B); returns

  @quads = ($A_s1, $A_s2, $A_t1, $A_t2, $B_s1, $B_s2, $B_t1, $B_t2);

($src_start, $src_end, $tgt_start, $tgt_end) = $att->cpos2alg2cpos($cpos);
@quads = $att->cpos2alg2cpos(@cpos);

Returns the aligned source and target spans for the alignment bead containing corpus position $cpos, or (undef, undef, undef, undef) if $cpos is not inside a bead of this a-attribute (and in case of any errors, including out-of-bounds $cpos).

If multiple corpus positions are supplied, a flat list of quadruplets is returned in the same way as for alg2cpos.
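
A common use of this method is to display the target-language text aligned to a source position. The sketch below uses a hypothetical helper aligned_tokens. It assumes that $align is a CWB::CL::AlignAttrib handle of the source corpus and that $tgt_word is a CWB::CL::PosAttrib handle (e.g. for word) obtained from a separately opened target corpus.

```perl
use strict;
use warnings;

# Return the target-corpus tokens aligned to source position $cpos,
# or an empty list if $cpos is not inside an alignment bead.
# aligned_tokens is a hypothetical helper, not part of CWB::CL.
sub aligned_tokens {
    my ($align, $tgt_word, $cpos) = @_;
    my ($src_start, $src_end, $tgt_start, $tgt_end)
        = $align->cpos2alg2cpos($cpos);
    return () unless defined $tgt_start && defined $tgt_end;
    return $tgt_word->cpos2str($tgt_start .. $tgt_end);
}

# Typical use, with handles from the source and target corpus:
#   my @tokens = aligned_tokens($ger, $de_word, $cpos);
#   print "@tokens\n" if @tokens;
```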

EXAMPLE

The minimalistic example script below requires the DICKENS demo corpus to be installed in the standard registry path. It compiles a lemma frequency list for all <title> regions in the corpus and prints the first 20 entries. Note how it uses CWB::CL::Strict to avoid checking return values for error conditions.

  use CWB::CL::Strict;
   
  my $C = new CWB::CL::Corpus "DICKENS";
  my $Lemma = $C->attribute("lemma", "p");
  my $Title = $C->attribute("title", "s");
   
  my $n_titles = $Title->max_struc;
  my %F = ();
   
  foreach my $i (0 .. ($n_titles - 1)) {
    my ($start, $end) = $Title->struc2cpos($i);
    foreach my $lemma ($Lemma->cpos2str($start .. $end)) {
      $F{$lemma}++;
    }
  }
   
  my @lemmas = sort {$F{$b} <=> $F{$a}} keys %F;
  foreach my $lemma (@lemmas[0 .. 19]) {
    printf "%8d %s\n", $F{$lemma}, $lemma;
  }

COPYRIGHT

Copyright (C) 1999-2022 by Stephanie Evert [https://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.