Alberto Simões


CWB::CL - Perl interface to the low-level Corpus Library of the IMS Open CWB


  use CWB::CL;

  print "Registry path = ", $CWB::CL::Registry, "\n";
  $CWB::CL::Registry .= ":/home/my_registry";    # add your own registry directory

  # "strict" mode aborts if any error occurs (convenient in one-off scripts)
  CWB::CL::strict(1);                            # or simply load CWB::CL::Strict module
  CWB::CL::set_debug_level('some');              # 'some', 'all' or 'none' (default)

  # CWB::CL::Corpus objects
  $corpus = CWB::CL::Corpus->new("HANSARD-EN");  # name of corpus can be upper or lower case
  die "Error: can't access corpus HANSARD-EN"    # all error conditions return undef
    unless defined $corpus;                      #   (checks are not needed in "strict" mode)
  undef $corpus;                                 # release the object when done; currently, mapped memory cannot be freed

  # CWB::CL::Attribute objects (positional attributes)
  $lemma = $corpus->attribute("lemma", 'p');     # returns CWB::CL::Attribute object
  $corpus_length = $lemma->max_cpos;             # valid cpos values are 0 .. $corpus_length-1
  $lexicon_size = $lemma->max_id;                # valid id values are 0 .. $lexicon_size-1

  $id  = $lemma->str2id($string); 
  @idlist = $lemma->str2id(@strlist);            # all scalar functions map to lists in list context
  $str = $lemma->id2str($id);
  $len = $lemma->id2strlen($id);
  $f   = $lemma->id2freq($id);
  $id  = $lemma->cpos2id($cpos);
  $str = $lemma->cpos2str($cpos);

  @idlist = $lemma->regex2id($re);               # regular expression matching
  @cpos = $lemma->idlist2cpos(@idlist);          # accessing the index (occurrences of given IDs)
  $total_freq = $lemma->idlist2freq(@idlist);    # better check the list size first on large corpora

  # CWB::CL::AttStruc objects (structural attributes)
  $chapter = $corpus->attribute("chapter", 's'); # returns CWB::CL::AttStruc object
  $number_of_regions = $chapter->max_struc;      # valid region numbers are 0 .. $number_of_regions-1
  $has_values = $chapter->struc_values;          # are regions annotated with strings?

  $struc = $chapter->cpos2struc($cpos);          # returns undef if $cpos is not in <chapter> region
  ($start, $end) = $chapter->struc2cpos($struc); # returns empty list on error -> $start is undefined
  ($start, $end) = $chapter->cpos2struc2cpos($cpos);   # returns empty list if not in <chapter> region
      # returns 2 * <n> values (= <n> start/end pairs) if called with <n> arguments
  $str  = $chapter->struc2str($struc);           # always returns undef if not $chapter->struc_values
  $str  = $chapter->cpos2str($cpos);             # combines cpos2struc() and struc2str() 

  # check whether corpus position is at boundary (l, r, lr) or inside/outside (i/o) of region
  if ($chapter->cpos2boundary($cpos) & $CWB::CL::Boundary{'l'}) { ... }
  if ($chapter->cpos2is_boundary('l', $cpos)) { ... }

  # CWB::CL::AttAlign objects (alignment attributes)
  $french = $corpus->attribute("hansard-fr", 'a'); # returns CWB::CL::AttAlign object
  $nr_of_alignments = $french->max_alg;          # alignment block numbers are 0 .. $nr_of_alignments-1
  $extended = $french->has_extended_alignment;   # extended alignment allows gaps & crossing alignments
  $alg = $french->cpos2alg($cpos);               # returns undef if no alignment was found
  ($src_start, $src_end, $target_start, $target_end) 
      = $french->alg2cpos($alg);                 # returns empty list on error
      # or use convenience function $french->cpos2alg2cpos($cpos);

  # Feature sets (used as values of CWB::CL::Attribute and CWB::CL::AttStruc)
  $np_f = $corpus->attribute("np_feat", 's');    # p- and s-attributes can store feature sets
  $fs_string = $np_f->cpos2str($cpos);           # feature sets are encoded as strings
  $fs = CWB::CL::set2hash($fs_string);           # expand feature set into hash (reference)
  if (exists $fs->{"paren"}) { ... }
  $fs1 = CWB::CL::make_set("|proper|nogen|");    # validate feature set or construct from string
  $fs2 = CWB::CL::make_set("paren nogen proper", 'split');
  $fs  = CWB::CL::set_intersection($fs1, $fs2);  # intersection of feature set values
  $n   = CWB::CL::set_size($fs);                 # size of feature set
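
The calls above can be combined into a small lookup routine. The following is a minimal sketch, not an official recipe: it uses only methods shown in the synopsis, the helper name lemma_frequency is hypothetical, and it assumes that str2id() yields undef for a string that is not in the lexicon (in line with "all error conditions return undef"). The corpus HANSARD-EN and the attribute "lemma" are taken from the synopsis and must exist in your registry for the lookup to succeed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the corpus frequency of $string on the
# positional attribute $att_name, or undef if any step fails.
sub lemma_frequency {
    my ($corpus_name, $att_name, $string) = @_;

    # Load CWB::CL at run time so that the sketch still compiles where
    # the module is not installed (it then simply reports failure).
    return unless eval { require CWB::CL; 1 };

    my $corpus = CWB::CL::Corpus->new($corpus_name);
    return unless defined $corpus;          # corpus could not be accessed

    my $att = $corpus->attribute($att_name, 'p');
    return unless defined $att;             # no such positional attribute

    my $id = $att->str2id($string);
    return unless defined $id;              # assumption: unknown strings yield undef

    my $freq = $att->id2freq($id);
    return $freq;
}

my $f = lemma_frequency("HANSARD-EN", "lemma", "parliament");
print defined $f ? "f(parliament) = $f\n" : "lookup failed\n";
```

Each step is checked explicitly because the sketch does not enable "strict" mode; with CWB::CL::strict(1) the checks could presumably be dropped, at the cost of the script aborting on the first error.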


Sorry, there is no full description for this module yet, since the CWB Corpus Library, on which CWB::CL is based, does not have complete documentation.

All of the corpus access functions provided by the CWB::CL module are subject to change in version 4.0 of the CWB. If you want to use CWB::CL anyway, have a look at the test scripts in subdirectory t/ of the distribution.
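
Because the access functions may change, a script can guard individual calls instead of assuming that a method exists. The sketch below shows one possible pattern using Perl's built-in can() method. The helper call_if_available and the class Demo::Attribute are hypothetical stand-ins written for this illustration and are not part of CWB::CL; whether all CWB::CL methods are visible to can() is an assumption that should be checked against the installed version.

```perl
use strict;
use warnings;

# Call $method on $object only if the object actually provides it.
# Returns undef (empty list in list context) when the method is missing.
sub call_if_available {
    my ($object, $method, @args) = @_;
    my $code = $object->can($method) or return;
    return $object->$code(@args);
}

# Stand-in class for demonstration only (hypothetical, not part of CWB::CL).
package Demo::Attribute {
    sub new    { my ($class) = @_; return bless {}, $class }
    sub max_id { return 42 }
}

my $att     = Demo::Attribute->new;
my $size    = call_if_available($att, 'max_id');        # method exists
my $missing = call_if_available($att, 'no_such_call');  # method absent

print "max_id = $size\n";
print "no_such_call is ", (defined $missing ? "available" : "unavailable"), "\n";
```

With the stand-in class this prints "max_id = 42" followed by "no_such_call is unavailable". In a real script the same helper would be handed an attribute object obtained from $corpus->attribute(...).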


Copyright (C) 1999-2010 by Stefan Evert

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.