Alberto Simões 🐪


cwb-make - Automated indexing and compression for CWB corpora


  cwb-make [options] CORPUS [<attributes>]


  -r <dir>   use registry directory <dir> [system default]
  -M <n>     use <n> MBytes of RAM for indexing [default: 75]
  -V         validate newly created files
  -g <name>  put newly created files into group <name>
  -p <nnn>   set access permissions of created files to <nnn>
  -D         activate debugging output
  -h         show help page

Long forms of command-line options are listed below.


The cwb-make utility automates index building and compression for a CWB corpus, calling cwb-makeall, cwb-huffcode and cwb-compress-rdx as needed. Main advantages over the manual procedure are:

  • Old index files are updated automatically (unlike cwb-makeall, which does not check the age of index files), and it is safe to call cwb-make on an indexed and compressed corpus (again, unlike cwb-makeall).

  • Data files that are no longer needed after compression are immediately deleted.

  • The build process is optimised to reduce the amount of temporary disk space and memory needed. This is particularly important when indexing large corpora on 32-bit platforms, where cwb-makeall might easily run out of address space when called directly.

The basic usage pattern is

cwb-make [options] CORPUS [attribute ...]

where CORPUS is the CWB name (ID) of the corpus to be indexed (after encoding with cwb-encode) and should be written in upper case. If positional attributes are added at a later time, they can be indexed separately by specifying the attribute names after the corpus ID. Note that it is always safe simply to call cwb-make: existing indexed and compressed attributes will be ignored. Further command-line options are detailed below.
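
As an illustration, the two typical invocations can be sketched as follows. The corpus ID EXAMPLE and the attribute names lemma and pos are hypothetical, and the helper cwb_make_cmd is not part of CWB: it only prints the command line so that it can be inspected before it is run.

```shell
#!/bin/sh
# Sketch: assemble cwb-make command lines without running them.
# EXAMPLE, lemma and pos are hypothetical names used for illustration only.

# Print a cwb-make command line for corpus $1 and optional attributes $2...
cwb_make_cmd() {
    # The corpus ID should be written in upper case.
    corpus=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    shift
    printf 'cwb-make %s' "$corpus"
    for att in "$@"; do
        printf ' %s' "$att"
    done
    printf '\n'
}

cwb_make_cmd example             # index and compress the whole corpus
cwb_make_cmd example lemma pos   # index only attributes added later
```

Once the printed line looks right, it can be pasted into the shell and run as is.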

cwb-make is a minimal front-end to the CWB::Indexer functionality provided by the CWB::Encoder module, which can also be used directly from a Perl script. See "CWB::Indexer METHODS" in the CWB::Encoder manpage for further information.


--registry=dir, -r dir

Use registry directory dir instead of the standard registry (CWB default or specified by the CORPUS_REGISTRY environment variable).

--memory=n, -M n

Use approx. n megabytes (MiB) of RAM for indexing. The default of 75 MiB is safe even for computers with a small amount of memory or many concurrent users. If more RAM is available, indexing can be sped up considerably by setting a higher memory limit. For instance, -M 500 or -M 1000 is a good choice on a machine with 2 GiB of RAM and a low work load.
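
One possible way to pick a value for -M is sketched below. The "one quarter of total RAM" rule is an assumption made for this example only, not a recommendation from the CWB documentation. The floor of 75 is the documented default, and EXAMPLE is a hypothetical corpus ID.

```shell
#!/bin/sh
# Sketch: derive a value for -M from the total RAM given in MiB.
# Assumption: use a quarter of total RAM, but never less than the
# documented default of 75 MiB. The command is printed, not executed.

suggest_memory() {
    total_mib=$1
    limit=$((total_mib / 4))
    if [ "$limit" -lt 75 ]; then
        limit=75
    fi
    printf '%s\n' "$limit"
}

# A machine with 2 GiB (2048 MiB) of RAM:
printf 'cwb-make -M %s EXAMPLE\n' "$(suggest_memory 2048)"
```

For 2048 MiB this yields -M 512, which is within the 500 to 1000 range suggested above for a lightly loaded machine.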

--validate, -V

Validate newly created data files (index files and compressed corpus data). This is normally not required, as the CWB indexing and compression algorithms have been tested thoroughly by a large user community.

--group=name, -g name
--permissions=ddd, -p ddd

Set group membership (name) and access permissions (octal code ddd) of new data files. If these options are not specified, the system defaults for newly created files are used.
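
Because the permissions value is an octal code, it may be worth checking it before building the command line. The following is a minimal sketch: the helper is_octal_mode, the group name corpora and the corpus ID EXAMPLE are hypothetical, and the three-digit form is assumed from the ddd notation above.

```shell
#!/bin/sh
# Sketch: check that a permissions value is a three-digit octal code
# before passing it to cwb-make. The command is printed, not executed.
# "corpora" and EXAMPLE are hypothetical names.

is_octal_mode() {
    case $1 in
        [0-7][0-7][0-7]) return 0 ;;
        *) return 1 ;;
    esac
}

mode=640
if is_octal_mode "$mode"; then
    printf 'cwb-make --group=%s --permissions=%s EXAMPLE\n' corpora "$mode"
else
    printf 'invalid permissions: %s\n' "$mode" >&2
fi
```

With mode=640 the check passes and the sketch prints a command that would make new data files readable by the group but not by other users.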

--debug, -D

Activate debugging output. Note that this is the only form of progress information provided by cwb-make, so you may want to specify -D simply in order to get some feedback during the indexing process.

--help, -h

Display help page with short usage summary (similar to SYNOPSIS above).


Copyright (C) 2002-2010 Stefan Evert [http::/]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.