This script creates a physical copy of a virtual subcorpus of a CWB-indexed corpus. It is often more convenient to access such a subcorpus as a separately indexed CWB corpus, and may be required for software packages that are not designed to operate on subsets of a corpus. For relatively small subcorpora, working with the physical copy will also be much more efficient.
The virtual subcorpus is a collection of textual units (any s-attribute, specified with option C<-S>). It is defined by a CQP query and consists of all units that contain at least one match of the query. This approach ensures great flexibility, allowing subcorpora to be defined in terms of metadata, lexical items and even grammatical features.
B<cwb-make-subcorpus> automatically copies all positional and structural attributes, adjusting s-attribute regions as needed. In particular, any regions outside the subcorpus are dropped, while regions spanning one or more text units from the subcorpus as well as other material are narrowed down to the subcorpus. The script can also convert the physical copy to a different character encoding, but it is better to use B<cwb-convert-to-utf8> for upgrading corpora to UTF-8 format.
There are some important B<limitations>:
=over 4
=item *
The script does I<not> copy alignment attributes (because it relies on B<cwb-decode>, which cannot handle a-attributes). Any alignments will be absent from the subcorpus.
=item *
Re-encoding to a different character set silently deletes invalid characters, so the content of the physical copy may no longer be identical to the virtual subcorpus.
=back
=head1 ARGUMENTS
=over 4
=item CORPUS
CWB ID of the original corpus
=item SUBCORPUS
New CWB ID for the physical copy to be created
=item datadir
New directory for CWB index files of the physical copy. This directory must not exist yet, unless C<--force> is specified, in which case an existing directory is overwritten.
=item query
A CQP query that identifies text units to be included in the virtual subcorpus. It is usually enclosed in single quotation marks on the command line to protect it from the shell.
=back
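As an illustration of how the four arguments fit together, here is a minimal sketch of an invocation. It assumes that options precede the arguments and that the arguments are given in the order listed above. The corpus IDs, the data directory and the C<lemma> attribute used in the query are hypothetical placeholders, not names taken from a real installation.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute your own corpus IDs, path and query.
corpus="MYCORPUS"                            # CWB ID of the original corpus
subcorpus="MYCORPUS_WHALE"                   # new CWB ID for the physical copy
datadir="$HOME/corpora/data/mycorpus_whale"  # must not exist yet (unless --force)
query='[lemma = "whale"]'                    # single quotes protect the query from the shell

if command -v cwb-make-subcorpus >/dev/null 2>&1; then
    # Keep every <text> region that contains at least one match of the query.
    cwb-make-subcorpus --by=text --verbose "$corpus" "$subcorpus" "$datadir" "$query"
else
    # Script not installed: only show the command that would have been run.
    echo "cwb-make-subcorpus not found in PATH; the command would be:" >&2
    echo "cwb-make-subcorpus --by=text --verbose $corpus $subcorpus $datadir '$query'" >&2
fi
```

Since C<text> is the documented default for C<--by>, the option is spelled out here only for clarity.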
=head1 OPTIONS
=over 4
=item --registry=I<dir>, -r I<dir>
Search for the original corpus in registry directory I<dir> rather than in the default registry path.
=item --output-registry=I<dir>, -or I<dir>
Create registry entry for the new corpus in I<dir>. [default: same registry directory as the original corpus]
=item --by=I<att>, -S I<att>
S-attribute defining basic textual units for the virtual subcorpus, which consists of all such units that contain at least one match of the CQP query. [default: C<text>]
=item --charset=I<enc>, -C I<enc>
Character encoding of the physical copy. Any of the character encodings supported by CWB 3.5 may be specified. If different from the character encoding of the original corpus, all attributes will automatically be converted, silently deleting invalid characters. [default: same as original corpus]
=item --memory=I<n>, -M I<n>
Allow B<cwb-make> to use approx. I<n> MBytes of RAM for indexing.
=item --force, -f
Silently overwrite an existing registry entry and/or data directory. Use with caution, as this will remove all files from an existing data directory.
=item --verbose, -v
Show a progress message for each individual attribute (recommended for large corpora).
=item --help, -h
Display short help page.
=back
=head1 PREREQUISITES
B<cwb-make-subcorpus> requires a recent version of CWB with special support in the B<cwb-encode> utility, viz. B<CWB v3.5.0> or newer. If you have installed multiple CWB releases on your computer, make sure that the CWB/Perl modules are configured to use an appropriate CWB version.
For efficiency reasons, character encodings are converted with the external B<iconv> utility, which must be installed somewhere in the system path. Your version of B<iconv> must support the command-line options C<-f> (source encoding), C<-t> (target encoding) and C<-c> (ignore conversion errors); it also needs to understand CWB-style encoding names such as C<utf8> and C<latin1>. Suitable versions of B<iconv> are provided e.g. by Linux and Mac OS X.
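The B<iconv> requirements above can be checked before running the script. The following is a rough sketch, not part of B<cwb-make-subcorpus> itself: it feeds two small test strings through B<iconv> using only the options named above, and the variable names are chosen for illustration. The second step also shows the effect described under the limitations, namely that unconvertible characters disappear from the output.

```shell
#!/bin/sh
# 1. Does iconv accept -c, -f, -t and the CWB-style names latin1 / utf8?
#    Input is "caf" followed by byte 0xE9 (Latin-1 e-acute), written in octal.
if converted=$(printf 'caf\351' | iconv -c -f latin1 -t utf8 2>/dev/null); then
    iconv_ok=yes
else
    iconv_ok=no
fi
echo "iconv accepts -c -f latin1 -t utf8: $iconv_ok"

# 2. Effect of -c on unconvertible input: the euro sign (UTF-8 bytes E2 82 AC)
#    has no Latin-1 equivalent, so it is omitted from the output.
#    Some iconv versions report a non-zero exit status in this case even though
#    output is produced, so the status is deliberately ignored here.
dropped=$(printf 'a\342\202\254b' | iconv -c -f utf8 -t latin1 2>/dev/null || true)
echo "after omitting unconvertible characters: $dropped"
```

If the first check reports C<no>, the installed B<iconv> is probably unsuitable for use with the C<--charset> option.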