NAME

dcdb-create.perl - create a DiaColloDB diachronic collocation database

SYNOPSIS

 dcdb-create.perl [OPTIONS] [INPUT(s)...]

 General Options:
   -help                ##-- this help message
   -version             ##-- report version information and exit
   -jobs NJOBS          ##-- number of threads for corpus compilation (default=-1: all cores)
   -xs , -pp            ##-- do/don't use fast XS implementations where available (default=if available)

 Corpus Options:
   -list , -nolist      ##-- INPUT(s) are/aren't file-lists (default=no)
   -glob , -noglob      ##-- do/don't glob INPUT(s) argument(s) (default=do)
   -union, -nounion     ##-- do/don't treat INPUT(s) as DB directories to be merged (default=don't)
   -lazy , -nolazy      ##-- do/don't create "lazy" list-client (union mode only; default=don't)
   -dclass CLASS        ##-- set corpus document class (default=DDCTabs)
   -dopt OPT=VAL        ##-- set corpus document option, e.g.
                        ##   eosre=EOSRE  # eos regex (default='^$')
                        ##   foreign=BOOL # disable D*-specific heuristics
   -bysent              ##-- track collocations by sentence (default)
   -byparagraph         ##-- track collocations by paragraph
   -bypage              ##-- track collocations by page
   -bydoc               ##-- track collocations by document

 Indexing Options:
   -attrs ATTRS         ##-- select index attributes (default=l,p)
                        ##   known attributes: l, p, w, doc.title, ...
   -use-all-the-data    ##-- disable default frequency- and regex-filters
   -64bit               ##-- use 64-bit quads where available
   -32bit               ##-- use 32-bit integers where available
   -dmax DIST           ##-- maximum distance for indexed co-occurrences (default=5)
   -tfmin TFMIN         ##-- minimum global term frequency (default=2)
   -lfmin LFMIN         ##-- minimum global lemma frequency (default=undef:tfmin)
   -cfmin CFMIN         ##-- minimum relation co-occurrence frequency (default=2)
   -[no]tdf             ##-- do/don't create (term x document) index relation (default=if available)
   -tdf-dbreak BREAK    ##-- set tdf matrix "document" granularity (e.g. s,p,page,file; default=file)
   -tdf-fmin VFMIN      ##-- set minimum tdf term frequency (default=undef: TFMIN)
   -tdf-dfmin VDFMIN    ##-- set minimum tdf term "document"-frequency (default=4)
   -tdf-nmin VNMIN      ##-- set minimum number of content tokens per tdf "document" (default=8)
   -tdf-nmax VNMAX      ##-- set maximum number of content tokens per tdf "document" (default=inf)
   -tdf-option OPT=VAL  ##-- set arbitrary tdf matrix option, e.g.
                        ##   minFreq=INT            # minimum term frequency (default=undef: use TFMIN)
                        ##   minDocFreq=INT         # minimum term document-"frequency" (default=4)
                        ##   minDocSize=INT         # minimum document size (#/terms) (default=4)
                        ##   maxDocSize=INT         # maximum document size (#/terms) (default=inf)
                        ##   mgood=REGEX            # positive regex for document-level metadata
                        ##   mbad=REGEX             # negative regex for document-level metadata
   -option OPT=VAL      ##-- set arbitrary DiaColloDB option, e.g.
                        ##   pack_id=PACKFMT        # pack-format for IDs
                        ##   pack_f=PACKFMT         # pack-format for frequencies
                        ##   pack_date=PACKFMT      # pack-format for dates
                        ##   (p|w|l)good=REGEX      # positive regex for (postags|words|lemmata)
                        ##   (p|w|l)bad=REGEX       # negative regex for (postags|words|lemmata)
                        ##   (p|w|l)goodfile=FILE   # positive list-file for (postags|words|lemmata)
                        ##   (p|w|l)badfile=FILE    # negative list-file for (postags|words|lemmata)
                        ##   ddcServer=HOST:PORT    # server for ddc relations
                        ##   ddcTimeout=SECONDS     # timeout for ddc relations

 I/O and Logging Options:
   -log-level LEVEL     ##-- set log-level (default=TRACE)
   -log-option OPT=VAL  ##-- set log option (e.g. logdate, logtime, file, syslog, stderr, ...)
   -[no]keep            ##-- do/don't keep temporary files (default=don't)
   -[no]mmap            ##-- do/don't use mmap for file access (default=do)
   -[no]debug           ##-- do/don't enable painful debugging checks (default=don't)
   -[no]times           ##-- do/don't report operation timing (default=do)
   -output OUT          ##-- output directory or client configuration file (required)

 Environment Variables:
   DIACOLLO_SORT        ##-- system sort command prefix
   SORT                 ##-- fallback for DIACOLLO_SORT

DESCRIPTION

dcdb-create.perl compiles a DiaColloDB diachronic collocation database from a tokenized and annotated input corpus, or merges multiple existing DiaColloDB databases into a single database directory. The resulting database can be queried with the dcdb-query.perl(1) script, or wrapped into a web-service with the help of the DiaColloDB::WWW utilities, which see for details.

OPTIONS AND ARGUMENTS

Arguments

INPUT(s)

File(s), glob(s), or file-list(s) to be indexed, or existing indices to be merged. Interpretation depends on the -glob, -list, -union, and -lazy options.

General Options

-help

Display a brief help message and exit.

-version

Display version information and exit.

-jobs NJOBS

Run NJOBS parallel compilation threads. If NJOBS is 0, only a single thread is used. The default value (-1) runs as many jobs as there are cores on the (unix/linux) system; see "nJobs" in DiaColloDB::Utils for details. Also sets the environment variable OMP_NUM_THREADS after interpreting the NJOBS request.
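The NJOBS interpretation described above can be sketched in shell as follows. This is only an illustration of the documented behavior, not the actual "nJobs" code from DiaColloDB::Utils; the function name resolve_njobs is hypothetical, and using getconf to count cores is an assumption.

```shell
# Sketch of the documented NJOBS interpretation (NOT the DiaColloDB::Utils implementation).
# resolve_njobs is a hypothetical helper name used for illustration only.
resolve_njobs() {
  case "$1" in
    0)  echo 1 ;;                                           # 0: a single thread
    -1) getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1 ;;  # -1: one job per core (assumed core count source)
    *)  echo "$1" ;;                                        # anything else: use the value as given
  esac
}

echo "effective jobs for -jobs 0:  $(resolve_njobs 0)"
echo "effective jobs for -jobs 4:  $(resolve_njobs 4)"
echo "effective jobs for -jobs -1: $(resolve_njobs -1)"
```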

Corpus Options

Input corpora can be either "raw" corpora using the default DiaColloDB::Corpus class or a single "pre-compiled" corpus directory using the DiaColloDB::Corpus::Compiled conventions as created by the dcdb-corpus-compile.perl(1) script.

If a pre-compiled input corpus directory is specified, only the corpus content filters pre-compiled into the corpus itself are used, and the corpus content filter options to this script (-Opgood=REGEX etc.) will have no effect. For "raw" input corpora, a temporary DiaColloDB::Corpus::Compiled object will be created and the DiaColloDB::Corpus::Filters options to this script should be honored.

-list
-nolist

Do/don't treat INPUT(s) as file-lists rather than corpus data files or pre-compiled corpus directories. Default=don't.

-glob
-noglob

Do/don't expand wildcards in INPUT(s). Has no effect for pre-compiled corpus directories. Default=do.
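As a rough illustration of the two input styles, the sketch below passes either a quoted wildcard or a file-list. All paths are hypothetical, and the file-list is assumed to hold one path per line, which this page does not specify. Quoting the wildcard keeps the shell from expanding it, so the expansion is left to the script (the -glob default). The commands are only printed unless RUN is set to the empty string.

```shell
RUN=${RUN-echo}   # dry run by default: prints the command; invoke with RUN= to execute it

# (a) Quoted wildcard, expanded by dcdb-create.perl itself since -glob is the default.
index_by_glob() { $RUN dcdb-create.perl -output "$1" "$2"; }

# (b) File-list input; the list format (one path per line) is an ASSUMPTION.
index_by_list() { $RUN dcdb-create.perl -list -output "$1" "$2"; }

index_by_glob corpus.d 'corpus/*.tabs'   # hypothetical paths
index_by_list corpus.d corpus.files
```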

-union
-nounion

Do/don't treat INPUT(s) as DB directories to be merged. Creates a new physical DB by merging data from the argument INPUT(s). Default=don't.

-lazy
-nolazy

Enable/disable "lazy union" mode. If enabled, INPUT(s) are treated as DB URLs to be merged "lazily", and only a simple DiaColloDB::Client::list configuration file OUT is created, suitable for passing to dcdb-query.perl as rcfile://OUT. User options specified with -option OPT=VAL will clobber the DiaColloDB::Client::list defaults (e.g. fudge, fork, etc.). Unlike -union mode, no physical DB is created in -lazy mode; queries to the lazy client are deferred to the underlying DB URLs specified in the configuration file. The lazy configuration should behave like a physical DB created with -union, can be created in near constant time, requires only a few bytes of disk space, and may even process queries faster than a physical DB if you have the threads module installed.

Default=off.

Aliases: -lazy-union, -list-union, -lu
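The difference between a physical and a lazy union might look like this on the command line. The DB directory names and output names are hypothetical. The lazy variant passes both -union and -lazy because the synopsis describes -lazy as applying in union mode only; whether -lazy alone would suffice is not stated here. The commands are only printed unless RUN is set to the empty string.

```shell
RUN=${RUN-echo}   # dry run by default: prints the command; invoke with RUN= to execute it

# Physical union: merges the input DBs into a new DB directory.
merge_physical() { out=$1; shift; $RUN dcdb-create.perl -union -output "$out" "$@"; }

# Lazy union: writes only a small client configuration file pointing at the inputs.
merge_lazy() { out=$1; shift; $RUN dcdb-create.perl -union -lazy -output "$out" "$@"; }

merge_physical merged.d  1900s.d 2000s.d   # hypothetical DB directories
merge_lazy     merged.rc 1900s.d 2000s.d
```

The resulting merged.rc would then be the OUT in rcfile://OUT as described above.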

-dclass CLASS

Set corpus document class (default=DDCTabs) for raw (i.e. not pre-compiled) corpora. See "SUBCLASSES" in DiaColloDB::Document for a list of supported input formats. If you are using the default DDCTabs document class on your own (non-D*) corpus, you may also want to specify -dopt foreign=1.

Has no effect for pre-compiled corpus directory INPUT(s).

-dopt OPT=VAL

Set corpus document option for raw (i.e. not pre-compiled) corpora, e.g. -dopt eosre=EOSRE sets the end-of-sentence regex for the default DDCTabs document class, and -dopt foreign=1 disables D*-specific hacks.

Potentially dangerous for pre-compiled corpus directory INPUT(s).

Aliases: -document-option, -docoption, -dO
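A minimal sketch of the document options for a raw, non-D* corpus in the default DDCTabs format follows. Paths are hypothetical, and each invocation uses a single -dopt, since this page does not say whether the option may be repeated. The eosre value shown is the documented default, a regex matching an empty line. The commands are only printed unless RUN is set to the empty string.

```shell
RUN=${RUN-echo}   # dry run by default: prints the command; invoke with RUN= to execute it

# Own (non-D*) corpus in DDCTabs format: disable the D*-specific heuristics.
index_foreign() {
  $RUN dcdb-create.perl -dclass DDCTabs -dopt foreign=1 -output "$1" "$2"
}

# Explicit end-of-sentence regex; '^$' is the documented default.
index_with_eosre() {
  $RUN dcdb-create.perl -dclass DDCTabs -dopt 'eosre=^$' -output "$1" "$2"
}

index_foreign    corpus.d 'corpus/*.tabs'   # hypothetical paths
index_with_eosre corpus.d 'corpus/*.tabs'
```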

-bysent

Track collocations by sentence (default). Has no effect for pre-compiled corpus directory INPUT(s).

-byparagraph

Track collocations by paragraph. Has no effect for pre-compiled corpus directory INPUT(s).

-bypage

Track collocations by page. Has no effect for pre-compiled corpus directory INPUT(s).

-bydoc

Track collocations by document. Has no effect for pre-compiled corpus directory INPUT(s).

Indexing Options

-attrs ATTRS

Select attributes to be indexed (default=l,p). Known attributes include l, p, w, doc.title, doc.author, etc.

-use-all-the-data

Disables default frequency- and regex-based pruning filter options, inspired by Mark Lauersdorf; equivalent to:

 -tfmin=0 \
 -lfmin=0 \
 -cfmin=0 \
 -tdf-tfmin=0 \
 -tdf-dfmin=0 \
 -tdf-nmin=0 \
 -tdf-nmax=inf \
 -O=pgood='' -O=pgoodfile='' \
 -O=wgood='' -O=wgoodfile='' \
 -O=lgood='' -O=lgoodfile='' \
 -O=pbad='' -O=pbadfile='' \
 -O=wbad='' -O=wbadfile='' \
 -O=lbad='' -O=lbadfile='' \
 -tO=mgood='' \
 -tO=mbad=''

Corpus content filters (pgood, pgoodfile, ..., lbad, lbadfile) have no effect for pre-compiled corpus directory INPUT(s).

Aliases: -all, -noprune, -nofilters, -F

-64bit

Use 64-bit quads to index integer IDs where available.

-32bit

Use 32-bit integers where available (default).

-dmax DIST

Specify maximum distance for indexed co-occurrences (default=5).

-tfmin TFMIN

Specify minimum global term frequency (default=2). A "term" in this sense is an n-tuple of indexed attributes not including the "date" component.
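To make the notion of a global term frequency concrete, here is a toy illustration in shell and awk. It is not DiaColloDB code: the three-column input (date, lemma, part-of-speech tag) is a made-up format, and the indexed attributes are taken to be the default l,p. Terms are counted across all dates, and only those reaching TFMIN are kept.

```shell
TFMIN=2

# Toy tokens: date<TAB>lemma<TAB>pos (hypothetical format, for illustration only).
tokens=$(printf '%s\t%s\t%s\n' \
  1900 Haus  NN \
  1950 Haus  NN \
  1950 gehen VV \
  2000 Baum  NN)

# Count each (lemma,pos) term while ignoring the date column, then apply the threshold.
kept=$(printf '%s\n' "$tokens" | awk -F'\t' -v min="$TFMIN" '
  { f[$2 FS $3]++ }
  END { for (t in f) if (f[t] >= min) print t "\t" f[t] }' | sort)

printf '%s\n' "$kept"
```

Only the term (Haus, NN) survives here, since it occurs twice overall even though it occurs just once per date.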

-lfmin LFMIN

Specify minimum global lemma frequency (default=undef:TFMIN).

-cfmin CFMIN

Specify minimum relation co-occurrence frequency (default=2).

-[no]tdf

Do/don't create (term x document) index relation (default=if available).

-tdf-dbreak BREAK

Set tdf matrix "document" granularity (e.g. s,p,page,file; default=file).

-tdf-fmin VFMIN

Set minimum tdf term frequency (default=undef: use TFMIN).

-tdf-dfmin VDFMIN

Set minimum term document-"frequency" (default=4).

-tdf-nmin VNMIN

Set minimum number of content tokens per tdf "document" (default=8).

-tdf-nmax VNMAX

Set maximum number of content tokens per tdf "document" (default=inf).

-tdf-option OPT=VAL

Set arbitrary tdf matrix option, e.g.

 minFreq=INT            # -tdf-fmin: minimum term frequency
 minDocFreq=INT         # -tdf-dfmin: minimum term document-"frequency"
 minDocSize=INT         # -tdf-nmin: minimum document size (#/terms)
 maxDocSize=INT         # -tdf-nmax: maximum document size (#/terms)
 mgood=REGEX            # positive regex for document-level metadata
 mbad=REGEX             # negative regex for document-level metadata

Alias: -tO
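Going by the correspondences listed above, the dedicated -tdf-* flags and the generic -tO form should be two spellings of the same settings. The sketch below shows both; that they are exactly equivalent is an assumption, and the paths and threshold values are hypothetical. The commands are only printed unless RUN is set to the empty string.

```shell
RUN=${RUN-echo}   # dry run by default: prints the command; invoke with RUN= to execute it

# Paragraph-level tdf "documents", configured with the dedicated flags.
tdf_dedicated() {
  $RUN dcdb-create.perl -tdf-dbreak p -tdf-dfmin 2 -tdf-nmin 4 -output "$1" "$2"
}

# The same thresholds spelled as generic tdf options (assumed equivalent).
tdf_generic() {
  $RUN dcdb-create.perl -tdf-dbreak p -tO minDocFreq=2 -tO minDocSize=4 -output "$1" "$2"
}

tdf_dedicated corpus.d 'corpus/*.tabs'   # hypothetical paths
tdf_generic   corpus.d 'corpus/*.tabs'
```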

-option OPT=VAL

Set arbitrary DiaColloDB index option, e.g.

 pack_id=PACKFMT        # pack-format for IDs
 pack_f=PACKFMT         # pack-format for frequencies
 pack_date=PACKFMT      # pack-format for dates
 (p|w|l)good=REGEX      # (raw input only) positive regex for (postags|words|lemmata)
 (p|w|l)bad=REGEX       # (raw input only) negative regex for (postags|words|lemmata)
 (p|w|l)goodfile=FILE   # (raw input only) positive list-file for (postags|words|lemmata)
 (p|w|l)badfile=FILE    # (raw input only) negative list-file for (postags|words|lemmata)
 ddcServer=HOST:PORT    # server for ddc relations
 ddcTimeout=SECONDS     # timeout for ddc relations

Alias: -O
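A content-filter invocation for a raw corpus could be sketched as below. The part-of-speech pattern depends entirely on the corpus tagset and is only a placeholder, stopwords.txt is a hypothetical list-file whose format is not specified on this page, and the other paths are hypothetical as well. The regex is quoted to protect it from the shell. The command is only printed unless RUN is set to the empty string.

```shell
RUN=${RUN-echo}   # dry run by default: prints the command; invoke with RUN= to execute it

# Keep only tokens whose POS tag matches a (placeholder) pattern,
# and drop word forms listed in a (hypothetical) stopword list-file.
index_filtered() {
  $RUN dcdb-create.perl -O 'pgood=^(N|ADJ)' -O wbadfile=stopwords.txt -output "$1" "$2"
}

index_filtered corpus.d 'corpus/*.tabs'   # hypothetical paths
```

As noted under Corpus Options, these filters have no effect on pre-compiled corpus directories.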

I/O and Logging Options

-log-level LEVEL

Set DiaColloDB::Logger log-level (default=TRACE).

-log-option OPT=VAL

Set arbitrary DiaColloDB::Logger option (e.g. logdate, logtime, file, syslog, stderr, ...).

-[no]keep

Do/don't keep temporary files (default=don't).

-[no]mmap

Do/don't use mmap() for low-level index file access (default=do).

-[no]debug

Do/don't enable painful debugging checks (default=don't).

-[no]times

Do/don't report operation timing (default=do).

-output OUT

Output directory or filename (required).

BUGS AND LIMITATIONS

Probably many.

ACKNOWLEDGEMENTS

Perl by Larry Wall.

AUTHOR

Bryan Jurish <moocow@cpan.org>

SEE ALSO

DiaColloDB(3pm), dcdb-corpus-compile.perl(1), dcdb-info.perl(1), dcdb-query.perl(1), dcdb-export.perl(1), perl(1).