dcdb-create.perl - create a DiaColloDB diachronic collocation database
 dcdb-create.perl [OPTIONS] [INPUT(s)...]

 General Options:
   -help                 ##-- this help message
   -version              ##-- report version information and exit
   -jobs NJOBS           ##-- number of threads for corpus compilation (default=-1: all cores)
   -xs , -pp             ##-- do/don't use fast XS implementations where available (default=if available)

 Corpus Options:
   -list , -nolist       ##-- INPUT(s) are/aren't file-lists (default=no)
   -glob , -noglob       ##-- do/don't glob INPUT(s) argument(s) (default=do)
   -union, -nounion      ##-- do/don't treat INPUT(s) as DB directories to be merged (default=don't)
   -lazy , -nolazy       ##-- do/don't create "lazy" list-client (union mode only; default=don't)
   -dclass CLASS         ##-- set corpus document class (default=DDCTabs)
   -dopt OPT=VAL         ##-- set corpus document option, e.g.
                         ##   eosre=EOSRE  # eos regex (default='^$')
                         ##   foreign=BOOL # disable D*-specific heuristics
   -bysent               ##-- track collocations by sentence (default)
   -byparagraph          ##-- track collocations by paragraph
   -bypage               ##-- track collocations by page
   -bydoc                ##-- track collocations by document

 Indexing Options:
   -attrs ATTRS          ##-- select index attributes (default=l,p)
                         ##   known attributes: l, p, w, doc.title, ...
   -use-all-the-data     ##-- disable default frequency- and regex-filters
   -64bit                ##-- use 64-bit quads where available
   -32bit                ##-- use 32-bit integers where available
   -dmax DIST            ##-- maximum distance for indexed co-occurrences (default=5)
   -tfmin TFMIN          ##-- minimum global term frequency (default=2)
   -lfmin LFMIN          ##-- minimum global lemma frequency (default=undef:tfmin)
   -cfmin CFMIN          ##-- minimum relation co-occurrence frequency (default=2)
   -[no]tdf              ##-- do/don't create (term x document) index relation (default=if available)
   -tdf-dbreak BREAK     ##-- set tdf matrix "document" granularity (e.g. s,p,page,file; default=file)
   -tdf-fmin VFMIN       ##-- set minimum tdf term frequency (default=undef: TFMIN)
   -tdf-dfmin VDFMIN     ##-- set minimum tdf term "document"-frequency (default=4)
   -tdf-nmin VNMIN       ##-- set minimum number of content tokens per tdf "document" (default=8)
   -tdf-nmax VNMAX       ##-- set maximum number of content tokens per tdf "document" (default=inf)
   -tdf-option OPT=VAL   ##-- set arbitrary tdf matrix option, e.g.
                         ##   minFreq=INT    # minimum term frequency (default=undef: use TFMIN)
                         ##   minDocFreq=INT # minimum term document-"frequency" (default=4)
                         ##   minDocSize=INT # minimum document size (#/terms) (default=4)
                         ##   maxDocSize=INT # maximum document size (#/terms) (default=inf)
                         ##   mgood=REGEX    # positive regex for document-level metadata
                         ##   mbad=REGEX     # negative regex for document-level metadata
   -option OPT=VAL       ##-- set arbitrary DiaColloDB option, e.g.
                         ##   pack_id=PACKFMT       # pack-format for IDs
                         ##   pack_f=PACKFMT        # pack-format for frequencies
                         ##   pack_date=PACKFMT     # pack-format for dates
                         ##   (p|w|l)good=REGEX     # positive regex for (postags|words|lemmata)
                         ##   (p|w|l)bad=REGEX      # negative regex for (postags|words|lemmata)
                         ##   (p|w|l)goodfile=FILE  # positive list-file for (postags|words|lemmata)
                         ##   (p|w|l)badfile=FILE   # negative list-file for (postags|words|lemmata)
                         ##   ddcServer=HOST:PORT   # server for ddc relations
                         ##   ddcTimeout=SECONDS    # timeout for ddc relations

 I/O and Logging Options:
   -log-level LEVEL      ##-- set log-level (default=TRACE)
   -log-option OPT=VAL   ##-- set log option (e.g. logdate, logtime, file, syslog, stderr, ...)
   -[no]keep             ##-- do/don't keep temporary files (default=don't)
   -[no]mmap             ##-- do/don't use mmap for file access (default=do)
   -[no]debug            ##-- do/don't enable painful debugging checks (default=don't)
   -[no]times            ##-- do/don't report operating timing (default=do)
   -output OUT           ##-- output directory or client configuration file (required)

 Environment Variables:
   DIACOLLO_SORT         ##-- system sort command prefix
   SORT                  ##-- fallback for DIACOLLO_SORT
dcdb-create.perl compiles a DiaColloDB diachronic collocation database from a tokenized and annotated input corpus, or merges multiple existing DiaColloDB databases into a single database directory. The resulting database can be queried with the dcdb-query.perl(1) script, or wrapped into a web service with the DiaColloDB::WWW utilities; see the DiaColloDB::WWW documentation for details.
File(s), glob(s), file-list(s) to be indexed or existing indices to be merged. Interpretation depends on the -glob, -list, -union, and -lazy options.
Display a brief help message and exit.
Display version information and exit.
Run NJOBS parallel compilation threads. If specified as 0, will run only a single thread. The default value (-1) will run as many jobs as there are cores on the (unix/linux) system; see "nJobs" in DiaColloDB::Utils for details. Also sets the environment variable OMP_NUM_THREADS after interpreting the NJOBS request.
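The interpretation of NJOBS described above can be sketched in plain shell. This is only an illustration of the documented behaviour, not the actual "nJobs" code in DiaColloDB::Utils; the function name resolve_jobs and the use of getconf to count cores are assumptions made for the example.

```shell
#!/bin/sh
# resolve_jobs NJOBS: hypothetical helper mirroring the documented rules.
resolve_jobs() {
  case "$1" in
    0)  echo 1 ;;   # 0: run only a single thread
    -1) getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1 ;;   # -1: one job per core (assumed core count lookup)
    *)  echo "$1" ;;   # any other value: use as given
  esac
}

# The script is documented to set OMP_NUM_THREADS after interpreting NJOBS.
njobs=$(resolve_jobs -1)
OMP_NUM_THREADS=$njobs
export OMP_NUM_THREADS
printf 'jobs=%s\n' "$njobs"
```

On the command line, the request itself is just `-jobs NJOBS`; the resolution happens inside the script.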
Input corpora can be either "raw" corpora using the default DiaColloDB::Corpus class or a single "pre-compiled" corpus directory using the DiaColloDB::Corpus::Compiled conventions as created by the dcdb-corpus-compile.perl(1) script.
If a pre-compiled input corpus directory is specified, only the corpus content filters pre-compiled into the corpus itself are used, and the corpus content filter options to this script (-Opgood=REGEX etc.) will have no effect. For "raw" input corpora, a temporary DiaColloDB::Corpus::Compiled object will be created and the DiaColloDB::Corpus::Filters options to this script should be honored.
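As a rough illustration of the "raw" case, the following sketch assembles a command line that passes a content filter to the script. The input glob corpus/*.tabs, the output directory corpus.d, and the regex ^N are hypothetical; the command is printed rather than executed so the sketch can be tried without a corpus at hand.

```shell
#!/bin/sh
# Raw corpus input: content filters given to this script should be honored.
# The glob is quoted so that dcdb-create.perl (not the shell) expands it.
# Hypothetical names: corpus/*.tabs, corpus.d, and the part-of-speech regex ^N.
set -- dcdb-create.perl -option 'pgood=^N' -output corpus.d 'corpus/*.tabs'
cmd="$*"

# Dry run: print the command; replace this line with "$@" to execute it.
printf '%s\n' "$cmd"
```

Given a pre-compiled corpus directory as INPUT instead, the same pgood option would have no effect, since only the filters compiled into the corpus are used.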
Do/don't treat INPUT(s) as file-lists rather than corpus data files or pre-compiled corpus directories. Default=don't.
Do/don't expand wildcards in INPUT(s). Has no effect for pre-compiled corpus directories. Default=do.
Do/don't treat INPUT(s) as DB directories to be merged. Creates a new physical DB by merging data from the argument INPUT(s). Default=don't.
Enable/disable "lazy union" mode. If enabled, INPUT(s) are treated as DB URLs to be merged "lazily", and only a simple DiaColloDB::Client::list configuration file OUT is created, suitable for passing to dcdb-query.perl as rcfile://OUT. User options specified with -option OPT=VAL will clobber the DiaColloDB::Client::list defaults (e.g. fudge, fork, etc.). Unlike -union mode, no physical DB is created in -lazy mode; queries to the lazy client are deferred to the underlying DB URLs specified in the configuration file. The lazy configuration should behave like a physical DB created with -union, can be created in near constant time, requires only a few bytes of disk space, and may even process queries faster than a physical DB if you have the threads module installed.
Default=off.
Aliases: -lazy-union, -list-union, -lu
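The difference between the two merge modes might look as follows on the command line. The database names db1.d and db2.d and the output names merged.d and merged.rc are hypothetical, and passing bare directory paths as DB URLs in -lazy mode is an assumption; both commands are printed rather than executed.

```shell
#!/bin/sh
# Physical union: merge two existing DB directories into a new DB directory.
set -- dcdb-create.perl -union -output merged.d db1.d db2.d
union_cmd="$*"

# Lazy union: write only a small list-client configuration file.
# Per the description above, the result is usable as rcfile://merged.rc
# when querying with dcdb-query.perl.
set -- dcdb-create.perl -lazy -output merged.rc db1.d db2.d
lazy_cmd="$*"

# Dry run: print both commands instead of running them.
printf '%s\n' "$union_cmd" "$lazy_cmd"
```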
Set corpus document class (default=DDCTabs) for raw (i.e. not pre-compiled) corpora. See "SUBCLASSES" in DiaColloDB::Document for a list of supported input formats. If you are using the default DDCTabs document class on your own (non-D*) corpus, you may also want to specify -dopt foreign=1.
Has no effect for pre-compiled corpus directory INPUT(s).
Set corpus document option for raw (i.e. not pre-compiled) corpora, e.g. -dopt eosre=EOSRE sets the end-of-sentence regex for the default DDCTabs document class, and -dopt foreign=1 disables D*-specific hacks.
Potentially dangerous for pre-compiled corpus directory INPUT(s).
Aliases: -document-option, -docoption, -dO
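For a non-D* corpus in the default DDCTabs format, the document class and document options could be combined as in this sketch. The paths corpus/*.tabs and corpus.d are hypothetical, and the eosre value shown is simply the documented default, spelled out for illustration.

```shell
#!/bin/sh
# Raw DDCTabs corpus from outside the D* ecosystem:
#   -dopt foreign=1   disables D*-specific heuristics
#   -dopt eosre=^$    end-of-sentence regex (the documented default)
# Hypothetical names: corpus/*.tabs and corpus.d.
set -- dcdb-create.perl -dclass DDCTabs -dopt foreign=1 -dopt 'eosre=^$' \
       -output corpus.d 'corpus/*.tabs'
cmd="$*"

# Dry run: print the command; replace this line with "$@" to execute it.
printf '%s\n' "$cmd"
```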
Track collocations by sentence (default). Has no effect for pre-compiled corpus directory INPUT(s).
Track collocations by paragraph. Has no effect for pre-compiled corpus directory INPUT(s).
Track collocations by page. Has no effect for pre-compiled corpus directory INPUT(s).
Track collocations by document. Has no effect for pre-compiled corpus directory INPUT(s).
Select attributes to be indexed (default=l,p). Known attributes include l, p, w, doc.title, doc.author, etc.
Disables default frequency- and regex-based pruning filter options, inspired by Mark Lauersdorf; equivalent to:
 -tfmin=0 \
 -lfmin=0 \
 -cfmin=0 \
 -tdf-tfmin=0 \
 -tdf-dfmin=0 \
 -tdf-nmin=0 \
 -tdf-nmax=inf \
 -O=pgood='' -O=pgoodfile='' \
 -O=wgood='' -O=wgoodfile='' \
 -O=lgood='' -O=lgoodfile='' \
 -O=pbad=''  -O=pbadfile='' \
 -O=wbad=''  -O=wbadfile='' \
 -O=lbad=''  -O=lbadfile='' \
 -tO=mgood='' \
 -tO=mbad=''
Corpus content filters (pgood, pgoodfile, ..., lbad, lbadfile) have no effect for pre-compiled corpus directory INPUT(s).
Aliases: -all, -noprune, -nofilters, -F
Use 64-bit quads to index integer IDs where available.
Use 32-bit integers where available (default).
Specify maximum distance for indexed co-occurrences (default=5).
Specify minimum global term frequency (default=2). A "term" in this sense is an n-tuple of indexed attributes not including the "date" component.
Specify minimum global lemma frequency (default=undef:TFMIN).
Specify minimum relation co-occurrence frequency (default=2).
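The indexing thresholds above are ordinary command-line options and can be combined freely. The following sketch uses arbitrary illustrative values and hypothetical paths (corpus/*.tabs, corpus.d); it prints the command instead of running it.

```shell
#!/bin/sh
# Index lemma, part-of-speech and word attributes, widen the co-occurrence
# window to 10 and raise the frequency thresholds.
# All numeric values and paths here are illustrative, not recommendations.
set -- dcdb-create.perl -attrs l,p,w -dmax 10 -tfmin 5 -cfmin 3 \
       -output corpus.d 'corpus/*.tabs'
cmd="$*"

# Dry run: print the command; replace this line with "$@" to execute it.
printf '%s\n' "$cmd"
```

Since -lfmin defaults to TFMIN, raising -tfmin as shown also raises the effective lemma threshold unless -lfmin is given explicitly.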
Do/don't create (term x document) index relation (default=if available).
Set tdf matrix "document" granularity (e.g. s,p,page,file; default=file).
Set minimum tdf term frequency (default=undef: use TFMIN).
Set minimum term document-"frequency" (default=4).
Set minimum number of content tokens per tdf "document" (default=8).
Set maximum number of content tokens per tdf "document" (default=inf).
Set arbitrary tdf matrix option, e.g.
 minFreq=INT     # -tdf-fmin: minimum term frequency
 minDocFreq=INT  # -tdf-dfmin: minimum term document-"frequency"
 minDocSize=INT  # -tdf-nmin: minimum document size (#/terms)
 maxDocSize=INT  # -tdf-nmax: maximum document size (#/terms)
 mgood=REGEX     # positive regex for document-level metadata
 mbad=REGEX      # negative regex for document-level metadata
Alias: -tO
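Going by the correspondence listed above, the dedicated -tdf-* options and the generic -tdf-option form should be interchangeable. The sketch below builds both spellings of the same request; the values and the paths corpus/*.tabs and corpus.d are hypothetical, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Paragraph-level tdf "documents" with at least 20 content tokens each,
# written with the dedicated options ...
set -- dcdb-create.perl -tdf -tdf-dbreak p -tdf-nmin 20 \
       -output corpus.d 'corpus/*.tabs'
short_cmd="$*"

# ... and with the generic option form (minDocSize corresponds to -tdf-nmin).
set -- dcdb-create.perl -tdf -tdf-dbreak p -tdf-option minDocSize=20 \
       -output corpus.d 'corpus/*.tabs'
generic_cmd="$*"

# Dry run: print both commands instead of running them.
printf '%s\n' "$short_cmd" "$generic_cmd"
```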
Set arbitrary DiaColloDB index option, e.g.
 pack_id=PACKFMT       # pack-format for IDs
 pack_f=PACKFMT        # pack-format for frequencies
 pack_date=PACKFMT     # pack-format for dates
 (p|w|l)good=REGEX     # (raw input only) positive regex for (postags|words|lemmata)
 (p|w|l)bad=REGEX      # (raw input only) negative regex for (postags|words|lemmata)
 (p|w|l)goodfile=FILE  # (raw input only) positive list-file for (postags|words|lemmata)
 (p|w|l)badfile=FILE   # (raw input only) negative list-file for (postags|words|lemmata)
 ddcServer=HOST:PORT   # server for ddc relations
 ddcTimeout=SECONDS    # timeout for ddc relations
Alias: -O
Set DiaColloDB::Logger log-level (default=TRACE).
Set arbitrary DiaColloDB::Logger option (e.g. logdate, logtime, file, syslog, stderr, ...).
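Since the default log-level is TRACE, a quieter run may be preferable for large corpora. The sketch below is an assumption-laden example: the level name INFO and the file=FILENAME value format are not confirmed by this page and should be checked against DiaColloDB::Logger; the paths create.log, corpus.d, and corpus/*.tabs are hypothetical.

```shell
#!/bin/sh
# Reduce verbosity and also log to a file.
# Assumptions: INFO is an accepted LEVEL, and the "file" log option takes a
# filename as its value. Verify both against DiaColloDB::Logger before use.
set -- dcdb-create.perl -log-level INFO -log-option file=create.log \
       -output corpus.d 'corpus/*.tabs'
cmd="$*"

# Dry run: print the command; replace this line with "$@" to execute it.
printf '%s\n' "$cmd"
```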
Do/don't keep temporary files (default=don't).
Do/don't use mmap() for low-level index file access (default=do).
Do/don't enable painful debugging checks (default=don't).
Do/don't report operating timing (default=do).
Output directory or filename (required).
Probably many.
Perl by Larry Wall.
Bryan Jurish <moocow@cpan.org>
DiaColloDB(3pm), dcdb-corpus-compile.perl(1), dcdb-info.perl(1), dcdb-query.perl(1), dcdb-export.perl(1), perl(1).