NAME

CWB::CQP - Interact with a CQP process running in the background

SYNOPSIS

  use CWB::CQP;

  # start CQP server process in the background
  $cqp = new CWB::CQP;
  $cqp = new CWB::CQP("-r /corpora/registry", "-I /global/init.cqp");

  # check for specified or newer CQP version
  $ok = $cqp->check_version($major, $minor, $beta);

  # activate corpus in managed mode (automatic character encoding conversion)
  $cqp->activate($corpus);

  # execute CQP command (blocking mode) and check for error
  @lines = $cqp->exec($my_cmd);
  unless ($cqp->ok) {
    @cqp_error_message = $cqp->error_message;
    my_error_handler();
  }

  # it's easier to use an automatic error handler
  $cqp->set_error_handler(\&my_error_handler); # user-defined
  $cqp->set_error_handler('die'); # built-in, useful for one-off scripts

  # read TAB-delimited table from count, group, tabulate, ...
  @table = $cqp->exec_rows($my_cmd);

  # run CQP command in background (non-blocking mode)
  $cqp->run($my_cmd);
  if ($cqp->ready) {  # specify optional timeout in seconds
    my $line = $cqp->getline;
    my @fields = $cqp->getrow; # TAB-delimited output
  }
  @lines = $cqp->getlines(10); # reads 10 lines, blocking if necessary

  # execute in query lock mode (to improve security of CGI scripts)
  $cqp->begin_query;
    # execute untrusted CQP queries
  $cqp->end_query;
  
  @lines = $cqp->exec_query($untrusted_query); # convenience wrapper
  
  # dump/undump a named query into/from a table of corpus positions
  @matches = $cqp->dump("Last" [, $from, $to]);
  $cqp->undump("Copy", @matches);  # produces copy of "Last"

  # safely quote regular expressions and literal strings for CQP queries
  $query = $cqp->quote('[0-9]+"-[a-z-]+');      # picks single or double quotes
  $query = $cqp->quote($cqp->quotemeta($word)); # escape all metacharacters

  # activate CQP progress messages during query execution
  $cqp->progress_on;
  $status = $cqp->progress; # after starting CQP command with run()
  ($total, $pass, $n_passes, $msg, $percent) = $cqp->progress_info;
  $cqp->progress_off;

  $cqp->set_progress_handler(\&my_progress_handler); # user-defined handler

  # shut down CQP server (exits gracefully)
  undef $cqp;

DESCRIPTION

A CWB::CQP object represents an instance of the corpus query processor CQP running as a background process. By calling suitable methods on this object, arbitrary CQP commands can be executed and their output can be captured. The STDERR stream of the CQP process is monitored for error messages, which can automatically trigger an error handler.

Every CWB::CQP object has its own CQP background process and communication is fully asynchronous. This enables scripts to perform other actions while a long CQP command is executing, or to run multiple CQP instances in parallel.

In managed mode (enabled with the activate method), the API works consistently with Perl Unicode strings, which are automatically translated to the character encoding of the CWB corpus in the background.

METHODS

The following methods are available:

$cqp = new CWB::CQP;
$cqp = new CWB::CQP '-r /corpora/registry', '-l /data/cqpresults';

Spawn new CQP background process. The object $cqp can then be used to communicate with this CQP instance. Optional arguments of the new method are passed as command-line options to CQP. Use at your own risk.

undef $cqp;

Exit CQP background process gracefully by issuing an exit; command. This is done automatically when the variable $cqp goes out of scope. Note that there may be a slight delay while CWB::CQP waits for the CQP process to terminate.

Do NOT send an exit; command to CQP explicitly (with exec or run). This looks like a program crash to CWB::CQP and will result in immediate termination of the Perl script.

$ok = $cqp->check_version($major, $minor, $beta);

Check for minimum required CQP version, i.e. the background process has to be CQP version $major.$minor.$beta or newer. $minor and $beta may be omitted, in which case they default to 0. Note that the CWB::CQP module automatically checks whether the CQP version is compatible with its own requirements when a new object is created. The check_version method can subsequently be used to check for a more recent release that provides functionality needed by the Perl script.
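
The version check can be wrapped so that a script fails early with a readable message. The following is a minimal sketch: require_cqp_version is a hypothetical helper, not part of CWB::CQP, and it only calls the check_version and version methods described here on an existing $cqp object.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CWB::CQP): abort with a readable message
# unless the CQP process behind $cqp is at least version $major.$minor.$beta.
sub require_cqp_version {
    my ($cqp, $major, $minor, $beta) = @_;
    $minor //= 0;    # same defaults as documented for check_version
    $beta  //= 0;
    return 1 if $cqp->check_version($major, $minor, $beta);
    die sprintf("CQP %d.%d.%d or newer required, but found %s\n",
                $major, $minor, $beta, $cqp->version);
}

# Typical use, given an existing object:
#   require_cqp_version($cqp, 3, 5);
```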

$version_string = $cqp->version;

Returns formatted version string for the CQP background process, e.g. 2.2.99 or 3.5.0.

$cqp->activate($corpus);

Activate $corpus and enable managed mode, i.e. automatic conversion between Perl Unicode strings and the character encoding of the CWB corpus. Conversion works in both directions, so CQP commands and queries must be passed as Perl Unicode strings and all return values are guaranteed to be Perl Unicode strings.

Managed mode simplifies interaction with CWB corpora in different encodings and ensures that Perl string operations are carried out correctly with Unicode character semantics (string length, case conversion for non-ASCII letters, Unicode character classes in regular expressions, etc.).

Possible reasons for using non-managed (raw) mode are: (i) that Latin1-encoded corpora can be processed faster as raw byte sequences; and (ii) that arbitrary byte values can be handled, even if they are not valid Latin1 code points.

NB: Once managed mode has been enabled, always use activate to switch to a different corpus. If the corpus is activated with a plain exec command, the $cqp object will not be notified of changes in character encoding.

Pass undef to disable managed mode and change back to raw byte semantics.

$cqp->run($cmd);

Start a single CQP command $cmd in the background. This method returns immediately. Command output can then be read with the getline, getlines and getrow methods. If asynchronous communication is desired, use ready to check whether output is available.

It is an error to run a new command before the output of the previous command has been read completely.

$num_of_lines = $cqp->ready;
$num_of_lines = $cqp->ready($timeout);

Check if output from current CQP command is available for reading with getline etc., returning the number of lines currently held in the input buffer (possibly including an end-of-output marker line that will not be returned by getline etc.). If there is no active command, returns undef.

The first form of the command returns immediately. The second form waits up to $timeout seconds for CQP output to become available. Use a negative $timeout for blocking mode.
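
As an illustration of non-blocking use, the sketch below combines run, ready and getline into a polling loop. collect_with_polling and its $idle callback are hypothetical names introduced for this example; the loop assumes that a positive return value from ready means getline can return without waiting.

```perl
use strict;
use warnings;

# Hypothetical helper: run $cmd on an existing CWB::CQP object $cqp and
# collect its output, calling the code ref $idle whenever no output has
# arrived within $timeout seconds.
sub collect_with_polling {
    my ($cqp, $cmd, $timeout, $idle) = @_;
    my @lines;
    $cqp->run($cmd);                     # returns immediately
    while (1) {
        if ($cqp->ready($timeout)) {     # wait up to $timeout seconds
            my $line = $cqp->getline;    # a line is already buffered
            last unless defined $line;   # undef marks the end of output
            push @lines, $line;
        }
        else {
            $idle->();                   # do other work while CQP is busy
        }
    }
    return @lines;
}
```

A call such as collect_with_polling($cqp, $my_cmd, 0.5, sub { update_display() }) would keep a (hypothetical) display responsive while a long command executes.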

$line = $cqp->getline;

Read one line of output from the CQP process, blocking if necessary until output becomes available. Returns undef when all output from the current CQP command has been read.

@lines = $cqp->getlines($n);

Read $n lines of output from the CQP process, blocking as long as necessary. An explicit undef element is included at the end of the output of a CQP command. Note that getlines may return fewer than $n lines if the end of output is reached.

Set $n = 0 to read all complete lines currently held in the input buffer (as indicated by the ready method), or specify a negative value to read the complete output of the active CQP command.
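
Large outputs can be processed in batches rather than held in memory all at once. This is a sketch only: process_in_batches and its $callback argument are hypothetical, and the code relies solely on the documented behaviour that the final batch ends with an undef element.

```perl
use strict;
use warnings;

# Hypothetical helper: run $cmd and pass its output to the code ref
# $callback in batches of at most $batch_size lines.
sub process_in_batches {
    my ($cqp, $cmd, $batch_size, $callback) = @_;
    $cqp->run($cmd);
    while (1) {
        my @lines = $cqp->getlines($batch_size);
        # the end of output is signalled by a trailing undef element;
        # an empty batch is treated as the end as well, to be safe
        my $done = !@lines || !defined $lines[-1];
        pop @lines if @lines && !defined $lines[-1];
        $callback->(@lines) if @lines;
        last if $done;
    }
    return;
}
```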

@lines = $cqp->exec($cmd);

A convenience function that executes CQP command $cmd, waits for it to complete, and returns all lines of output from the command.

Fully equivalent to the following two commands, except that the trailing undef returned by getlines is not included in the output:

  $cqp->run($cmd);
  @lines = $cqp->getlines(-1);

@fields = $cqp->getrow;
@rows = $cqp->exec_rows($cmd);

Convenience functions for reading TAB-delimited tables, which are generated by CQP commands such as count, group, tabulate and show cd.

getrow returns a single row of output, split into TAB-delimited fields. If the active CQP command has completed, it returns an empty list.

exec_rows executes the CQP command $cmd, waits for it to complete, and then returns the TAB-delimited table as an array of array references. You can then use multiple indices to access a specific element of the table, e.g. $rows[41][2] for the third column of the 42nd row.
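
To make the array-of-array-references layout concrete, here is a small sketch that works on sample data instead of a live CQP process. The sample rows and the table_column helper are invented for illustration; real column contents depend on the CQP command.

```perl
use strict;
use warnings;

# Hypothetical helper: return column $col (0-based) of a table in the
# array-of-array-references format produced by exec_rows.
sub table_column {
    my ($col, @rows) = @_;
    return map { $_->[$col] } @rows;
}

# Sample data standing in for the return value of
#   my @rows = $cqp->exec_rows($cmd);
my @rows = (["the", 120], ["of", 75], ["and", 60]);

my $cell  = $rows[1][0];               # first column of the second row
my @words = table_column(0, @rows);    # all values from the first column
print "$cell: @words\n";               # prints "of: the of and"
```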

$cqp->begin_query;
$cqp->end_query;

Enter/exit query lock mode for safe execution of CQP queries entered by an untrusted user (e.g. from a Web interface). In query lock mode, all interactive CQP commands are temporarily disabled; in particular, it is impossible to access files or execute shell commands from CQP.
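
When an error handler may die while an untrusted query is running, a script might want to make sure that query lock mode is always left again. The sketch below shows one way to do this; locked_exec is a hypothetical helper, and it assumes that calling end_query after a failed query is acceptable.

```perl
use strict;
use warnings;

# Hypothetical helper: execute $query in query lock mode and leave
# query lock mode even if exec (or an error handler) dies.
sub locked_exec {
    my ($cqp, $query) = @_;
    $cqp->begin_query;
    my @lines = eval { $cqp->exec($query) };
    my $err = $@;            # save the exception before calling anything else
    $cqp->end_query;         # always leave query lock mode
    die $err if $err;        # re-throw the original exception
    return @lines;
}
```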

@lines = $cqp->exec_query($query);

Convenience function to execute a CQP query $query in safe query lock mode, wait for it to complete, and return its output as a list of lines.

Fully equivalent to the following sequence:

  $cqp->begin_query;
  @lines = $cqp->exec($query);
  $cqp->end_query;

@table = $cqp->dump($named_query [, $from, $to]);

Dump a named query result $named_query (or a part of it ranging from line $from to line $to) into a table of corpus positions, where each row corresponds to one match of the query. The table always has four columns for match, matchend, target and keyword positions, some of which may be -1 (undefined).

This function is a wrapper around the CQP command dump $named_query $from $to; provided for symmetry with the undump command.

$cqp->undump($named_query, @table);

Upload a table of corpus positions to a named query result in CQP. @table must be an array of array references, with two, three or four columns (where the third and fourth column hold target and keyword anchors, respectively). All rows in @table must have the same number of columns. Use -1 for undefined anchor values.

This method is not just a trivial wrapper around CQP's undump command. It stores the data in an appropriate format in a temporary disk file, and determines the correct form of the CQP command based on the number of columns in the table.
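
Together, dump and undump allow a query result to be filtered in Perl and stored back under a new name. The following sketch keeps only matches of a minimum length; copy_long_matches is a hypothetical helper, and the length computation assumes that matchend is the position of the last token of a match.

```perl
use strict;
use warnings;

# Hypothetical helper: copy those matches of the named query $source that
# span at least $min_len tokens into a new named query $dest.
# Returns the number of matches kept.
sub copy_long_matches {
    my ($cqp, $source, $dest, $min_len) = @_;
    my @table = $cqp->dump($source);    # rows: [match, matchend, target, keyword]
    # assumption: matchend is inclusive, so length = matchend - match + 1
    my @keep = grep { $_->[1] - $_->[0] + 1 >= $min_len } @table;
    return 0 unless @keep;              # nothing to upload
    $cqp->undump($dest, @keep);
    return scalar @keep;
}
```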

$status = $cqp->status; # "ok" or "error"
$ok = $cqp->ok;
@lines = $cqp->error_message;
$cqp->error(@message);

Error handling functions. status returns the status of the last CQP command executed, which is either 'ok' or 'error'. ok returns true or false, depending on whether the last command was completed successfully (i.e., it is a simple convenience wrapper for the expression ($cqp->status eq 'ok')). error_message returns the error message (if any) generated by the last CQP command, as a list of chomped lines.

error is an internal function used to report CQP errors. It may also be of interest to application programs if a suitable error handler has been defined (see below).

$cqp->set_error_handler(\&my_error_handler);
$cqp->set_error_handler('die' | 'warn' | 'ignore');

The first form of the set_error_handler method activates a user-defined error handler. The argument is a reference to a named or anonymous subroutine, which will be called whenever a CQP error is detected (or an error is raised explicitly with the error method). The error message is passed to the handler as an array of chomped lines. If the error handler returns, the error condition will subsequently be ignored (but still be reported by status and ok).

The second form of the method activates one of the built-in error handlers:

  • 'die' aborts program execution with an error message; this handler is particularly convenient for one-off scripts or command-line utilities that do not need to recover from error conditions.

  • 'warn' prints the error message on STDERR, but continues program execution. This is the default error handler of a new CWB::CQP object.

  • 'ignore' silently ignores all errors. The application script should check for error conditions after every CQP command, using the ok or status method.
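
With the 'ignore' handler, checking after every command can be factored into a small wrapper. This is a sketch under that assumption: try_exec is a hypothetical helper that uses only the exec, ok and error_message methods and expects the caller to have selected the 'ignore' handler beforehand.

```perl
use strict;
use warnings;

# Hypothetical helper: execute $cmd and report success explicitly.
# Returns (1, @output_lines) on success and (0, @error_lines) on failure.
# Intended for use after $cqp->set_error_handler('ignore').
sub try_exec {
    my ($cqp, $cmd) = @_;
    my @lines = $cqp->exec($cmd);
    return (1, @lines) if $cqp->ok;
    return (0, $cqp->error_message);
}

# Typical use, given an existing object:
#   my ($ok, @result) = try_exec($cqp, $my_cmd);
#   warn "CQP failed: @result\n" unless $ok;
```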

$query = $cqp->quote($regexp);
$regexp = $cqp->quotemeta($string);

Safely quotes regular expressions and literal strings for use in CQP queries and other commands. The quote method encloses $regexp in single or double quotes, as appropriate, and escapes quote characters inside the string by doubling. quotemeta escapes all known regular expression metacharacters in $string with backslashes (including the backslash itself). It does not surround $string with quotes, so if you want a CQP expression that searches $string as a literal string, you have to combine them into $cqp->quote($cqp->quotemeta($string)). Both methods are vectorised, so you can pass multiple arguments in one call.
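
The combination of both methods can be wrapped in a helper that builds a complete token expression for a literal word form. This is a sketch: literal_word_query is a hypothetical helper, and the [word = ...] pattern assumes that the corpus uses the customary positional attribute word.

```perl
use strict;
use warnings;

# Hypothetical helper: build a CQP token expression that matches $word
# literally, with all regular expression metacharacters escaped.
# Assumes the positional attribute is called "word".
sub literal_word_query {
    my ($cqp, $word) = @_;
    my $quoted = $cqp->quote($cqp->quotemeta($word));
    return "[word = $quoted]";
}

# Typical use, given an existing object:
#   my @lines = $cqp->exec_query(literal_word_query($cqp, $user_input));
```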

$cqp->debug(1);
$cqp->debug(0);

Activate/deactivate debugging mode, which logs all executed commands and their complete output on STDOUT. The debug method returns the previous status for convenience.

$cqp->progress_on;
$cqp->progress_off;
$message = $cqp->progress;
($total, $pass, $n_passes, $msg, $percent) = $cqp->progress_info;

CQP progress messages can be activated and deactivated with the progress_on and progress_off methods (corresponding to set ProgressBar on|off; in CQP).

If active, progress information can be obtained with the method progress, which returns the last progress message received from CQP. The progress_info method returns pre-parsed progress information, consisting of estimated total percentage of completion ($total), the current pass ($pass) and total number of passes ($n_passes) for multi-pass operations, the information part ($msg, either a percentage or a free-form progress message), and the completion percentage of the current pass ($percent).

It is an error to call progress or progress_info without activating progress messages first.

$cqp->set_progress_handler(\&my_progress_handler);

Set a user-defined progress handler, which will be invoked whenever new progress information is received from CQP. The argument must be a named or anonymous subroutine, which will be called with the information returned by progress_info. Note that setting a user-defined progress handler does not automatically activate progress information: you still need to call progress_on for this purpose.

Calling set_progress_handler with undef (or without an argument) disables the user-defined progress handler.
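
The steps for using a progress handler can be bundled as follows. This is a sketch: exec_with_progress and print_progress are hypothetical names, and the code assumes that the handler is also invoked while a command started with exec is running.

```perl
use strict;
use warnings;

# Hypothetical helper: execute $cmd with progress reporting enabled,
# then restore the default state (no handler, progress messages off).
sub exec_with_progress {
    my ($cqp, $cmd, $handler) = @_;
    $cqp->set_progress_handler($handler);
    $cqp->progress_on;                  # handler alone does not enable messages
    my @lines = $cqp->exec($cmd);
    $cqp->progress_off;
    $cqp->set_progress_handler(undef);  # disable the user-defined handler
    return @lines;
}

# Example handler: receives the same values that progress_info returns.
sub print_progress {
    my ($total, $pass, $n_passes, $msg, $percent) = @_;
    print STDERR "pass $pass of $n_passes: $msg ($total% overall)\n";
}

# Typical use, given an existing object:
#   my @lines = exec_with_progress($cqp, $my_cmd, \&print_progress);
```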

COPYRIGHT

Copyright (C) 2002-2022 Stephanie Evert [https://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.