NAME

  CWB::Web::Search - A WWW search style front-end to the CWB

SYNOPSIS

  use CWB::Web::Search;

  # typically, a search object is used for a single search only
  $search = new CWB::Web::Search 'WEB-SITE-INDEX';
  # here, 'WEB-SITE-INDEX' is a CWB-encoded corpus containing the
  # textual content of an indexed WWW site

  $search->window("1 document"); # search window 
  $search->context("document");  # match context returned as HTML
  # window and context size are specified in CQP syntax
  $search->data("url", "date");  # values of s-attributes are returned
  # here, markup is <url http://...> and <date 13 Oct 1999>
  $search->ignore_case(1);       # case-insensitive search
  $search->ignore_diacritics(1); # search ignores diacritics
  $search->cull('after');        # remove duplicate documents (context)
  $search->highlight('<font color=red>'); # HTML highlighting tag

  # run query - returns number of matches (for convenience)
  $nr_matches = $search->query("+editor", "free", "GNU", "-Microsoft");
  # look for documents containing the word 'editor', preferably
  # 'free' or 'GNU' as well, and not containing 'Microsoft'
  $nr_matches = $search->size;  # same as number returned by query()

  # alternatively, let CWB::Web::Search parse the query string
  $nr_matches = $search->query_string("+editor free GNU -Microsoft");

  # typical result processing loop
  for ($i = 0; $i < $nr_matches; $i++) {
    $nr = $i + 1;               # match number
    $m = $search->match($i);    # returns result struct without 'context'
    $m->{'cpos'};               # corpus position of match centre
    $m->{'quality'};            # relevance of this match
    $m->{'summary'};            # summary of match (HTML encoded)
    $m->{'data'}->{'url'};      # requested data values
    $m->{'data'}->{'date'};
    if ($want_context) {
      $m = $search->match($i, 'context');
      $m->{'context'};          # match with context (HTML encoded)
    }
  }

  undef $search;

DESCRIPTION

The CWB::Web::Search module executes simple queries, similar to those of commercial Web search engines, on CWB-encoded corpora. The query() method returns keywords found in the corpus with the requested amount of context in HTML format. Additionally, data stored in structural attributes can be returned. Typically, a CGI script will create a CWB::Web::Search object for a single query.

ERRORS

If the CWB::Web::Search module encounters an error condition, it prints an error message to STDERR and terminates the program. A user-defined error handler can be installed with the on_error() method. In this case, the error callback function is passed the error message generated by the module as a list of strings.
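
A minimal sketch of such a callback follows. It assumes that on_error() accepts a code reference, which is not spelled out here, so the installation call is shown only as a comment. The helper name format_error and the message prefix are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the module passes the error message as a list
# of strings, so each string is turned into one prefixed output line.
sub format_error {
    my @lines = @_;
    return join "", map { "search error: $_\n" } @lines;
}

# Assumed installation (calling convention not verified):
#   $search->on_error(sub { print STDERR format_error(@_); exit 1 });

# Stand-alone demonstration with a made-up message
print STDERR format_error("can't access corpus", "check the registry setting");
```

Keeping the formatting in a separate function makes it possible to check the handler without a corpus or a running CGI environment.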

CORPUS REGISTRY

If you need to use a registry other than the default corpus registry, you should change the setting directly in the CWB::CL module.

  use CWB::CL;
  $CWB::CL::Registry = "/path/to/my/registry";

This will affect all new CWB::Web::Search objects.

RESULT STRUCTURE

The search module's match() method returns a result struct for the n-th match of the last query executed. A CGI script will usually iterate through all matches with a loop similar to this:

    $nr_matches = $search->query(...);
    for ($n = 0; $n < $nr_matches; $n++) {
      $m = $search->match($n);
      # code for processing match data in result struct $m 
    }

A result struct $m has the following fields:

$m->{'cpos'}

Corpus position of the centre of this match (the centre is computed from the positions of all search keywords in a match).

$m->{'quality'}

An estimate of the relevance of this match. This ranking is given as a percentage with 100% corresponding to a "perfect match". The matches found by the query() method are sorted according to their 'quality' value.

$m->{'summary'}

A text segment from the corpus containing most of the keywords found in this match (up to a reasonable maximum length). It is returned in HTML format with the keywords highlighted.

$m->{'context'}

The text segment from the corpus containing all keywords found in this match, expanded according to the context() setting. It is returned in HTML format with the keywords highlighted.

NB: The 'context' field is only included if the 'context' switch was passed to the match() method:

    $m = $search->match($n, 'context');

See the remarks on virtual context in the description of the cull() method below.

$m->{'data'}

The values of the structural attributes requested by the data() method are returned in the subfields of the 'data' field. A typical CGI application will use the 'data' field to retrieve document URLs, e.g.

    $match_url = $m->{'data'}->{'url'};

where the search corpus contains regions like the following

    <url http://www.ims.uni-stuttgart.de/> ... </url>

The values stored in the 'data' field are not HTML encoded.
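
Because 'summary' is already HTML encoded while the 'data' values are not, a script has to escape the latter itself before embedding them in a page. The following is a minimal sketch using a mock result struct in place of $search->match($n). The helpers escape_html and format_hit are hypothetical and not part of the module, and 'quality' is assumed to be a plain number.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, so later entities stay intact
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: build one line of a result list from a result struct
sub format_hit {
    my ($m) = @_;
    my $url = escape_html($m->{'data'}->{'url'});
    # 'summary' is already HTML encoded and is inserted unchanged
    return qq{<a href="$url">$url</a> ($m->{'quality'}%): $m->{'summary'}};
}

# Mock result struct with made-up values
my $m = {
    'quality' => 80,
    'summary' => 'a <b>free</b> editor',
    'data'    => { 'url' => 'http://www.example.org/?a=1&b=2' },
};
print format_hit($m), "\n";
```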

METHODS

$search = new CWB::Web::Search $corpus;

Creates a CWB::Web::Search object for WWW search queries on the CWB corpus $corpus.

@results = $search->query($key1, $key2, ... );

Searches corpus for the specified keywords and returns a list of matches sorted by (decreasing) relevance.

See "RESULT STRUCTURE" for the format of the @results list.
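
The keyword prefixes used in the SYNOPSIS ('+' for a keyword that must occur, '-' for one that must not occur, no prefix for a preferred keyword) can be illustrated with a small stand-alone sketch. It does not reproduce the module's own parser. The helper classify_keywords and the role names are hypothetical, and splitting the query string on whitespace is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: sort keywords into three roles by their prefix.
# A lone '+' or '-' has no keyword after it and is kept as a preferred keyword.
sub classify_keywords {
    my @keys = @_;
    my %role = (required => [], preferred => [], excluded => []);
    for my $key (@keys) {
        if    ($key =~ /^\+(.+)$/) { push @{ $role{required} },  $1 }
        elsif ($key =~ /^-(.+)$/)  { push @{ $role{excluded} },  $1 }
        else                       { push @{ $role{preferred} }, $key }
    }
    return \%role;
}

# Split a query string on whitespace (assumed) and classify the parts
my $role = classify_keywords(split ' ', "+editor free GNU -Microsoft");
for my $name (qw(required preferred excluded)) {
    print "$name: @{ $role->{$name} }\n";
}
```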

COPYRIGHT

Copyright (C) 1999-2022 Stephanie Evert [http://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.