
NAME

Alvis::QueryFilter - Perl module providing SRU query filtering

SYNOPSIS

   my $QF=Alvis::QueryFilter->new();

DESCRIPTION

Provides a query translation and filtering interface for an SRU server. Queries are first lemmatised by the Treetagger, then translated according to rules in a set of dictionaries, and finally passed to an SRU server. The query translation data is then added to the <extraResponseData> field of the results.

Query translation uses a fixed scheme for naming the fields it generates, and the underlying SRU server must support these fields.

Words in double quotes are left as is. The remaining words are lemmatised by the Treetagger, and contiguous sequences of them are matched against the term and named entity rules.

Terms recognised in the input query generate a term="words" entry in the transformed query. If an ontology node exists for a term, the corresponding ontology path is prepended, giving a term="onto-path/words" entry. Named entities recognised in the input query generate an entity="words" entry or, where ontologies are applied, an entity="onto-path/words" entry. When typing is used for named entities, an entity-type="words" entry is made.

Words that are lemmatised but not used in any term or named entity create a lemma="word" entry.
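The entry formats described above can be illustrated with a small standalone sketch. It does not use Alvis::QueryFilter; the helper make_entry() and the sample ontology path are hypothetical and only show how a path is prepended to the words when one is known.

```perl
use strict;
use warnings;

# Hypothetical data: canonical form -> ontology path (not taken from the module).
my %onto_path = ( 'gene expression' => 'biology/genetics' );

# Build a field="value" entry, prepending the ontology path when one is known.
sub make_entry {
    my ($field, $words) = @_;
    my $path  = $onto_path{$words};
    my $value = defined $path ? "$path/$words" : $words;
    return sprintf '%s="%s"', $field, $value;
}

print make_entry('term',  'gene expression'), "\n";  # term="biology/genetics/gene expression"
print make_entry('term',  'cell wall'),       "\n";  # term="cell wall"
print make_entry('lemma', 'grow'),            "\n";  # lemma="grow"
```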

METHODS

new()

Create object.

         my $QF=Alvis::QueryFilter->new();

read_dicts()

Sets the filenames for the linguistic resources, and loads them up. Must be called once at the start.

   if (!$QF->read_dicts($lemma_dict_f, $term_dict_f, $NE_dict_f, 
                        $typing_rules_f, $onto_nodes_f, $onto_mapping_f)) {
     die("Reading the dictionaries failed.");
   }

Dictionary rules apply to the lemmatised forms after the Treetagger has been used.

$lemma_dict_f : Lists (text-occurrence,lemma,part-of-speech) for lemmatising to be done on words left as unknown by the Treetagger. The part of speech is annotation only and is not used.

$term_dict_f : Lists (text-occurrence,canonical-form) for terms.

$NE_dict_f : Lists (text-occurrence,canonical-form) for named entities.

$typing_rules_f : Lists (canonical-form,type) for named entities. Types are short text items (e.g., 'species', 'company', 'person') used to categorise named entities when no ontology is in use.

$onto_nodes_f : Lists (canonical-form,ontology-node) for terms and named entities that are located in the ontology. If named entities occur here, $typing_rules_f should be empty.

$onto_mapping_f : Lists (ontology-node,ontology-path) giving fully expanded path for each node.

Entries in "NEs" and "terms" are applied as rules to query words, with longest match applying first. Once all these are done, the typing or ontology forms are applied.
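Longest-match-first rule application can be sketched as follows. This is a standalone illustration, not the module's code: the rule table, the function match_longest() and the helper render() are hypothetical names, and the real implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical rule table: text occurrence -> canonical form.
my %rules = (
    'new york'      => 'New York',
    'new york city' => 'New York City',
);

# Scan the words left to right, trying the longest contiguous span first.
# Returns a list of [text, matched-flag] pairs.
sub match_longest {
    my ($rules, @words) = @_;
    my @out;
    my $i = 0;
    while ($i < @words) {
        my $matched = 0;
        for (my $len = @words - $i; $len >= 1; $len--) {
            my $span = join ' ', @words[$i .. $i + $len - 1];
            if (exists $rules->{$span}) {
                push @out, [ $rules->{$span}, 1 ];   # a rule applies to this span
                $i += $len;
                $matched = 1;
                last;
            }
        }
        if (!$matched) {
            push @out, [ $words[$i], 0 ];            # no rule starts at this word
            $i++;
        }
    }
    return @out;
}

# Render matched spans in brackets so they stand out.
sub render {
    my @parts;
    for my $item (@_) {
        push @parts, $item->[1] ? '[' . $item->[0] . ']' : $item->[0];
    }
    return join ' | ', @parts;
}

print render(match_longest(\%rules, qw(hotels in new york city))), "\n";
# hotels | in | [New York City]
```

Note that "new york" also matches inside the query, but the longer rule wins because longer spans are tried first.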

set_canon()

Sets the functions used to convert terms and names to a canonical form that will be used when matching against dictionaries. Call before reading dictionaries. This can be used to handle common elements of term matching such as (possibly dangerously) ignoring dashes.

       # Use a lexical variable so the caller's $_ is not overwritten.
       sub termcanonise { my $s = lc(shift());  $s =~ s/[\s\-]//g; return $s; }
       sub namecanonise { my $s = shift();  $s =~ s/[\s\-\.]//g; return $s; }
       $QF->set_canon(\&termcanonise,\&namecanonise);
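To see why ignoring dashes is possibly dangerous, the same two substitutions can be run on a few sample strings. The sketch below is standalone and does not call Alvis::QueryFilter; the sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# Same substitutions as the example above, using a lexical variable
# so that the global $_ is left untouched.
sub termcanonise { my $s = lc shift; $s =~ s/[\s\-]//g;   return $s; }
sub namecanonise { my $s = shift;    $s =~ s/[\s\-\.]//g; return $s; }

print termcanonise('Co-Operative Bank'), "\n";   # cooperativebank
print namecanonise('J.-P. Sartre'),      "\n";   # JPSartre

# Distinct words can collapse to the same canonical form:
print termcanonise('re-sign'), "\n";             # resign
print termcanonise('resign'),  "\n";             # resign
```

Because "re-sign" and "resign" share a canonical form, a dictionary entry for one would also match the other.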

set_lemma()

Sets the match pattern used to decide whether a lemma located by the Treetagger should be searched for in the lemma indexes or in the text indexes.

       $QF->set_lemma("^[NVJ]");
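The sketch below shows what such a pattern selects, under the assumption (not stated in this document) that it is applied as a regular expression to the part-of-speech tag assigned by the Treetagger. The helper index_for() and the sample tags are hypothetical.

```perl
use strict;
use warnings;

# Assumption: the pattern is matched against a part-of-speech tag.
sub index_for {
    my ($tag, $pattern) = @_;
    return $tag =~ $pattern ? 'lemma' : 'text';
}

my $pattern = qr/^[NVJ]/;   # same expression as in the set_lemma() call above

# Sample tags in the style of an English tagset (illustrative only).
for my $tag (qw(NN NNS VVZ JJ DT IN)) {
    printf "%-4s -> %s index\n", $tag, index_for($tag, $pattern);
}
```

Under this reading, tags starting with N, V or J (roughly nouns, verbs and adjectives) are routed to the lemma indexes and everything else to the text indexes.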

set_text_fields()

Sets the text fields expected of CQL output. Call before reading dictionaries.

      $QF->set_text_fields("text anchortext dc.title");

Fields are extracted by splitting on spaces.

The query filter assumes that unfielded query terms belong to the CQL field "text", and that any other fields occur only conjoined to the end of the query (i.e., not inside any other Boolean constructs). On output, and with the above call to set_text_fields(), every CQL terminal node of the form text="words" will be translated into the disjunction:

      ( text="words" OR anchortext="words" OR dc.title="words"  )
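The expansion can be reproduced with a few lines of standalone Perl. This is a sketch of the described behaviour rather than the module's code, and expand_text_node() is a hypothetical name.

```perl
use strict;
use warnings;

# Expand one text="words" terminal into a disjunction over the given fields.
sub expand_text_node {
    my ($fields, $words) = @_;
    my @fields = split ' ', $fields;     # fields are separated by spaces
    my @parts;
    for my $field (@fields) {
        push @parts, sprintf '%s="%s"', $field, $words;
    }
    return '( ' . join(' OR ', @parts) . ' )';
}

print expand_text_node('text anchortext dc.title', 'words'), "\n";
# ( text="words" OR anchortext="words" OR dc.title="words" )
```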

UI2Zebra()

Takes an SRU request (for instance, one received from your HTTP server) and performs the query translation, generating a new SRU request ready to send to the real SRU server. Details of the query mapping are stored in the object for later use by Zebra2UI().

      use LWP::UserAgent;
      my $ToZebra=$QF->UI2Zebra($SRU);
      my $ua = LWP::UserAgent->new;
      my $response = $ua->get("http://localhost:10000/$ToZebra");

Zebra2UI()

Filters the XML received from the real SRU server, wrapped as an HTTP response, adding the query translation data to the <extraResponseData> field as a <filter> entry. The argument is a reference to the response text.

   my $ua = LWP::UserAgent->new;
   my $response = $ua->get("http://localhost:10000/$ToZebra");
   if ( ! $QF->Zebra2UI( $response->content_ref ) ) {
      print STDERR "Unable to insert query for $SRU\n";
   }
   #  $response now ready to send back

SEE ALSO

See Alvis::Treetagger(3), run_QF.pl(1).

See http://www.alvis.info/alvis/Architecture_2fFormats#queryfilter for the XML formats and the schema.

See http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html for the Treetagger.

AUTHOR

Kimmo Valtonen, and some packaging by Wray Buntine.

COPYRIGHT AND LICENSE

Copyright (C) 2006 Kimmo Valtonen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.