NAME

WWW::Search::Generic - class for generic searching.

SYNOPSIS

    require WWW::Search;
    $search = new WWW::Search('Generic');

DESCRIPTION

This class implements a Generic search engine that is configurable via a simple configuration file.

In its current form it is more of a proof of concept, an alpha version 0.1 of sorts. Much more work needs to be done before it can rightfully earn its name.

This class exports no public interface; all interaction should be done through WWW::Search objects.

OPTIONS

With the exception of the location of the configuration file, all options can be specified either on the command line or within the configuration file.

search_url=URL

Specifies the URL to query with the Generic search engine. No default is provided, so this option is required.

search_debug, search_parse_debug, search_ref

These options are documented in WWW::Search.

search_prefix=prefix

Specifies the expression to use before the query string. For instance, Amazon's URL queries have the following form: http://www.amazon.com/keyword=searchstring&...

The search_prefix in this case would be "keyword=".

search_base_url=URL

Specifies a base URL to add to any URLs discovered by the engine. For instance, in the case of Amazon.com, all the generated URLs in a search are relative. The search_base_url can be used to prepend a string to each result to produce an absolute reference.

search_next_base_url=URL

Similar to search_base_url, this can be used to create an absolute reference for the next URL to be fetched by the engine in the case of multi-page searches.

NOTE: It would be better if the absolute URLs were computed, rather than forcing the user to specify these options.
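The effect of search_base_url can be pictured as plain string concatenation. The following is only a sketch under that assumption: the helper name absolutize and the sample paths are hypothetical, and the module's own handling may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend a base URL to a relative link.
# Assumption: the base is joined by simple concatenation, and links
# that already carry a scheme are returned unchanged.
sub absolutize {
    my ($base, $link) = @_;
    return $link if $link =~ m{^[a-z][a-z0-9+.-]*://}i;
    $base =~ s{/+$}{};                       # drop trailing slashes on the base
    $link = "/$link" unless $link =~ m{^/};  # ensure exactly one separator
    return $base . $link;
}

print absolutize('http://www.amazon.com', '/exec/obidos/example'), "\n";
```

The same helper would serve for search_next_base_url, since both options only differ in which URL they are applied to.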

SEE ALSO

To make new back-ends, see WWW::Search.

HOW DOES IT WORK?

The user first specifies an INI-style configuration file that defines the parameters of the search. The configuration file is broken down into 3 major sections:

[options]

This defines the engine-level options such as search_url, search_base_url, etc. One can also define options to be used in the generated query string. For instance, color=black would append '&color=black' to the final query string URL.
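As a rough illustration of how these options might combine into a request URL, consider the sketch below. It is not taken from the module: the function name build_query_url, the '?' separator after search_url, the escaping rule, and the ordering of the extra options are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sketch of assembling the request URL from [options].
# Assumptions (not confirmed by the module): a '?' separates search_url
# from the query string, and extra options follow the query term.
sub build_query_url {
    my ($opts, $query, $extra) = @_;
    # Percent-encode everything except unreserved characters.
    (my $escaped = $query) =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord $1)/ge;
    my $url = $opts->{search_url} . '?' . $opts->{search_prefix} . $escaped;
    # Each extra option is appended as '&name=value', e.g. '&color=black'.
    $url .= "&$_=$extra->{$_}" for sort keys %$extra;
    return $url;
}

my $url = build_query_url(
    { search_url => 'http://www.example.com/search', search_prefix => 'keyword=' },
    'perl cookbook',
    { color => 'black' },
);
print "$url\n";
```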

[search]

These define the search-level patterns. This is best explained by way of example:

    search1=approximate_result_count
    search1_pat=([0-9]+) total matches for

The above two configuration parameters define how to determine the approximate_result_count for a WWW::Search. The Generic engine looks for the first match of the search pattern and sets "approximate_result_count" to the first matching sub-pattern (i.e., $1). See below for more complex examples of search patterns, which are at the heart of the Generic engine.
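In plain Perl, a search-level pattern amounts to a single, non-global match. The sketch below assumes the pattern is used as-is with no extra modifiers; the sample page text and the %result hash are made up for illustration.

```perl
use strict;
use warnings;

# Sample page text; a real page would come from the HTTP response.
my $page = '<b>1234 total matches for perl</b> ... 99 total matches for';

# search1 / search1_pat as they would appear under [search].
my %search = (
    name => 'approximate_result_count',
    pat  => '([0-9]+) total matches for',
);

# Only the first match is used; $1 becomes the property's value.
my %result;
if ($page =~ /$search{pat}/) {
    $result{ $search{name} } = $1;
}

print "$result{approximate_result_count}\n";
```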

[hit]

These define the hit-level patterns. These differ from the search-level patterns in that they are continuously and exhaustively applied to the result page until no more matches are left. For example:

    search1=raw add_url title author format date price
    search1_pat= <<EOT
    (<tr valign=top>
    <td rowspan=2>
    <b>.*</b></td>
    <td colspan=2 valign=top>
    <font size=-1><b>
    <a href=(.*)>(.*)</a></b><br>
    by (.*)\.
    (.*)
    \((.*)\)
    </td>
    </tr>
    <tr valign=top><td.*><font size=-1>
    Our Price:\$(.*)<br>)
    EOT

The above pattern will be applied to the page, and Generic will create a new 'hit' (i.e., a WWW::SearchResult object) for every match. The matching sub-patterns (i.e., $1, $2, $3, etc. in Perl nomenclature) will be assigned to raw, add_url, title, author, etc.

The assignment is done by first checking whether the 'hit' object supports a method by that name (e.g., raw, add_url, title) and then calling that method. If the method is not supported (e.g., author, format, price), then the generic _elem method is called for that particular property.
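The loop-and-dispatch behaviour described here can be sketched in a few lines. My::Hit is a hypothetical stand-in for WWW::SearchResult with only two named accessors, and the pattern and page text are simplified samples; none of this is the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::SearchResult, reduced to two named
# accessors plus a generic _elem fallback.
package My::Hit;
sub new     { return bless {}, shift }
sub title   { my $s = shift; $s->{title} = shift if @_; return $s->{title} }
sub add_url { my $s = shift; push @{ $s->{urls} }, @_; return }
sub _elem   { my ($s, $k, @v) = @_; $s->{$k} = $v[0] if @v; return $s->{$k} }

package main;

my @fields  = qw(add_url title author);
my $pattern = '<a href=(.*?)>(.*?)</a> by (.*?)\.';
my $page    = '<a href=/b/1>First</a> by Ann. <a href=/b/2>Second</a> by Bob.';

my @hits;
# Apply the pattern repeatedly until no matches remain.
while ($page =~ /$pattern/g) {
    my @captures = ($1, $2, $3);   # one capture per field name
    my $hit = My::Hit->new;
    for my $i (0 .. $#fields) {
        my $name = $fields[$i];
        if ($hit->can($name)) { $hit->$name($captures[$i]) }          # named method
        else                  { $hit->_elem($name, $captures[$i]) }   # generic fallback
    }
    push @hits, $hit;
}

printf "%s by %s (%s)\n", $_->title, $_->_elem('author'), $_->{urls}[0] for @hits;
```

Here add_url and title are dispatched to their own methods, while author, which My::Hit does not define, falls through to _elem.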

SAMPLE CONFIGURATION FILE

Here is a sample configuration file for searching Amazon.com:

    [options]
    search_base_url=http://www.amazon.com
    search_next_base_url=http://www.amazon.com
    search_url=http://www.amazon.com/exec/obidos/external-search
    search_prefix=keyword=
    search_debug=1
    index=books
    rank=+featuredrank

    # For search-level parameters and search patterns.
    [search]
    search1=approximate_result_count
    search1_pat=([0-9]+) total matches for
    search2=_next_url
    search2_pat=<a href=(.*)><img src="http://g-images.amazon.com/images/G/01/search-browse/button-more-results.gif" width=101 height=20 border=0 alt="More Results"></a>

    # For hit-level parameters and search patterns, e.g.,:
    # add_url, change_date, description, index_date, normalized_score, raw,
    # score, size, title, company, location, source
    [hit]
    search1=raw add_url title author format date price availability score
    search1_pat= <<EOT
    (<tr valign=top>
    <td rowspan=2>
    <b>.*</b></td>
    <td colspan=2 valign=top>
    <font size=-1><b>
    <a href=(.*)>(.*)</a></b><br>
    by (.*)\.
    (.*)
    \((.*)\)
    </td>
    </tr>
    <tr valign=top><td.*><font size=-1>
    Our Price:\$(.*)<br>
    <br>
    </td>
    <td.*><font size=-1>
    <font color=\#990000>
    (.*)<BR>
    <!--.* -->
    </font>
    Average Customer Review:
    <IMG SRC="http://images.amazon.com/images/G/01/detail/stars-(.*).gif" border=0 height=12 width=64 ALT=".*">
    </td>
    </tr>)
    EOT
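To make the file format concrete, here is a minimal sketch of a reader for it. This is an illustration only: parse_config is a hypothetical name, how the module itself reads its configuration is not described in this document, and joining multi-line values with newlines is an assumption.

```perl
use strict;
use warnings;

# Hypothetical reader for the INI-style format shown above. It handles
# [section] headers, name=value pairs, '#' comment lines, and values of
# the form '<<EOT' that run until a line containing only 'EOT'.
sub parse_config {
    my ($text) = @_;
    my (%config, $section);
    my @lines = split /\n/, $text;
    while (@lines) {
        my $line = shift @lines;
        next if $line =~ /^\s*(?:#|$)/;                  # skip comments and blanks
        if ($line =~ /^\s*\[(\w+)\]\s*$/) { $section = $1; next }
        next unless defined $section
            && $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/;
        my ($name, $value) = ($1, $2);
        if ($value =~ /^<<\s*(\w+)$/) {                   # multi-line value
            my ($tag, @body) = ($1);
            while (@lines) {
                my $next = shift @lines;
                last if $next eq $tag;
                push @body, $next;
            }
            $value = join "\n", @body;
        }
        $config{$section}{$name} = $value;
    }
    return \%config;
}

my $config = parse_config(<<'END');
[options]
search_prefix=keyword=

# search-level patterns
[search]
search1=approximate_result_count
search1_pat= <<EOT
([0-9]+) total
matches for
EOT
END

print "$config->{search}{search1}\n";
```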

AUTHOR

WWW::Search::Generic is written and maintained by Robert Locke, <rlocke@infiniteinfo.com>.

COPYRIGHT

Copyright (c) 2000 Infiniteinfo, Inc. All rights reserved.

Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms and that any documentation, advertising materials, and other materials related to such distribution and use acknowledge that the software was developed by the Infiniteinfo, Inc. The name of the company may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
