NAME

WWW::Scraper::TidyXML - TidyXML and XPath support for Scraper.

SYNOPSIS

    use WWW::Scraper::TidyXML;

in your Scraper module.

Use 'TidyXML' and 'XPath' options in the scraperFrame to locate relevant regions of the results and details pages.

DESCRIPTION

One of the easiest ways to implement a Scraper engine for a new search engine is to use the "TidyXML" method.

The basic idea behind TidyXML is to convert the search engine's results page into XML. Then it is *much* easier to locate regions in the results page that contain the relevant data.

Here is how it works:

1. Process the results page through the Tidy program (available at tidy.sourceforge.net).

2. Convert the search engine's results page to XML with Tidy (via 'TidyXML' option in the scraperFrame).

3. Select relevant data via the 'XPath' scraperFrame operation (see XPath definition at http://www.w3.org/TR/xpath).

Most search engines generate remarkably ill-structured HTML. The Tidy program fixes that up and converts it to well-formed XML, making it accessible to Scraper's XML parsing operations. As XML, it is much easier to identify the relevant regions.

To initially develop a TidyXML-based Scraper module, you, as the implementor, manually process a results page through Tidy (using the -asxml option). This produces an XML file that can be viewed with any XML viewer. Internet Explorer or Netscape Navigator work well for this. My personal favorite is XMLSpy: try the toolbar's 'Table' / 'Display as Table' option for revealing views. A 30-day free trial is available at www.xmlspy.com.
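The manual Tidy run described above could also be scripted from Perl. The following is only a minimal sketch, assuming a tidy executable is on your PATH. The helper names tidy_command and tidy_to_xml are hypothetical and are not part of WWW::Scraper. Only the -asxml option comes from the text above.

```perl
use strict;
use warnings;

# Build the argument list for a Tidy run (hypothetical helper).
# -asxml is the option named in the text; the file name is supplied by the caller.
sub tidy_command {
    my ($html_file) = @_;
    return ('tidy', '-asxml', $html_file);
}

# Run Tidy and return the converted document as one string (hypothetical helper).
# Tidy exits with status 1 when it only issued warnings and 2 when it hit errors.
sub tidy_to_xml {
    my ($html_file) = @_;
    open(my $tidy, '-|', tidy_command($html_file))
        or die "cannot run tidy: $!\n";
    my $xml = do { local $/; <$tidy> };   # slurp everything Tidy wrote to stdout
    close($tidy);
    my $status = $? >> 8;
    die "tidy reported errors for $html_file\n" if $status >= 2;
    return $xml;
}

# Convert the file named on the command line, if one was given.
if (@ARGV) {
    print tidy_to_xml($ARGV[0]);
}
```

Saved under a name of your choosing (say tidy_page.pl), this would be run as "perl tidy_page.pl results.html > results.xml", after which results.xml can be opened in an XML viewer.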

Browse the XML to visually identify the relevant regions. Then code these regions into your new engine implementation via XPath notation. For example, the first <TABLE> of the converted HTML would be selected by

        /html/body/table

To skip the first <TABLE> and select the second, use

        /html/body/table[2]

You would then pass this <TABLE> to your next phase, where <TR>s in that table would be selected by

        /table/tr

and <TD>s within that <TR> might be selected by

        /tr/td[1]
        /tr/td[2]
        etc.
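To try out this kind of staged selection outside of Scraper, something like the following sketch can help. It assumes the CPAN module XML::XPath is installed, and it uses a small hand-written document in place of real Tidy output. Scraper hands each selected element to the next phase, which is why the paths above start again at /table or /tr. A plain XPath library expresses the same narrowing with relative paths evaluated against a context node, which is what this sketch does.

```perl
use strict;
use warnings;
use XML::XPath;   # assumption: installed from CPAN

# Stand-in for a Tidy-converted results page (invented sample data).
my $xml = <<'XML';
<html><body>
<table><tr><td>banner</td></tr></table>
<table>
<tr><td>First hit</td><td>http://example.com/1</td></tr>
<tr><td>Second hit</td><td>http://example.com/2</td></tr>
</table>
</body></html>
XML

my $xp = XML::XPath->new(xml => $xml);

# Phase 1: skip the first table and take the second one.
my ($table) = $xp->findnodes('/html/body/table[2]');

# Phase 2: each row of that table; phase 3: the cells of each row.
my @rows;
for my $tr ($xp->findnodes('tr', $table)) {
    my ($title) = $xp->findnodes('td[1]', $tr);
    my ($url)   = $xp->findnodes('td[2]', $tr);
    push @rows, [ $title->string_value, $url->string_value ];
}

print join(' | ', @$_), "\n" for @rows;
```

Experimenting this way makes it easier to settle on the right paths before writing them into a scraperFrame.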

Complete the coding of your new engine implementation by specifying TidyXML conversion in the scraperFrame. This causes Scraper to repeat these Tidy conversions on each results page it processes.

EXAMPLES

Here is an example scraperFrame from a simple implementation for Dogpile.com:

       [ 'TidyXML',
          [ 
            [ 'XPath', '/html/body',
              [
                [ 'HIT*' ,
                  [
                    [ 'XPath', '/body/p[hit()]',
                      [
                         [ 'A', 'url', 'title' ]
                        ,[ 'XPath', '/p/i', 'company' ]
                      ]
                    ],
                  ]
                ]
              ]
            ]
          ]
       ];
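The nesting in a frame like this can be hard to follow by eye. The sketch below is one way to read it, not how Scraper itself interprets a frame. It treats each operation as an array reference holding a name, any scalar arguments, and optionally a nested list of child operations, and prints an indented outline. The outline function is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;

# The Dogpile frame from above, unchanged apart from whitespace.
my $scraperFrame =
    [ 'TidyXML',
      [
        [ 'XPath', '/html/body',
          [
            [ 'HIT*',
              [
                [ 'XPath', '/body/p[hit()]',
                  [
                    [ 'A', 'url', 'title' ],
                    [ 'XPath', '/p/i', 'company' ],
                  ]
                ],
              ]
            ],
          ]
        ],
      ]
    ];

# Return one indented line per operation (hypothetical helper).
sub outline {
    my ($op, $depth) = @_;
    # Scalars are the operation name and its arguments.
    my @parts = grep { !ref($_) } @$op;
    # Array references hold the child operations of this operation.
    my @children = map { @$_ } grep { ref($_) eq 'ARRAY' } @$op;
    my @lines = (('  ' x $depth) . join(' ', @parts));
    push @lines, outline($_, $depth + 1) for @children;
    return @lines;
}

my @outline = outline($scraperFrame, 0);
print "$_\n" for @outline;
```

The printed outline shows TidyXML at the top, the /html/body selection inside it, the repeating HIT* block inside that, and the two field extractions ('A' and the /p/i XPath) at the innermost level.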

This took me a leisurely 30 minutes to discover and implement. Of course, Dogpile is remarkably well formed to begin with, and even there a complete implementation does require a few more touches. Take a look at Dogpile.pm for further details.

SPECIAL THANKS

To Dave Raggett <dsr@w3.org> (original author), and to Tor-Ivar Valåmo, and his SourceForge team, for TidyHTML.

Without this tool, I'd have wasted untold millennia trying to keep up with many search engines. This tool, along with XPath and XMLSpy, makes configuring Scraper modules to new results pages extremely easy. See TidyHTML at http://sourceforge.net/projects/tidyhtml/.

AUTHOR and CURRENT VERSION

WWW::Scraper::TidyXML is written and maintained by Glenn Wood, http://search.cpan.org/search?mode=author&query=GLENNWOOD.

COPYRIGHT

Copyright (c) 2002 Glenn Wood. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
