WWW::Scraper::TidyXML - TidyXML and XPath support for Scraper.
Use the 'TidyXML' and 'XPath' options in the scraperFrame of your Scraper module to locate relevant regions of the results and details pages.
One of the easiest ways to implement a Scraper engine for a new search engine is to use the "TidyXML" method.
The basic idea behind TidyXML is to convert the search engine's results page into XML. Then it is *much* easier to locate regions in the results page that contain the relevant data.
Here is how it works:
1. Process the results page through the Tidy program (available at tidy.sourceforge.net).
2. Convert the search engine's results page to XML with Tidy (via 'TidyXML' option in the scraperFrame).
3. Select relevant data via the 'XPath' scraperFrame operation (see XPath definition at http://www.w3.org/TR/xpath).
Most search engines generate remarkably ill-structured HTML. The Tidy program fixes that up and converts it to well-formed XML, making it accessible to Scraper's XML parsing operations. As XML, it is much easier to identify the relevant regions.
To initially develop a TidyXML-based Scraper module, you, as the implementor, manually process a results page through Tidy (using its -asxml option). This produces an XML file that can be viewed with any XML viewer. Internet Explorer or Netscape Navigator work well for this (my personal favorite is XMLSpy; try its 'Table' > 'Display as Table' option for revealing views; a 30-day free trial is available at www.xmlspy.com).
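This manual conversion step might look like the following from a shell (the file names here are illustrative, and the tidy executable is assumed to be installed):

```shell
# Convert an ill-formed search-results page into well-formed XML
# suitable for browsing in an XML viewer such as XMLSpy.
tidy -asxml results.html > results.xml
```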
Browse the XML to visually identify the relevant regions, then code those regions into your new engine implementation in XPath notation. E.g., the first <TABLE> of the converted HTML would be selected by

    /html/body/table

Skipping the first <TABLE> to select the second table would be

    /html/body/table[2]

You would then pass this <TABLE> to your next phase, where the <TR>s in that table would be selected by

    /table/tr

and the <TD>s within each <TR> might be selected by

    /tr/td

etc.
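Scraper itself is written in Perl, but the selection idea can be illustrated with Python's standard-library ElementTree, which accepts the same kind of path expressions. The sample markup and variable names below are hypothetical, standing in for a tidied results page:

```python
import xml.etree.ElementTree as ET

# A miniature stand-in for a tidied results page (hypothetical markup).
TIDIED = """
<html><body>
  <table><tr><td>nav</td></tr></table>
  <table>
    <tr><td>Result 1</td><td>http://one.example</td></tr>
    <tr><td>Result 2</td><td>http://two.example</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(TIDIED)

# 'body/table' (relative to <html>) selects the first <TABLE>;
# an index picks the second, as /html/body/table[2] would in XPath.
first_table = root.find("body/table")
second_table = root.findall("body/table")[1]

# Within the chosen table, select each row, then each cell in each row.
rows = second_table.findall("tr")                            # /table/tr
cells = [td.text for tr in rows for td in tr.findall("td")]  # /tr/td
```

Once the expressions select the right regions in a sample page, the same paths go into the scraperFrame.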
Complete the coding of your new engine implementation by specifying TidyXML conversion in the scraperFrame. This causes Scraper to repeat these Tidy conversions on each results page it processes.
Here is an example scraperFrame from a simple implementation for Dogpile.com:

    [ 'TidyXML', [
        [ 'XPath', '/html/body', [
            [ 'HIT*', [
                [ 'XPath', '/body/p[hit()]', [
                    [ 'A', 'url', 'title' ],
                    [ 'XPath', '/p/i', 'company' ],
                ] ],
            ] ],
        ] ],
    ] ];
This took me a leisurely 30 minutes to discover and implement. Of course, Dogpile's pages are remarkably well formed to begin with, and even so a complete implementation does require a few more touches. Take a look at Dogpile.pm for further details.
Without this tool, I'd have wasted untold millennia trying to keep up with the many search engines. This tool, along with XPath and XMLSpy, makes configuring Scraper modules for new results pages extremely easy. See TidyHTML at http://sourceforge.net/projects/tidyhtml/.
WWW::Scraper::TidyXML is written and maintained by Glenn Wood, http://search.cpan.org/search?mode=author&query=GLENNWOOD.
Copyright (c) 2002 Glenn Wood All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.