WWW::Leech::Parser - HTML Page parser used by WWW::Leech::Walker
    use WWW::Leech::Parser;

    my $parser = new WWW::Leech::Parser({
        'item:link'     => '//a[contains(@class,"item-link")]',
        'nextpage:link' => '//a[contains(@class,"next-page-link")]',
        'fields'        => {
            'name'       => '//h1',
            'images[]'   => '//img/@src',
            'comments[]' => {
                type   => 'html',
                xpath  => '//div[@class="comments"]/div',
                filter => sub {
                    my $values     = shift;
                    my $field_defs = shift;
                    # ....
                    return $values;
                }
            }
            # ....
        }
    });

    my $html_string = '...';

    my $links_and_next_page_url = $parser->parseList($html_string);
    my $item = $parser->parse($html_string);
WWW::Leech::Parser extracts information from a web page using the supplied XPath expressions.
It has two jobs. First, it collects links to 'sub-pages' and the link to the 'next page' from a links-list page (e.g. search engine results). Second, it extracts the required data from the given HTML using the rules defined when the object is created.
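$rules, the hashref passed to the constructor, has the following keys:
item:link - XPath expression that extracts links to sub-pages.
nextpage:link - XPath expression that extracts the link to the next links-list page.
fields - tells the parser how to extract data. Can be provided as an arrayref: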
    $fields = [
        { name => 'fieldname1', xpath => '//somenode' },
        { name => 'fieldname2', xpath => '//othernode' }
    ]
Or a hashref:
    $fields = {
        fieldname1 => '//somenode',
        fieldname2 => { xpath => '//othernode' }
    }
By default the parser uses the text of the first matching node as the field value. Appending '[]' to the key name switches the parser to 'wantarray' mode, in which it returns an array of values instead.
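The relationship between the two forms and the '[]' suffix can be sketched in plain Perl. The helper below is hypothetical and is not part of WWW::Leech::Parser; the name normalize_fields and the wantarray flag are assumptions made for illustration, and the module's internals may well differ.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of WWW::Leech::Parser: converts the hashref
# form of 'fields' into the arrayref form and records the '[]' suffix.
sub normalize_fields {
    my ($fields) = @_;
    my @normalized;
    for my $key (sort keys %$fields) {
        my $def = $fields->{$key};
        # Simple form: the value is the XPath string itself.
        $def = { xpath => $def } unless ref($def) eq 'HASH';
        my %entry = %$def;
        my $name  = $key;
        # A trailing '[]' requests 'wantarray' mode ('wantarray' is an assumed flag name).
        $entry{wantarray} = ($name =~ s/\[\]\z//) ? 1 : 0;
        $entry{name}      = $name;
        push @normalized, \%entry;
    }
    return \@normalized;
}

my $fields = normalize_fields({
    'name'     => '//h1',
    'images[]' => { xpath => '//img/@src', type => 'unique' },
});

printf "%s wantarray=%d xpath=%s\n", @{$_}{qw(name wantarray xpath)} for @$fields;
```

Keys are sorted only to make the output order predictable; a plain hash carries no ordering, which is one reason the arrayref form may be preferable when field order matters.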
Every element can be given in a simple or a complex form.
The simple form is just a key-value pair where the key is the field name and the value is an XPath expression.
The complex form is a hashref describing the field in detail. The following keys are recognized:
xpath - Required. XPath expression for the element data.
type - Optional. One of:
    text - gets text content only (default)
    html - extracts the whole node, including the node itself, as is
    int - not applicable in 'wantarray' mode - removes non-numeric characters from the text value
    unique - only applicable in 'wantarray' mode - removes duplicates
filter - Coderef. The parser calls the filter callback with the extracted value and the field definition. The field value is replaced with whatever the callback returns.
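As a rough illustration of what the 'int' and 'unique' types and a filter callback do to extracted values, here is a self-contained sketch in plain Perl. The helpers as_int and as_unique are hypothetical stand-ins written for this example, not the module's code, and the sketch assumes that in 'wantarray' mode the callback receives an arrayref.

```perl
use strict;
use warnings;

# 'int' as described above: strip every non-digit character from one text value.
# Note that this simple version also drops signs and decimal points.
sub as_int {
    my ($text) = @_;
    (my $digits = $text) =~ s/\D+//g;
    return $digits;
}

# 'unique' as described above: drop duplicates from a list of values.
# Keeping first-seen order is an assumption of this sketch.
sub as_unique {
    my ($values) = @_;
    my %seen;
    return [ grep { !$seen{$_}++ } @$values ];
}

# A filter callback taking the documented arguments: the extracted value(s)
# and the field definition. Assumption: $values is an arrayref here.
my $trim_filter = sub {
    my ($values, $field_defs) = @_;
    return [ map { my $v = $_; $v =~ s/^\s+|\s+$//g; $v } @$values ];
};

print as_int('Price: 1 250 USD'), "\n";
print join(',', @{ as_unique([qw(a.png b.png a.png)]) }), "\n";
print join('|', @{ $trim_filter->([ '  first ', "second\n" ], { type => 'text' }) }), "\n";
```

A callback like $trim_filter would be passed as the filter value of a field in complex form, as the synopsis does for 'comments[]'.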
parseList($html_string) returns the list-page links as a hashref:
    {
        links      => [...],    # array of URLs
        links_text => [...],    # text inside the corresponding 'a' tags
        next_page  => "/page/N" # next page URL
    }
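A short sketch of how such a result might be consumed. The hashref below is hand-written sample data shaped like the structure above, not real parser output, and the behaviour of next_page on the last page (treated here as undefined or empty) is an assumption.

```perl
use strict;
use warnings;

# Hand-written sample shaped like the documented parseList() result.
my $list = {
    links      => [ '/item/1', '/item/2' ],
    links_text => [ 'First item', 'Second item' ],
    next_page  => '/page/2',
};

# links and links_text are parallel arrays: index i refers to the same 'a' tag.
my @pairs = map { "$list->{links_text}[$_] => $list->{links}[$_]" }
            0 .. $#{ $list->{links} };
print "$_\n" for @pairs;

# Assumption: next_page is undefined or empty when there is no further page.
print "next: $list->{next_page}\n"
    if defined $list->{next_page} && length $list->{next_page};
```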
parse($html_string) returns a hashref with the data extracted from the page according to the 'fields' section of the rules.
Dmitry Selverstov CPAN ID: JAREDSPB jaredspb@cpan.org
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.