Dmitry Selverstov

NAME

WWW::Leech::Parser - HTML page parser used by WWW::Leech::Walker

SYNOPSIS

  use WWW::Leech::Parser;

  my $parser = WWW::Leech::Parser->new({
    'item:link' => '//a[contains(@class,"item-link")]',
    'nextpage:link' => '//a[contains(@class,"next-page-link")]',
    'fields' => {
      'name' => '//h1',
      'images[]' => '//img/@src',
      'comments[]' =>{
        type => 'html',
        xpath => '//div[@class="comments"]/div',
        filter => sub{
          my $values = shift;
          my $field_defs = shift;

          # ....

          return $values;
        }

      }
      # ....
    }
  });

  my $html_string = '...';
  
  my $links_and_next_page_url = $parser->parseList($html_string);

  my $item = $parser->parse($html_string);

DESCRIPTION

WWW::Leech::Parser extracts information from a web page using the XPath expressions you provide.

Its first job is to get the links to 'sub-pages' and the link to the 'next page' from a links-list page (e.g. search engine results). It also extracts the required data from a given HTML string using the rules defined when the object is created.

DETAILS

new($rules)

$rules is a hashref with the following keys:

item:link

XPath expression that extracts the links to sub-pages.

nextpage:link

XPath expression that extracts the link to the next links-list page.

fields

Fields tell the parser how to extract data. They can be provided as an arrayref:

  $fields = [
    {
      name => 'fieldname1',
      xpath => '//somenode'
    },
    {
      name => 'fieldname2',
      xpath => '//othernode'
    }
  ]
  

Or as a hashref:

  $fields = {
    fieldname1 => '//somenode',
    fieldname2 => {
      xpath => '//othernode'
    }
  }
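
The two layouts carry the same information, which a small conversion can make concrete. The sketch below is not part of WWW::Leech::Parser: `fields_as_list` is a hypothetical helper that rewrites the hashref layout as the arrayref layout, treating a bare string value as the XPath expression.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WWW::Leech::Parser: rewrite the hashref
# layout of 'fields' as the equivalent arrayref layout.
sub fields_as_list {
    my ($fields) = @_;
    my @list;
    for my $key (sort keys %$fields) {    # sorted only to keep the output stable
        my $def = $fields->{$key};
        # A bare string is an XPath expression; a hashref already has 'xpath'.
        my %field = ref($def) eq 'HASH' ? %$def : (xpath => $def);
        push @list, { %field, name => $key };
    }
    return \@list;
}

my $list = fields_as_list({
    fieldname1 => '//somenode',
    fieldname2 => { xpath => '//othernode' },
});

print "$_->{name}: $_->{xpath}\n" for @$list;
```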

By default, the parser uses the text of the first matching node as the field value. Appending '[]' to the key name switches the parser to 'wantarray' mode, in which it returns an array of values instead.
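
To illustrate the naming convention only, here is a minimal sketch of how a '[]' suffix could select between the two modes. It does not reproduce the module's internals: `split_field_key` and `pick_value` are hypothetical helpers, and the node texts are passed in as plain strings.

```perl
use strict;
use warnings;

# Hypothetical helper: split a field key into its name and a wantarray flag.
sub split_field_key {
    my ($key) = @_;
    my $wantarray = ($key =~ s/\[\]\z//) ? 1 : 0;
    return ($key, $wantarray);
}

# Hypothetical helper: choose the field value from the texts of matched nodes.
sub pick_value {
    my ($key, @texts) = @_;
    my ($name, $wantarray) = split_field_key($key);
    return ($name, $wantarray ? [@texts] : $texts[0]);
}

my ($name,   $single) = pick_value('name',     'First', 'Second');
my ($images, $all)    = pick_value('images[]', 'a.png', 'b.png');

print "$name => $single\n";
print "$images => ", join(', ', @$all), "\n";
```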

Every field can be given in a simple or a complex form.

The simple form is just a key-value pair where the key is the field name and the value is an XPath expression.

The complex form is a hashref describing the field in detail. The following keys are recognized:

xpath

Required.

XPath expression for element data.

type

Optional.

 text - gets the text content only (default)
 html - extracts the whole node content, including the node itself, as is
 int - not applicable in 'wantarray' mode; removes non-numeric characters from the text value
 unique - only applicable in 'wantarray' mode; removes duplicates
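
The 'int' and 'unique' types amount to simple transformations of the extracted values. The sketch below shows one plausible reading in plain Perl. It is an assumption, not the module's code: whether the module keeps signs or decimal separators for 'int', and whether 'unique' preserves order, is not stated here.

```perl
use strict;
use warnings;

# Sketch of 'int': keep only the ASCII digits of a text value.
sub to_int {
    my ($text) = @_;
    (my $digits = $text) =~ s/[^0-9]//g;
    return $digits;
}

# Sketch of 'unique': drop repeated values, keeping first occurrences in order.
sub unique_values {
    my ($values) = @_;
    my %seen;
    return [ grep { !$seen{$_}++ } @$values ];
}

print to_int('1 299 USD'), "\n";    # digits only: 1299
print join(', ', @{ unique_values([ 'a.png', 'b.png', 'a.png' ]) }), "\n";
```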

filter

Optional.

Coderef. The parser calls the filter with the extracted value and the field definitions, and replaces the field value with whatever the callback returns.
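
As an example, a filter might strip surrounding whitespace from the extracted text. The sketch below can be run on its own. It assumes the first argument is an arrayref in 'wantarray' mode and a plain scalar otherwise, and it ignores the second argument; `$trim_filter` is an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical filter: strip leading and trailing whitespace.
my $trim_filter = sub {
    my ($values, $field_defs) = @_;    # $field_defs is unused in this sketch

    my $trim = sub {
        my ($text) = @_;
        $text =~ s/\A\s+|\s+\z//g;
        return $text;
    };

    # Assumption: arrayref in 'wantarray' mode, plain scalar otherwise.
    return ref($values) eq 'ARRAY'
        ? [ map { $trim->($_) } @$values ]
        : $trim->($values);
};

# Called directly here; the empty hashref stands in for the field definitions.
my $cleaned = $trim_filter->([ "  first comment \n", "second\t" ], {});
print join('|', @$cleaned), "\n";
```

In a rules hashref, such a coderef would be attached to a field as `filter => $trim_filter`.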

parseList($html_string)

Returns the list-page links as a hashref:

  {
    links => [...], # array of URLs
    links_text => [...], # Text inside corresponding 'a' tags
    next_page => "/page/N" # next page URL
  }
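
A caller usually walks the links together with their texts. The sketch below does that on a hand-written sample shaped like the hashref above; the sample values and the `pair_links` helper are hypothetical, and it assumes 'links' and 'links_text' are parallel arrays in the same order.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the parseList() result described above.
my $list = {
    links      => [ '/item/1', '/item/2' ],
    links_text => [ 'First item', 'Second item' ],
    next_page  => '/page/2',
};

# Pair each URL with its link text (assumes both arrays share the same order).
sub pair_links {
    my ($list) = @_;
    my @urls  = @{ $list->{links}      || [] };
    my @texts = @{ $list->{links_text} || [] };
    return [ map { +{ url => $urls[$_], text => $texts[$_] } } 0 .. $#urls ];
}

my $pairs = pair_links($list);
print "$_->{text}: $_->{url}\n" for @$pairs;
print "next: $list->{next_page}\n" if defined $list->{next_page};
```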

parse($html_string)

Returns a hashref with the data extracted from the page according to the 'fields' section of the rules.

AUTHOR

    Dmitry Selverstov
    CPAN ID: JAREDSPB
    jaredspb@cpan.org

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.