NAME

WWW::Leech::Parser - HTML Page parser used by WWW::Leech::Walker

SYNOPSIS

use WWW::Leech::Parser;

my $parser = WWW::Leech::Parser->new({
  'item:link' => '//a[contains(@class,"item-link")]',
  'nextpage:link' => '//a[contains(@class,"next-page-link")]',
  'fields' => {
    'name' => '//h1',
    'images[]' => '//img/@src',
    'comments[]' =>{
      type => 'html',
      xpath => '//div[@class="comments"]/div',
      filter => sub{
        my $values = shift;
        my $field_defs = shift;

        # ....

        return $values;
      }

    }
    # ....
  }
});

my $html_string = '...';

my $links_and_next_page_url = $parser->parseList($html_string);

my $item = $parser->parse($html_string);

DESCRIPTION

WWW::Leech::Parser extracts information from a web page using the provided XPath expressions.

Its primary use is to get the links to 'sub-pages' and the link to the 'next page' from a links-list page (e.g. search engine results). It also extracts the required data from the given HTML using rules defined at object creation.

DETAILS

new($rules)

$rules is a hashref with the following keys:

item:link

XPath extracting links to sub-pages

nextpage:link

XPath extracting the link to the next links-list page

fields

Fields tell the parser how to extract data. They can be provided as an arrayref:

$fields = [
  {
    name => 'fieldname1',
    xpath => '//somenode'
  },
  {
    name => 'fieldname2',
    xpath => '//othernode'
  }
]

Or a hashref:

$fields = {
  fieldname1 => '//somenode',
  fieldname2 => {
    xpath => '//othernode'
  }
}
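
The two forms describe the same information. As a rough sketch of that equivalence, the hypothetical helper below (fields_as_list is not part of WWW::Leech::Parser, and nothing here is taken from the module's source) converts either form into a list of name/xpath hashes, expanding plain XPath strings along the way:

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: accepts either form of
# 'fields' and returns an arrayref of { name => ..., xpath => ... } hashes.
sub fields_as_list {
    my ($fields) = @_;

    # Arrayref form: each element already carries its own name.
    return [ map { +{ %$_ } } @$fields ] if ref $fields eq 'ARRAY';

    my @list;
    for my $name (sort keys %$fields) {
        my $def = $fields->{$name};
        # A plain string is an XPath expression; wrap it in a hashref.
        $def = { xpath => $def } unless ref $def eq 'HASH';
        push @list, { %$def, name => $name };
    }
    return \@list;
}

my $from_array = fields_as_list([
    { name => 'fieldname1', xpath => '//somenode' },
    { name => 'fieldname2', xpath => '//othernode' },
]);

my $from_hash = fields_as_list({
    fieldname1 => '//somenode',
    fieldname2 => { xpath => '//othernode' },
});

printf "%s => %s\n", $_->{name}, $_->{xpath} for @$from_hash;
```

The hashref keys are sorted only to make the output order predictable; whether the module itself preserves any particular field order is not stated here.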

By default the parser uses the text of the first matching node as the field value. Appending '[]' to the key name switches the parser to 'wantarray' mode, in which it returns an array of values.

Every element can be provided in a simple or a complex form.

The simple form is just a key-value pair where the key is the field name and the value is an XPath expression.

In the complex form, a hashref describing the field must be provided. The following keys are recognized:

xpath

Required.

XPath expression for element data.

type

Optional.

text - gets text content only (default)
html - extracts the whole node content, including the node itself, as is
int - not applicable in 'wantarray' mode - removes non-numeric characters from the text value
unique - only applicable in 'wantarray' mode - removes duplicates
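
To make the 'int' and 'unique' descriptions concrete, here is a minimal plain-Perl sketch of the transformations they describe. The function names to_int and unique_values are hypothetical, and the exact rules (every non-digit is dropped, first-seen order is kept) are assumptions rather than the module's documented behaviour:

```perl
use strict;
use warnings;

# Assumed reading of type 'int': drop every non-digit character.
sub to_int {
    my ($text) = @_;
    (my $digits = $text) =~ s/\D+//g;
    return $digits;
}

# Assumed reading of type 'unique': drop repeated values,
# keeping the first occurrence of each.
sub unique_values {
    my ($values) = @_;
    my %seen;
    return [ grep { !$seen{$_}++ } @$values ];
}

my $price  = to_int('Price: 1 299 USD');
my $images = unique_values([ '/a.png', '/b.png', '/a.png' ]);

print "$price\n";
print join(', ', @$images), "\n";
```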

filter

Optional.

Coderef. The parser runs the filter callback, passing it the extracted value and the field definitions. The field value is replaced with whatever the callback returns.
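
As an illustration, a filter for a '[]' field might trim whitespace and drop empty entries. The sketch below assumes that in 'wantarray' mode the extracted value arrives as an arrayref (as the $values name in the SYNOPSIS suggests); the second argument passed in the direct call is only a stand-in, since the exact shape of the field definitions is not described here:

```perl
use strict;
use warnings;

# Example filter: trims surrounding whitespace and drops empty strings.
my $trim_filter = sub {
    my ($values, $field_defs) = @_;   # same argument order as in the SYNOPSIS
    my @clean;
    for my $value (@$values) {
        (my $trimmed = $value) =~ s/^\s+|\s+$//g;
        push @clean, $trimmed if length $trimmed;
    }
    return \@clean;                   # becomes the new field value
};

# Field definition in complex form using the filter.
my $fields = {
    'comments[]' => {
        xpath  => '//div[@class="comments"]/div',
        filter => $trim_filter,
    },
};

# Calling the callback directly with sample data, outside the parser.
my $result = $fields->{'comments[]'}{filter}->(
    [ "  first \n", '', ' second' ],
    $fields,                          # stand-in for the field definitions
);
print join('|', @$result), "\n";
```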

parseList($html_string)

Returns list-page links as a hashref:

{
  links => [...], # array of URLs
  links_text => [...], # Text inside corresponding 'a' tags
  next_page => "/page/N" # next page URL
}
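
A short sketch of consuming such a result follows. The data is made up and built by hand rather than returned by parseList, and pairing the two arrays by index is an assumption based on the word 'corresponding' in the comment above:

```perl
use strict;
use warnings;

# Hand-written sample shaped like the hashref described above.
my $list = {
    links      => [ '/item/1', '/item/2' ],
    links_text => [ 'First item', 'Second item' ],
    next_page  => '/page/2',
};

# Pair each URL with the text of its 'a' tag by index.
my @items = map {
    +{ url => $list->{links}[$_], text => $list->{links_text}[$_] }
} 0 .. $#{ $list->{links} };

print "$_->{text}: $_->{url}\n" for @items;

# Guard against a missing next page before following it.
print "next: $list->{next_page}\n" if defined $list->{next_page};
```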

parse($html_string)

Returns a hashref with data extracted from the page using the 'fields' section of the rules.

AUTHOR

Dmitry Selverstov
CPAN ID: JAREDSPB
jaredspb@cpan.org

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.
