NAME

Text::Scraper - Structured data from (un)structured text

SYNOPSIS

    use Text::Scraper;

    use LWP::Simple;
    use Data::Dumper;

    my $tmpl = Text::Scraper->slurp(\*DATA);
    # LWP::Simple's get() returns undef on failure and does not set $!
    my $src  = get('http://search.cpan.org/recent') or die "Could not fetch page\n";
    
    my $obj  = Text::Scraper->new(tmpl => $tmpl);
    my $data = $obj->scrape($src);

    print "Newest Submission: ", $data->[0]{submissions}[0]{name},  "\n\n";
    print "Scraper model:\n",    Dumper($obj),                      "\n\n";
    print "Parsed  model:\n",    Dumper($data) ,                    "\n\n";

    __DATA__

    <div class=path><center><table><tr>
    <?tmpl stuff pre_nav ?>
    <td class=datecell><span><big><b> <?tmpl var date_string ?> </b></big></span></td>
    <?tmpl stuff post_nav ?>
    </tr></table></center></div>

    <ul>
    <?tmpl loop submissions ?>
     <li><a href="<?tmpl var link ?>"><?tmpl var name ?></a>
      <?tmpl if has_description ?>
      <small> -- <?tmpl var description ?></small>
      <?tmpl end has_description ?>
     </li>
    <?tmpl end submissions ?>
     </ul>

ABSTRACT

Text::Scraper provides a fully functional base class for quickly developing screen scrapers and other text extraction tools. Using templates, the programmer is freed from staring at fragile, heavily escaped regular expressions, mapping capture groups to named variables or wrestling with badly formed HTML. Machine-generated output, such as dynamic webpages, is trivially reverse engineered.

Text::Scraper's functionality overlaps some existing CPAN modules - Template::Extract and WWW::Scraper.

Text::Scraper is significantly more lightweight than either. It has no dependencies on other frameworks, modules or design-decisions and has a more general application domain than the latter. Text::Scraper already benchmarks around 250% faster than Template::Extract and uses significantly less memory.

Unlike both existing modules, Text::Scraper generalizes its functionality to allow the programmer to refine template capture groups beyond (.*?), fully redefine the template syntax and introduce new template constructs bound to custom classes.

BACKGROUND

Using templates is a popular method of separating visual presentation from programming logic - particularly popular in programs generating dynamic webpages. Text::Scraper reverses this process, using templates to extract the data back out of the surrounding presentation.

If you are familiar with templating concepts, then the SYNOPSIS should be sufficient to get you started. If not, I would recommend reading the documentation for HTML::Template - a module whose syntax and terminology are very similar to Text::Scraper's.

DESCRIPTION

Template Tags are classed as Leaves or Branches. As in XML, Branches must have an associated closing tag and Leaves must not. The default syntax is based on XML processing-instruction syntax:

    <?tmpl TYPE NAME [ATTRIBUTES] ?>
    

and for Branches:

    <?tmpl TYPE NAME [ATTRIBUTES] ?>  
        ...  
    <?tmpl end NAME ?>    

By default, Tags must be named and any closing tag must include the name of the opening tag it is closing. Attributes have the same syntax as XML attributes - but (similar to Perl regular expressions) can use any non-bracket punctuation character as quotation delimiters:

    <?tmpl var foo bar="baz" blah=/But dont "quote" me on that!/ ?> 

The only attribute acted on by the default tag classes is regex - used to refine how the Tag is translated into a regular-expression capture group:

    <?tmpl var naive_email_address  regex="([\w\d\.]+\@[\w\d\.]+?)"  ?>

This can be used to further filter the parsed data - similar to using grep:

    <?tmpl var foocom_email_address regex="([\w\d\.]+@(?:foo\.com))" ?>

Each tag should create only one capture group - but it is fine to make the outer group non-capturing:

    <?tmpl var date_just_month regex="(?:\d+ (\S+) \d+)" ?>

The above would capture only the month field in dates formatted as 02 July 1979.
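To see what such a regex attribute captures, the patterns can be tried directly in Perl, outside of any template. The sketch below exercises only the regular expressions themselves; the sample strings are made up for illustration, and it does not show how Text::Scraper embeds the patterns into a compiled template.

```perl
use strict;
use warnings;

# The pattern from the date_just_month example: the outer group is
# non-capturing, so the month is the only capture group.
my $date_re = qr/(?:\d+ (\S+) \d+)/;
my ($month) = '02 July 1979' =~ $date_re;
print "month: $month\n";

# The pattern from the foocom_email_address example (with the @ escaped
# for use in Perl source). Only addresses at foo.com match, which is
# what makes the attribute behave like a filter.
my $foo_re = qr/([\w\d\.]+\@(?:foo\.com))/;

# Hypothetical input strings, for illustration only.
my @candidates = ('alice@foo.com', 'bob@bar.org');
my @matched    = map { /$foo_re/ ? $1 : () } @candidates;
print "kept: $_\n" for @matched;
```

Trying a pattern in isolation like this is a quick way to confirm that it contains exactly one capture group before placing it in a template.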

Default Tags

The default tags provided by Text::Scraper are typical for basic scraping but can be subclassed for additional functionality. By default, Leaf nodes return Scalars and Branch nodes return Arrays of Hashes - each element mapping to a matching sub-sequence. Blessing or filtering this data is left as an exercise for subclasses.

All the default tags are demonstrated in the SYNOPSIS:

var

Vars represent strings of text in a template. They are instances of Text::Scraper::Leaf.

stuff

Stuff tags represent spans of text that are of no interest in the extracted data, but can ease parsing in certain situations. They are instances of Text::Scraper::Ignorable - a subclass of Text::Scraper::Leaf.

loop

Loops represent repeated information in a template and are extracted as an array of hashes. They are instances of Text::Scraper::Branch.

if

A conditional region in the template. If the region is not present, the parent scope will contain a false value under the tag's name. Otherwise the value will be true, and any tags inside the if's scope will also be exported to its parent scope.

These are instances of Text::Scraper::Conditional.
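The data shape described above can be walked with ordinary Perl data-structure code. The sketch below uses a hand-built structure standing in for the return value of scrape(). The values are invented for illustration, and the layout (an array of hashes at the top level, an array of hashes per loop, and a has_description flag with description exported into the enclosing hash) is an assumption based on the SYNOPSIS and the tag descriptions, not captured output.

```perl
use strict;
use warnings;

# Hypothetical stand-in for: my $data = $obj->scrape($src);
my $data = [
    {
        date_string => '2nd July 2005',
        submissions => [
            {
                name            => 'Foo-Bar-0.01',
                link            => '/~someone/Foo-Bar-0.01/',
                has_description => 1,
                description     => 'An example distribution',
            },
            {
                name            => 'Baz-1.2',
                link            => '/~someone/Baz-1.2/',
                has_description => 0,
            },
        ],
    },
];

my @lines;
for my $page (@$data) {
    for my $sub (@{ $page->{submissions} }) {
        # Only read description when the if tag matched.
        my $line = $sub->{name};
        $line .= " -- $sub->{description}" if $sub->{has_description};
        push @lines, $line;
    }
}
print "$_\n" for @lines;
```

Checking the conditional's flag before reading the variables inside it avoids touching keys that may be absent when the region did not match.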

Text::Scraper API

These methods are sufficient for a basic scraping session:

Text::Scraper->slurp( STRING|GLOBREF )

Static utility method that reads a file, given either as a filename or as a filehandle, and returns its contents as a string.

Text::Scraper->new(HASH)

Returns a new Text::Scraper object. Optional parameters are:

tmpl SCALAR

A template as a scalar string

syntax SCALAR

A Text::Scraper::Syntax instance. See "Defining a custom syntax".

$obj->compile(STRING)

Only required for recompilation or if no tmpl parameter is passed to the constructor.

$obj->scrape(STRING)

Extract data from STRING based on template.

Subclass API

In addition to inheriting the above methods, certain hooks are available to subclasses:

$subclass->on_create()

General construction callback. Text::Scraper objects are prototype based so overriding the constructor is not recommended. Objects are hash based and any constructor arguments become attributes of the new instance before invoking this method.

$subclass->on_data(SCALAR)

This is the subclass's opportunity to bless or otherwise process any parsed data.

Because Text::Scraper objects are prototype based, a subclass can inherit the scraping logic and also encapsulate any particular instance of the scraped data. During compilation, an instance of each tag type is created as the prototype object. Its attributes will be related to the tag, its sub-template and any user-supplied tag attributes. During scraping, each prototype node is invoked to scrape its own portion of the input text against its sub-template. The return value from on_data is added to the generated output data-structure. By default these values are just returned unblessed.

Prototypes can easily bless captured data into the same class, for example:

Text::Scraper::Leaf

SCALAR is the captured text.

    sub on_data
    {
        my ($self, $match) = @_;
        return $self->new(value => $match);
    }

Text::Scraper::Branch

SCALAR will be a reference to an array of hashes.

    sub on_data
    {
        my ($self, $matches) = @_;
        @$matches = map {  $self->new(%$_)  } @$matches;
        return $matches;
    }

$subclass->to_regex()

Returns this node's representation as a regular expression, to be used in a compiled template. If you find yourself using a particular regex attribute a lot, it will be easier to define a custom tag that overrides this method.

$subclass->ignore()

Returns a boolean value stating whether the parser should ignore the data captured by this object.

$subclass->proto()

$subclass->proto(SCALAR)

Utility method to allow Tag instances to access (attributes of) their prototype. This can be safely called from a prototype object, which just points to itself. By default, data instances are not blessed and cannot use this method.

$subclass->nodes()

Returns instance data in-order, including any present conditional data.

Defining a custom syntax

The two areas of customization are Tag Syntax and Tag Classes. The defaults are encapsulated in the Text::Scraper::Syntax class.

The interested reader is encouraged to copy the source of the default syntax class and play around with changes. All the overridable methods begin with define_* and are fairly self-explanatory or well commented.

Any new Tag classes should be subclassed from one of Text::Scraper::Leaf, Text::Scraper::Branch, Text::Scraper::Ignorable or Text::Scraper::Conditional.

BUGS & CAVEATS

Rather than write a slow parser in pure Perl, Text::Scraper farms a lot of the work out to Perl's optimized regular-expression engine. This works well in general but, unfortunately, doesn't allow for a lot of error feedback during scraping. A fair understanding of the pros and cons of using regular expressions in this manner can be beneficial, but is outside the scope of this documentation.
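The lack of feedback is easy to reproduce with a plain regular expression. The sketch below hand-writes a pattern roughly analogous to a small template with two var tags. It illustrates the general technique only and is not the expression Text::Scraper actually generates; the input strings are invented. When the input deviates slightly from the pattern, the match simply fails without indicating where.

```perl
use strict;
use warnings;

# Hand-written analogue of a template such as:
#   <li><a href="<?tmpl var link ?>"><?tmpl var name ?></a>
# with each var rendered as a non-greedy capture group.
my $re = qr{<li><a href="(.*?)">(.*?)</a>}s;

my $good = '<li><a href="/dist/Foo">Foo</a>';
my $bad  = "<li><a href='/dist/Foo'>Foo</a>";   # single quotes, not double

# In list context a successful match returns the captures,
# and a failed match returns an empty list.
my @hit  = $good =~ $re;
my @miss = $bad  =~ $re;

print "good: link=$hit[0] name=$hit[1]\n";
print "bad:  ", (@miss ? "matched" : "no match, and no hint why"), "\n";
```

When a scrape returns less than expected, comparing the template against the raw input character by character around the first missing field is often the quickest way to find this kind of mismatch.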

Data::Dumper can be indispensable in following the success of your scraping. It can be safely applied to a Text::Scraper instance to analyze the parser's object model, or to the return value from a scrape() invocation to analyze what was parsed.

Bug reports and suggestions welcome.

AUTHOR

Copyright (C) 2005 Chris McEwan - All rights reserved.

Chris McEwan <mcewan@cpan.org>

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.