NAME

Scrappy - All Powerful Web Spidering, Scraping, Crawling Framework

VERSION

version 0.9111110

SYNOPSIS

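    #!/usr/bin/perl
    use Scrappy;

    my $scraper = Scrappy->new;

    # crawl the site and print the link target of every match on /recent
    $scraper->crawl('search.cpan.org',
        '/recent' => {
            '#cpansearch li a' => sub {
                # $_[1] is the matched element, passed as a hash reference
                print $_[1]->{href}, "\n";
            }
        }
    );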

DESCRIPTION

Scrappy is an easy (and hopefully fun) way of scraping, spidering, and/or harvesting information from web pages, web services, and more. Scrappy is a feature-rich, flexible, intelligent web automation tool.

Scrappy (pronounced Scrap+Pee) stands for 'Scraper Happy' or 'Happy Scraper'. If you like, you may call it Scrapy (pronounced Scrape+Pee), although Python has a web scraping framework by that name, and this module is not a port of it.

METHODS

crawl

The crawl method is useful for crawling all or part of a website. It automates the tasks of creating a queue, fetching and parsing HTML pages, and establishing simple flow control. See the SYNOPSIS for a simplified example; the following is a more complex example.

    my $scrappy = Scrappy->new;
    
    $scrappy->crawl('http://search.cpan.org/recent',
        '/recent' => {
            
            '#cpansearch li a' => sub {
                my ($self, $item) = @_;
                # follow all recent modules from search.cpan.org
                $self->queue->add($item->{href});
            }
            
        },
        '/~:author/:name-:version/' => {
            
            'body' => sub {
                my ($self, $item, $args) = @_;
                
                # the review link may be missing, so fall back to '0 reviews'
                my $reviews = $self
                    ->select('.box table tr')->focus(3)->select('td.cell small a')
                    ->data->[0]->{text};
                
                $reviews = defined $reviews && $reviews =~ /\d+ Reviews/
                    ? $reviews : '0 reviews';
                
                print "found $args->{name} version $args->{version} ".
                    "[$reviews] by $args->{author}\n";
                
            }
            
        }
    );
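
In the example above, the placeholders in '/~:author/:name-:version/' surface in the callback as $args->{author}, $args->{name}, and $args->{version}. The following is a minimal sketch of how such a pattern could be turned into named captures. It is not Scrappy's own implementation: compile_route and match_route are hypothetical helpers written only for illustration, and they assume a placeholder never matches across a '/'.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Scrappy): compile a route pattern such as
# '/~:author/:name-:version/' into a regex with one named capture per
# :placeholder.
sub compile_route {
    my ($pattern) = @_;
    my $regex = '';

    # split on :placeholders, keeping the placeholders themselves in the list
    for my $part (split /(:\w+)/, $pattern) {
        if ($part =~ /^:(\w+)$/) {
            # assumption: a placeholder matches anything except a slash
            $regex .= "(?<$1>[^/]+)";
        }
        else {
            # literal text is escaped so it matches itself
            $regex .= quotemeta $part;
        }
    }

    return qr/\A$regex\z/;
}

# Hypothetical helper: return a hashref of placeholder values, or undef when
# the path does not match the pattern.
sub match_route {
    my ($pattern, $path) = @_;
    return unless $path =~ compile_route($pattern);

    my %args = %+;    # copy the named captures before they go out of scope
    return \%args;
}

my $args = match_route('/~:author/:name-:version/',
    '/~awncorp/Scrappy-0.9111110/');

print "found $args->{name} version $args->{version} by $args->{author}\n"
    if $args;
```

Because each placeholder is greedy, a distribution name that itself contains hyphens (for example Foo-Bar-1.0) would put everything before the last hyphen in name and the remainder in version under this sketch.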

AUTHOR

Al Newkirk <awncorp@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2010 by awncorp.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.