TaskPipe::Task_Scrape - Base TaskPipe class for scraping a webpage


This is the standard building block for creating a webpage-scraping task. To create one, inherit from TaskPipe::Task_Scrape using the following package format:

    package TaskPipe::Task_Scrape_MyScraper;

    use Moose;
    use Web::Scraper;
    extends 'TaskPipe::Task_Scrape';

    has test_pinterp => (is => 'ro', isa => 'ArrayRef[HashRef]', default => sub{[
        {
            url => '',
            headers => {
                Referer => ''
            }
        }
    ]});

    has ws => (is => 'ro', isa => 'Web::Scraper', default => sub{
        scraper {
            process 'div.some-class', 'results' => 'TEXT';
            result 'results';
        };
    });

    sub post_process {  # may or may not be necessary, depending
                        # on what is returned by ws
        my ($self,$results) = @_;

        # do something with the results returned from the web scraper

        return $results;
    }

test_pinterp allows you to specify test data which you can run the task against by typing

    taskpipe test task --name=Scrape_MyScraper

at the command line.

It is assumed you want to use a Web::Scraper to scrape your page. If this is the case, just define a ws attribute as above. See the Web::Scraper manpage for more information on how to define a Web::Scraper.
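As a rough illustration of a ws definition that yields a list of records, the sketch below nests one scraper inside another so that each matching element becomes a hashref. The HTML, the h2 and span.price selectors, and the title and price keys are all hypothetical. The scraper is run directly with scrape() here for demonstration only; how TaskPipe itself invokes ws is not shown.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup: each div.some-class holds one record.
my $html = <<'HTML';
<html><body>
  <div class="some-class"><h2>First item</h2><span class="price">10</span></div>
  <div class="some-class"><h2>Second item</h2><span class="price">20</span></div>
</body></html>
HTML

my $ws = scraper {
    # 'results[]' collects one entry per matching div; the nested
    # scraper turns each div into a hashref of its own fields.
    process 'div.some-class', 'results[]' => scraper {
        process 'h2',         title => 'TEXT';
        process 'span.price', price => 'TEXT';
    };
    result 'results';
};

# scrape() accepts a reference to an HTML string as well as a URI.
my $results = $ws->scrape(\$html);

printf "%s: %s\n", $_->{title}, $_->{price} for @$results;
```

Inside a task, only the scraper { ... } part would go in the default sub of the ws attribute.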

Your task needs to return an arrayref of results, each result being a hashref. Ideally ws returns this directly, but it is not always possible to persuade your Web::Scraper to return results in this format. To make format corrections (for example, removing records from the data), you can include a post_process subroutine. post_process receives the output from ws. Do what is needed, and make sure you return your results arrayref at the end.
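The following is a minimal sketch of what a post_process body might look like, assuming ws returned either a single hashref or an arrayref of hashrefs. It is written as a standalone sub so it runs without TaskPipe installed; in a real task it would be a method in your TaskPipe::Task_Scrape_* package. The title and price keys are hypothetical.

```perl
use strict;
use warnings;

sub post_process {
    my ($self, $results) = @_;

    # Normalise undef or a single hashref into an arrayref.
    $results = [] unless defined $results;
    $results = [$results] if ref($results) eq 'HASH';

    my @clean;
    for my $rec (@$results) {
        # Drop records that have no usable title.
        next unless defined $rec->{title} && $rec->{title} =~ /\S/;

        # Copy the record and trim whitespace around each defined value.
        my %copy = %$rec;
        for my $key (keys %copy) {
            next unless defined $copy{$key};
            $copy{$key} =~ s/^\s+|\s+$//g;
        }
        push @clean, \%copy;
    }

    # Always hand back an arrayref of hashrefs.
    return \@clean;
}

# Example call with hypothetical data; $self is unused in this sketch.
my $out = post_process(undef, [
    { title => '  First item ', price => '10' },
    { title => '',              price => '20' },
]);
print scalar(@$out), " record(s) kept\n";
```

Copying each record rather than editing it in place leaves the original ws output untouched, which can make debugging easier.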


Tom Gracey


Copyright (c) Tom Gracey 2018

TaskPipe is free software, licensed under

    The GNU General Public License Version 3