NAME

WWW::Mechanize - automate interaction with websites

SYNOPSIS

This module is intended to help you automate interaction with a website.

    use WWW::Mechanize;
    my $agent = WWW::Mechanize->new();

    $agent->get($url);

    $agent->follow($link);

    $agent->form_number($number);
    $agent->form_name($name);
    $agent->field($name, $value);
    $agent->click($button);

    $agent->back();

    $agent->add_header($name => $value);

    use Test::More;
    like( $agent->content, qr/$expected/, "Got expected content" );

VERSION

Version 0.39

    $Header: /home/cvs/www-mechanize/lib/WWW/Mechanize.pm,v 1.49 2003/04/02 05:26:13 alester Exp $

METHODS

new()

Creates and returns a new WWW::Mechanize object, hereafter referred to as the 'agent'.

    my $agent = WWW::Mechanize->new();

The constructor for WWW::Mechanize overrides two of the parameters to the LWP::UserAgent constructor:

    agent => "WWW-Mechanize/#.##"
    cookie_jar => {}    # an empty, memory-only HTTP::Cookies object

You can override these overrides by passing parameters to the constructor, as in:

    my $agent = WWW::Mechanize->new( agent=>"wonderbot 1.01" );

If you want none of the overhead of a cookie jar, or don't want your bot accepting cookies, you have to explicitly disallow it, like so:

    my $agent = WWW::Mechanize->new( cookie_jar => undef );

$agent->get($url)

Given a URL/URI, fetches it. Returns an HTTP::Response object.

The results are stored internally in the agent object, but you don't know that. Just use the accessors listed below. Poking at the internals is deprecated and subject to change in the future.

$agent->uri()

Returns the current URI.

$agent->req()

Returns the current request as an HTTP::Request object.

$agent->res()

Returns the current response as an HTTP::Response object.

$agent->status()

Returns the HTTP status code of the response.

$agent->ct()

Returns the content type of the response.

$agent->base()

Returns the base URI for the current response.

$agent->content()

Returns the content for the response.

$agent->forms()

Returns a reference to an array of HTML::Form objects for the forms found.

$agent->current_form()

Returns the current form as an HTML::Form object. I'd call this form() except that form() already exists and sets the current_form.

$agent->links()

Returns an arrayref of the links found.

$agent->is_html()

Returns true/false on whether our content is HTML, according to the HTTP headers.

$agent->title()

Returns the contents of the <TITLE> tag, as parsed by HTML::HeadParser. Returns undef if the content is not HTML.
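
As an illustration of how these accessors fit together, here is a minimal sketch that fetches each URL given on the command line and prints a one-line summary. The summarize() helper and its output format are made up for this example; only get(), uri(), status(), ct(), is_html() and title() come from the documentation above.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper (not part of WWW::Mechanize): fetch a URL and
# describe the response using only the documented accessors.
sub summarize {
    my ($agent, $url) = @_;

    $agent->get($url);

    my $summary = sprintf "%s: status %s, type %s",
        $agent->uri, $agent->status, $agent->ct;

    # title() returns undef for non-HTML content, so check before using it.
    if ( $agent->is_html ) {
        my $title = $agent->title;
        $summary .= ", title $title" if defined $title;
    }
    return $summary;
}

my $agent = WWW::Mechanize->new();
print summarize($agent, $_), "\n" for @ARGV;
```

Nothing here reaches into the agent object itself, which is the point of the accessors.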

Action methods

$agent->follow($string|$num)

Follow a link. If you provide a string, the first link whose text matches that string will be followed. If you provide a number, it will be the nth link on the page.

Returns true if the link was found on the page or undef otherwise.

$agent->quiet(true/false)

Allows you to suppress warnings to the screen.

    $agent->quiet(0); # turns on warnings (the default)
    $agent->quiet(1); # turns off warnings
    $agent->quiet();  # returns the current quietness status

$agent->form($number|$name)

Selects a form by number or name, depending on whether it gets passed an all-numeric string or not. If you have a form with a name that is all digits, you'll need to call $agent->form_name() explicitly.
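
The all-digits rule can be sketched in a few lines of plain Perl. The selector_for() helper below is hypothetical and is not how the module is necessarily written; it only reports which of the two documented methods a given argument would be routed to under the rule described above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WWW::Mechanize: name the method that
# form() would delegate to under the "all-numeric string" rule.
sub selector_for {
    my ($arg) = @_;
    return $arg =~ /\A[0-9]+\z/ ? 'form_number' : 'form_name';
}

for my $arg ( 1, 'login', '2003', 'form2' ) {
    printf "form(%s) behaves like %s(%s)\n", $arg, selector_for($arg), $arg;
}
```

Under this rule a form that is literally named "2003" is treated as form number 2003, which is why such a form has to be selected with form_name('2003').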

$agent->form_number($number)

Selects the Nth form on the page as the target for subsequent calls to field() and click(). Emits a warning and returns false if there is no such form. Forms are indexed from 1, that is to say, the first form is number 1 (not zero).

$agent->form_name($name)

Selects a form by name. If there is more than one form on the page with that name, then the first one is used, and a warning is generated.

Note that this functionality requires libwww-perl 5.69 or higher.

$agent->field($name, $value, $number)

Given the name of a field, set its value to the value specified. This applies to the current form (as set by the form() method or defaulting to the first form on the page).

The optional $number parameter is used to distinguish between two fields with the same name. The fields are numbered from 1.

$agent->click($button, $x, $y)

Has the effect of clicking a button on a form. The first argument is the name of the button to be clicked. The second and third arguments (optional) allow you to specify the (x,y) coordinates of the click.

If there is only one button on the form, $agent->click() with no arguments simply clicks that one button.

Returns an HTTP::Response object.

$agent->submit()

Submits the page, without specifying a button to click. Actually, no button is clicked at all.

This used to be a synonym for $agent->click("submit"), but is no longer so.

$agent->back()

The equivalent of hitting the "back" button in a browser. Returns to the previous page. Won't go back past the first page. (Really, what would it do if it could?)

$agent->add_header(name => $value)

Sets a header for the WWW::Mechanize agent to use every time it gets a webpage. This is NOT stored in the agent object (because if it were, it would disappear if you went back() past where you'd set it) but in the hash variable %WWW::Mechanize::headers, which is a hash of all headers to be set. You can manipulate this directly if you want to; the add_header() method is just provided as a convenience function for the most common case of adding a header.
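
A short sketch of both routes, assuming the version documented here, where the headers live in %WWW::Mechanize::headers (other releases may store them differently). The header names and values are only illustrative.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $agent = WWW::Mechanize->new();

# The documented convenience method for the common case of adding a header.
$agent->add_header( 'Accept-Language' => 'en' );

# Direct manipulation of the package-level hash, which the text above permits.
$WWW::Mechanize::headers{'Referer'} = 'http://www.example.com/';
delete $WWW::Mechanize::headers{'Accept-Language'};

# List whichever header names are currently set in the hash.
print join( ", ", sort keys %WWW::Mechanize::headers ), "\n";
```

Direct access is the only way shown here to remove a header again, since add_header() only adds.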

extract_links()

Extracts HREF links from the content of a webpage.

The return value is a reference to an array containing an array reference for every <A> and <FRAME> tag in $self->{content}.

The array elements for the <A> tag are:

[0]: contents of the href attribute
[1]: text enclosed by the <A> tag
[2]: the contents of the name attribute

The array elements for the <FRAME> tag are:

[0]: contents of the src attribute
[1]: text enclosed by the <FRAME> tag
[2]: contents of the name attribute
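
To make the shape of that return value concrete, here is a sketch that walks a hand-written structure of the same form. The sample data and the absolute_urls() helper are invented for illustration; real values come from the page being parsed.

```perl
use strict;
use warnings;

# Invented sample data in the documented shape: [ href or src, text, name ]
my $links = [
    [ 'http://example.com/',     'Home',  'top'   ],
    [ '/about.html',             'About', 'about' ],
    [ 'http://example.com/menu', '',      'nav'   ],
];

# Hypothetical helper: return element [0] of every entry that is an
# absolute http URL.
sub absolute_urls {
    my ($links) = @_;
    return grep { m{^http://} } map { $_->[0] } @$links;
}

print "$_\n" for absolute_urls($links);
```

The same map-then-grep idiom over element [0] appears in the get-despair example later in this document.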

INTERNAL METHODS

These methods are only used internally. You probably don't need to know about them.

_push_page_stack() / _pop_page_stack()

The agent keeps a stack of visited pages, which it can pop when it needs to go BACK and so on.

The current page needs to be pushed onto the stack before we get a new page, and the stack needs to be popped when BACK occurs.

Neither of these takes any arguments; they just operate on the $agent object.
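
The push-before-fetch, pop-on-back discipline can be sketched without the module at all. Everything below (the %agent hash, visit() and go_back()) is a hypothetical stand-in and not the module's own code; it only mirrors the behaviour described above, including refusing to go back past the first page.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the agent: the current page plus a stack of
# the pages visited before it.
my %agent = ( current => undef, stack => [] );

# Push the current page (if any) before moving on to a new one.
sub visit {
    my ($page) = @_;
    push @{ $agent{stack} }, $agent{current} if defined $agent{current};
    $agent{current} = $page;
    return;
}

# Pop the previous page; do nothing when already on the first page.
sub go_back {
    return unless @{ $agent{stack} };
    $agent{current} = pop @{ $agent{stack} };
    return 1;
}

visit('index.html');
visit('about.html');
go_back();
print "now at $agent{current}\n";
```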

_do_request()

Performs a request on the $self->{req} request object, and sets a bunch of attributes on $self.

Returns an HTTP::Response object.

EXAMPLES

Following are user-supplied samples of WWW::Mechanize in action. If you have samples you'd like to contribute, please send 'em.

You can also look at the t/*.t files in the distribution.

Please note that these examples are not intended to do any specific task. For all I know, they're no longer functional because the sites they hit have changed. They're here to give examples of how people have used WWW::Mechanize.

get-despair, by Randal Schwartz

Randal submitted this bot that walks the despair.com site sucking down all the pictures.

    use strict;
    $|++;

    use WWW::Mechanize;
    use File::Basename;

    my $m = WWW::Mechanize->new;

    $m->get("http://www.despair.com/indem.html");

    my @top_links = @{$m->links};

    for my $top_link_num (0..$#top_links) {
        next unless $top_links[$top_link_num][0] =~ /^http:/;

        $m->follow($top_link_num) or die "can't follow $top_link_num";

        print $m->uri, "\n";
        for my $image (grep m{^http://store4}, map $_->[0], @{$m->links}) {
            my $local = basename $image;
            print " $image...", $m->mirror($image, $local)->message, "\n"
        }

        $m->back or die "can't go back";
    }

Hacking Movable Type, by Dan Rinzel

    use strict;
    use warnings;
    use WWW::Mechanize;

    # a tool to automatically post entries to a Movable Type weblog, and set arbitrary creation dates

    my $mech = WWW::Mechanize->new();
    my %entry = (
        title => "Test AutoEntry Title",
        btext => "Test AutoEntry Body",
        date  => '2002-04-15 14:18:00',
    );
    my $start = qq|http://my.blog.site/mt.cgi|;

    $mech->get($start);
    $mech->field('username','und3f1n3d');
    $mech->field('password','obscur3d');
    $mech->submit(); # to get login cookie
    $mech->get(qq|$start?__mode=view&_type=entry&blog_id=1|);
    $mech->form('entry_form');
    $mech->field('title',$entry{title});
    $mech->field('category_id',1); # adjust as needed
    $mech->field('text',$entry{btext});
    $mech->field('status',2); # publish, or 1 = draft
    my $results = $mech->submit();

    # if we're ok with this entry being datestamped "NOW" (no {date} in %entry)
    # we're done. Otherwise, time to be tricksy
    # MT returns a 302 redirect from this form. the redirect itself contains a <body onload=""> handler
    # which takes the user to an editable version of the form where the create date can be edited
    # MT date format of YYYY-MM-DD HH:MI:SS is the only one that won't error out

    if ($entry{date} && $entry{date} =~ /^\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}/) {
        # travel the redirect
        $results = $mech->get($results->header('Location'));
        # pull the quoted path out of the onLoad handler, stopping if either match fails
        my ($js) = $results->content =~ /<body onLoad="([^"]+)"/is
            or die "no onLoad handler found";
        my ($path) = $js =~ /'([^']+)'/
            or die "no quoted path in onLoad handler";
        $results = $mech->get($start.$path);
        $mech->form('entry_form');
        $mech->field('created_on_manual',$entry{date});
        $mech->submit();
    }

REQUESTS & BUGS

Please report any requests, suggestions or (gasp!) bugs via the system at http://rt.cpan.org/, or email to bug-WWW-Mechanize@rt.cpan.org. This makes it much easier for me to track things.

AUTHOR

Copyright 2002 Andy Lester <andy@petdance.com>

Released under the Artistic License. Based on Kirrily Robert's excellent WWW::Automate package.