NAME

WWW::Sitebase::Navigator - Base class for modules that navigate web sites

VERSION

Version 0.11

SYNOPSIS

This module is a base class for modules that navigate web sites like Myspace or Bebo. It provides basic methods like get_page and submit_form that are more robust than their counterparts in WWW::Mechanize. It also provides some core methods like "site_login". If you subclass this module and override the "site_info" method, you'll have a module that can log into your web site. Ta Da.

Note that this module is a subclass of "Spiffy" using "use Spiffy -Base". Run perldoc Spiffy for more info, or look it up on CPAN. Most importantly, this means that we use Spiffy's "field" method to create accessor methods, that you don't need to include "my $self = shift" in your methods, and that you can use "super" to call the base class's version of an overridden method.

    use WWW::Sitebase::Navigator -Base;

    field site_info => {
        home_page => 'http://www.myspace.com', # URL of site's homepage
        account_field => 'email', # Fieldname from the login form
        password_field => 'password', # Password fieldname
        cache_dir => '.www-MYSITE',
        login_form_name => 'login', # The name of the login form.  OR
        login_form_no => 1, # The number of the login form (defaults to 1).
                            # 1 is the first form on the page.
        login_verify_re => 'Welcome.*view my profile', # (optional)
            # Non-case-sensitive RE we should see once we're logged in
        not_logged_in_re => '<title>Sign In<\/title>',
            # If we log in and it fails (bad password, account suddenly
            # gets logged out), the page will have this RE on it.
            # Case insensitive.
        home_uri_re => '\?fuseaction=user&',
            # _go_home uses this and the next two items to load
            # the home page.  You can provide these options or
            # just override the method.
            # First, this is matched against the current URL to see if we're
            # already on the home page.
        home_link_re => 'fuseaction=user',
            # If we're not on the home page, this RE is 
            # used to find a link to the "Home" button on the current
            # page.
        home_url => 'http://www.myspace.com?fuseaction=user',
            # If the "Home" button link isn't found, this URL is
            # retrieved.
        error_regexs => [
            'An unexpected error has occurred',
            'Site is temporarily down',
            'We hired monkeys to program our site, please wait '.
                'while they throw bananas at each other.'
        ],
            # error_regexs is optional.  If the site you're navigating
            # displays error pages that do not return proper HTTP Status
            # codes (i.e. returns a 200 but displays an error), you can enter
            # REs here and any page that matches will be retried.
            # This is meant for IIS and ColdFusion-based sites that
            # periodically spew error messages that go away when tried again.
    };

IMPORTANT: If the site your module navigates uses ANY SSL, you'll need to add "Crypt::SSLeay" or "IO::Socket::SSL" to your list of prerequisite modules. Otherwise your methods will die if they hit an SSL-encrypted page. WWW::Sitebase::Navigator doesn't list either as a prerequisite itself, to avoid unnecessary overhead for sites that don't need SSL.

OPTIONS

default_options

Override this method to allow additional options to be passed to "new". You should also provide accessor methods for them. These are parsed by Params::Validate. In brief, setting an option to "0" means it's optional, "1" means it's required. See Params::Validate for more info.

    sub default_options {
        $self->{default_options}={
            account_name => 0,
            password => 0,
            cache_dir => 0,  # Default set by site_info field method
            cache_file => 0, # Default set by field method below
            auto_login => 0, # Default set by field method below
            human => 0,      # Default set by field method below
            config_file => 0
        };

        return $self->{default_options};
    }

    # So to add a "questions" option that's mandatory:

    sub default_options {
        super;
        $self->{default_options}->{questions}=1;
        return $self->{default_options};
    }
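
As a rough illustration of the 0/1 convention, here's a sketch that checks a set of arguments against such a spec in plain Perl. It's a simplified stand-in, not Params::Validate itself: the check_options helper, the sample values, and the choice to reject unlisted options are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the spec returned by default_options:
# 0 = optional, 1 = required.
my %spec = (
    account_name => 0,
    password     => 0,
    questions    => 1,    # the mandatory option added by the subclass
);

# Simplified check (NOT Params::Validate): returns a list of problems.
sub check_options {
    my ( $spec, %args ) = @_;
    my @problems;

    # Every option marked 1 must be present.
    for my $name ( sort keys %$spec ) {
        push @problems, "missing required option '$name'"
            if $spec->{$name} && !exists $args{$name};
    }

    # Options that are not in the spec are rejected (an assumption here).
    for my $name ( sort keys %args ) {
        push @problems, "unknown option '$name'"
            unless exists $spec->{$name};
    }

    return @problems;
}

my @ok  = check_options( \%spec, questions => 3, password => 'secret' );
my @bad = check_options( \%spec, password => 'secret', shoe_size => 9 );

print "ok: ", scalar(@ok), " problem(s)\n";
print "bad: $_\n" for @bad;
```

In a real subclass you'd leave the checking to Params::Validate and only adjust the hash returned by default_options.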

positional_parameters

You can also allow your users to provide information to the "new" method via positional parameters. If the first argument passed to "new" is not a known valid option, positional parameters are used instead.

These default to:

 const positional_parameters => [ 'account_name', 'password' ];

You can override this method to provide your own list if you like:

 const positional_parameters => [ 'account_name', 'password', 'shoe_size' ];
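
To make the rule above concrete, here's a minimal sketch of how an argument list could be turned into named options. The parse_args helper and its argument layout are hypothetical and written for this example only; the real parsing is done by the base class and may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the argument list given to "new" into
# named options. $known is a hashref of valid option names and
# $positional is the list of positional parameter names.
sub parse_args {
    my ( $known, $positional, @args ) = @_;

    # A single hashref is taken as the options themselves.
    return %{ $args[0] } if @args == 1 && ref $args[0] eq 'HASH';

    # First argument is a known option name: treat the list as a hash.
    return @args
        if @args && defined $args[0] && !ref $args[0]
        && exists $known->{ $args[0] };

    # Otherwise pair each argument with the positional name at the
    # same index; extra arguments are ignored in this sketch.
    my %options;
    for my $i ( 0 .. $#args ) {
        last if $i > $#$positional;
        $options{ $positional->[$i] } = $args[$i];
    }
    return %options;
}

my %known      = map { $_ => 1 } qw( account_name password cache_dir auto_login );
my @positional = ( 'account_name', 'password' );

my %by_position = parse_args( \%known, \@positional, 'my@email.com', 'mypass' );
my %by_name     = parse_args( \%known, \@positional,
    account_name => 'my@email.com', auto_login => 0 );

print "positional: $by_position{account_name}\n";
print "named:      $by_name{account_name}\n";
```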

OPTION ACCESSORS

These methods can be used to set/retrieve the respective option's value. They're also up top here to document the option, which can be passed directly to the "new" method.

account_name

Sets or returns the account name (email address) under which you're logged in. Note that the account name is retrieved from the user or from your program depending on how you called the "new" method. You'll probably only use this accessor method to get account_name.

EXAMPLE

The following would prompt the user for their login information, then print out the account name:

    use WWW::Bebo;
    my $site = new WWW::Bebo;
    
    print $site->account_name;

    $site->account_name( 'other_account@bebo.com' );
    $site->password( 'other_accounts_password' );
    $site->site_login;

WARNING: If you do change account_name, make sure you change password and call site_login. Changing account_name doesn't (currently) log you out, nor does it clear "password". If you change this and don't log in under the new account, it'll just have the wrong value, which will probably be ignored, but who knows.

password

Sets or returns the password you used, or will use, to log in. See the warning under "account_name" above - same applies here.

cache_dir

WWW::Sitebase::Navigator stores the last account/password used in a cache file, as a convenience for users who type it in at the prompt. Other modules store other cache data as well.

cache_dir sets or returns the directory in which we should store cache data. Defaults to $self->site_info->{cache_dir}.

If using this from a CGI script, you will need to provide the account and password in the "new" method call, or call "new" with "auto_login => 0" so cache_dir will not be used.

cache_file

Sets or returns the name of the file into which the login cache data is stored. Defaults to login_cache.

If using this from a CGI script, you will need to provide the account and password in the "new" method call, so cache_file will not be used.

auto_login

Really only useful as an option passed to the "new" method when creating a new object.

 # Create a new object and prompt the user to log in.
 my $site = new WWW::MySite( auto_login => 1 );

human

When set to a true value (which is the default), adds delays to make the module act more like a human. This is both to offset "faux security" measures, and to conserve bandwidth. If you're trying to use multiple accounts to spam users who don't want to hear what you have to say, you should turn this off because it'll make your spamming go faster.

use_defaults

When set to a true value, cached username and password will be used, and the user will only be prompted for a username and password if one or both aren't already stored.

FUNCTIONS

new( $account, $password )

new( )

If called without the optional account and password, the new method looks in a user-specific preferences file in the user's home directory for the last-used account and password. It prompts for the username and password with which to log in, providing the last-used data (from the preferences file) as defaults.

Once the account and password have been retrieved, the new method automatically invokes the "site_login" method and returns a new object reference. The new object already contains the content of the user's "home" page, the user's friend ID, and a WWW::Mechanize object used internally as the "browser" that is used by all methods in the class.

If account_name and password are specified, the "new" method will set auto_login to 1 and call the "site_login" method. This just means that if you pass an account_name and password when creating the object, it'll log you in unless you explicitly state "auto_login => 0".

WWW::Sitebase::Navigator is a subclass of WWW::Sitebase, which basically just means people can call your "new" method in many ways:

    EXAMPLES
        use WWW::YourSiteModule;
        
        # Just create the object
        my $site = new WWW::YourSiteModule;
        
        # Prompt for username and password
        my $site = new WWW::YourSiteModule( auto_login => 1 );

        # Pass just username and password (logs you in)
        my $site = new WWW::YourSiteModule( 'my@email.com', 'mypass' );
        
        # Pass options as a hashref
        my $site = new WWW::YourSiteModule( {
            account_name => 'my@email.com',
            password => 'mypass',
            cache_file => 'passcache',
        } );
        
        # Pass options as a hash
        my $site = new WWW::YourSiteModule(
            account_name => 'my@email.com',
            password => 'mypass',
            cache_file => 'passcache',
            auto_login => 0,  # Don't log in, just create the object
        );

site_login

Logs into the account identified by the "account_name" and "password" options.

If you don't call the new method with "auto_login => 1" (or with an account_name and password), you'll need to call this method yourself if you want to log in.

If the login gets a "you must be logged-in" page when you first try to log in, $site->error will be set to an error message that says to check the username and password.

Once login is successful for a given username/password combination, the object "remembers" that the username/password is valid, and if it encounters a "you must be logged-in" page, it will try up to 20 times to re-login.

_submit_login

This method just calls submit_form with the values specified in site_info. It's been separated out just in case you have a sticky login form and you want to override this method to do something fancy. The other option was to give a lot more options in site_info, but to really give the amount of control you might need, it just makes more sense to set up site_info for the usual cases, and override this method if you need to get fancy.

You must return 1 for success, 0 for failure. All you really need to do is this:

    # Submit the login form
    my $submitted = $self->submit_form(
                    page => $self->site_info->{'home_page'},
                    form_name => $self->site_info->{'login_form_name'},
                    form_no => $self->site_info->{'login_form_no'},
                    fields_ref => {
                      $self->site_info->{'account_field'} => $self->account_name,
                      $self->site_info->{'password_field'} => $self->password
                    }
                  );

    return $submitted;

And fill in your special values instead. Again, only do this if your login doesn't work with the stuff you set up in site_info.

_check_login

Checks for "You must be logged in to do that". If found, tries to log in again and returns 0, otherwise returns 1.

logout

Clears the current web browsing object and resets any login-specific internal values. Currently this drops and creates a new WWW::Mechanize object. This may change in the future to actually clicking "logout" or something.

CHECK STATUS

logged_in

Returns true if login was successful. When you call the new method of WWW::Sitebase::Navigator, the class logs in using the username and password you provided (or that it prompted for). It then retrieves your "home" page (the one you see when you click the "Home" button that's set up in your site_info field), and checks it against an RE. If the page matches the RE, logged_in is set to a true value. Otherwise it's set to a false value.

 Notes:
 - This method is only set on login. If you're logged out somehow,
   this method won't tell you that (yet - I may add that later).
 - The internal login method calls this method to set the value.
   You can (currently) call logged_in with a value, and it'll set
   it, but that would be stupid, and it might not work later
   anyway, so don't.

 Examples, pretending we have a subclass named WWW::Bebo to navigate a site
 named bebo.com:

 my $site = new WWW::Bebo;
 unless ( $site->logged_in ) {
    die "Login failed\n";
 }
 
 # This will log you in, looping forever until it succeeds.
 my $site;

 do {
    $site = new WWW::Bebo( $username, $password );
 } until ( $site->logged_in );
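
The page check described above can be pictured roughly like this. It's a sketch under assumptions, not the module's actual code: the looks_logged_in helper is hypothetical, the "s" modifier is my guess so that ".*" can span lines, and treating a page as logged in when no login_verify_re is given is an assumption too. The case-insensitive matching comes from the site_info comments in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper: decide from the page content whether the
# login worked, using the two REs from site_info.
sub looks_logged_in {
    my ( $site_info, $content ) = @_;
    my $fail_re   = $site_info->{not_logged_in_re};
    my $verify_re = $site_info->{login_verify_re};

    # The "you are not logged in" page always means failure.
    return 0 if defined $fail_re && $content =~ /$fail_re/is;

    # login_verify_re is optional; without it we assume success.
    return 1 unless defined $verify_re;

    return $content =~ /$verify_re/is ? 1 : 0;
}

my %site_info = (
    login_verify_re  => 'Welcome.*view my profile',
    not_logged_in_re => '<title>Sign In<\/title>',
);

my $failed  = looks_logged_in( \%site_info, '<html><title>Sign In</title></html>' );
my $success = looks_logged_in( \%site_info, "WELCOME back!\nView My Profile" );

print "failed page:  $failed\n";
print "success page: $success\n";
```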

error

This value is set by some methods to return an error message. If there's no error, it returns a false value, so you can do this:

 $site->get_profile( 12345 );
 if ( $site->error ) {
     warn $site->error . "\n";
 } else {
     # Do stuff
 }

current_page

Returns a reference to an HTTP::Response object that contains the last page retrieved by the WWW::Sitebase::Navigator object. All methods (i.e. get_page, post_comment, get_profile, etc) set this value.

EXAMPLE

The following will print the content of the user's profile page:

    use WWW::Bebo;
    my $site = new WWW::Bebo;
    
    print $site->current_page->decoded_content;

mech

The internal WWW::Mechanize object. Use at your own risk: I don't promise this method will stay here or work the same in the future. The internal methods used to access sites are subject to change at any time, including using something different than WWW::Mechanize.

get_page( $url, [ %options ] )

get_page returns a reference to an HTTP::Response object that contains the web page specified by $url.

get_page will try up to 20 times until it gets the page, with a 2-second delay between attempts. It checks for invalid HTTP response codes, and error pages as defined in site_info->{error_regexs}.

Options can be:

 re => $regular_expression
 follow => 1

"re" is a regular expression. If provided, get_page will consider the page an error unless the page content matches the regexp. This is designed to get past network problems and such.

If "follow" is set, a "Referer" header will be added, simulating clicking on a link on the current page to get to the URL provided.

EXAMPLE

    # The following displays the HTML source of MySpace.com's home
    # page, verifying that there is evidence of a login form on the
    # retrieved page.
    my $res = $site->get_page( "http://www.myspace.com/", re => 'E-Mail:.*?Password:' );
    
    print $res->decoded_content;
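
The retry behaviour described above can be sketched in plain Perl like this. It's only an illustration under assumptions: fetch_with_retries and its options are hypothetical names, the "fetch" code ref stands in for the real page request, and the regexp modifiers are guesses. The limit of 20 attempts and the 2-second delay are the defaults given above; the example sets the delay to 0 so it runs instantly.

```perl
use strict;
use warnings;

# Hypothetical retry helper. $opt{fetch} is a code ref returning
# ( $http_status, $content ); it stands in for the real request.
sub fetch_with_retries {
    my (%opt) = @_;
    my $max_attempts = $opt{max_attempts} // 20;
    my $delay        = $opt{delay}        // 2;
    my @error_res    = @{ $opt{error_regexs} || [] };

    for my $attempt ( 1 .. $max_attempts ) {
        my ( $status, $content ) = $opt{fetch}->( $opt{url} );

        # Only 2xx responses count as success in this sketch.
        my $ok = defined $status && $status >= 200 && $status < 300;

        # A 200 response can still be an error page.
        $ok = 0 if $ok && grep { $content =~ /$_/ } @error_res;

        # If "re" was given, the content must match it
        # (the "is" modifiers are an assumption).
        $ok = 0 if $ok && defined $opt{re} && $content !~ /$opt{re}/is;

        return ( $content, $attempt ) if $ok;

        sleep $delay if $delay && $attempt < $max_attempts;
    }
    return;    # every attempt failed
}

# Canned responses: an HTTP error, an error page with status 200,
# then the page we actually want.
my @responses = (
    [ 500, 'Internal Server Error' ],
    [ 200, 'Site is temporarily down' ],
    [ 200, "E-Mail: ...\nPassword: ..." ],
);

my ( $content, $attempts ) = fetch_with_retries(
    url          => 'http://www.example.com/',
    fetch        => sub { @{ shift @responses } },
    error_regexs => ['Site is temporarily down'],
    re           => 'E-Mail:.*?Password:',
    delay        => 0,
);

print "Got the page after $attempts attempt(s)\n";
```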

follow_to( $url, $regexp )

Convenience method that calls get_page with follow => 1. Use this if you're stepping through pages.

This is the method you "should" use to navigate your sites, as it's the most "human"-looking.

follow_link is like a robust version of WWW::Mechanize's "follow_link" method. It calls "find_link" with your arguments (and as such takes the same arguments). It adds the "re" argument, which is passed to get_page to verify that we in fact got the page. It returns an HTTP::Response object if it succeeds; if it fails, it sets $self->error and returns undef.

    $self->_go_home;
    $self->follow_link( text_regex => qr/inbox/i, re => 'Mail Center' )
        or die $self->error;

There are a lot of options, so perldoc WWW::Mechanize and search for $mech->find_link to see them all.

_cache_page( $url, $res )

Stores $res in a cache.

_read_cache( $url )

Check the cache for this page.

_clean_cache

Cleans any non-"fresh" page from the cache.

submit_form( %options )

 Valid options:
 $site->submit_form(
    page => "http://some.url.org/formpage.html",
    form_no => 1,
    form_name => "myform",  # Use this OR form_no OR form
    form => $form, # HTML::Form object with a ready-to-post form.
                   # (page, form_no, form_name, fields_ref and action will
                   # be ignored).
    button => "mybutton",
    no_click => 0,  # 0 or 1.
    fields_ref => { field => 'value', field2 => 'value' },
    re1 => 'something unique.?about this[ \t\n]+page',
    re2 => 'something unique about the submitted page',
    action => 'http://some.url.org/newpostpage.cgi', # Only needed in weird occasions
 );

This powerful little method reads the web page specified by "page", finds the form specified by "form_no" or "form_name", fills in the values specified in "fields_ref", and clicks the button named "button".

You may or may not need this method - it's used internally by any method that needs to fill in and post a form. I made it public just in case you need to fill in and post a form that's not handled by another method (in which case, see CONTRIBUTING below :).

"page" can either be a text string that is a URL or a reference to an HTTP::Response object that contains the source of the page that contains the form. If it is an empty string or not specified, the current page ( $site->current_page ) is used.

"form_no" is used to numerically identify the form on the page. It's a simple counter starting from 1. If there are 3 forms on the page and you want to fill in and submit the second form, set "form_no => 2". For the first form, use "form_no => 1".

"form_name" is used to identify the form by name. In actuality, submit_form simply uses "form_name" to iterate through the forms and sets "form_no" for you.

"form" can be used if you have a customized form you want to submit. Pass an HTML::Form object and set "button", "no_click", and "re2" as desired, and you can use submit_form's tenacious submission routine with your own values.

"button" is the name of the button to submit. This will frequently be "submit", but if they've named the button something clever like "Submit22" (as MySpace did in their login form), then you may have to use that. If no button is specified (either by button => '' or by not specifying button at all), the first button on the form is clicked.

If "no_click" is set to 1, the form will be submitted without clicking any button. This is used to simulate the JavaScript form submits Myspace does on the browse pages.

"fields_ref" is a reference to a hash that contains field names and values you want to fill in on the form. For checkboxes with no "value" attribute, specify a value of "on" to check it, "off" to uncheck it.

"re1" is an optional Regular Expression that will be used to make sure the proper form page has been loaded. The page content will be matched to the RE, and will be treated as an error page and retried until it matches. See get_page for more info.

"re2" is an optional RE that will be used to make sure that the post was successful. USE THIS CAREFULLY! If your RE breaks, you could end up repeatedly posting a form.

"action" is the post action for the form, as in:

 <form action="http://www.mysite.com/process.cgi">

This is here because Myspace likes to do weird things like reset form actions with Javascript then post them without clicking form buttons.
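
The "form_name" lookup mentioned above, where the name is turned into a "form_no", can be pictured with this small sketch. The form_no_for_name helper is hypothetical and works on a plain list of form names rather than real HTML::Form objects, so treat it as an illustration of the counting only.

```perl
use strict;
use warnings;

# Hypothetical helper: given the names of the forms on a page, in
# page order, return the 1-based form_no of the first form called
# $form_name, or nothing if there is no such form.
sub form_no_for_name {
    my ( $form_name, @names ) = @_;
    for my $i ( 0 .. $#names ) {
        return $i + 1
            if defined $names[$i] && $names[$i] eq $form_name;
    }
    return;    # no form with that name
}

# Three forms on the page; the second one has no name attribute.
my @names   = ( 'search', undef, 'login' );
my $form_no = form_no_for_name( 'login', @names );

print "form_no => $form_no\n";
```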

EXAMPLE

This is how WWW::Myspace's post_comment method posted a comment:

    # Submit the comment to $friend_id's page
    $self->submit_form( "${VIEW_COMMENT_FORM}${friend_id}", 1, "submit",
                        { 'f_comments' => "$message" }, '', 'f_comments'
                    );
    
    # Confirm it
    $self->submit_form( "", 1, "submit", {} );

_add_to_form

Internal method to add a hidden field to a form. HTML::Form thinks we don't want to change hidden fields, and if a hidden field has no value, it won't even create an input object for it. If that's way over your head don't worry, it just means we're fixing things with this method, and submit_form will call this method for you if you pass it a field that doesn't show up on the form.

Returns a form object that is the old form with the new field in it.

 # Add field $fieldname to form $form (an HTML::Form object) and
 # set its value to $value.
 $self->_add_to_form( $form, $fieldname, $value )

_go_home

Internal method to go to the home page. Checks to see if we're already there. If not, tries to click the Home button on the page. If there isn't one, loads the page explicitly.
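
Here's a sketch of that three-step decision using the home_uri_re, home_link_re and home_url entries from site_info. The plan_go_home helper is hypothetical and makes no requests; it only reports what would be done. The sample URLs are made up, and matching the REs without modifiers is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper that only decides what to do next.
# $links is an arrayref of link URLs found on the current page.
sub plan_go_home {
    my ( $site_info, $current_url, $links ) = @_;
    my $home_uri_re  = $site_info->{home_uri_re};
    my $home_link_re = $site_info->{home_link_re};

    # 1. Already on the home page?
    return ( 'stay', $current_url ) if $current_url =~ /$home_uri_re/;

    # 2. Is there a "Home" link on the current page?
    for my $link (@$links) {
        return ( 'follow', $link ) if $link =~ /$home_link_re/;
    }

    # 3. Fall back to loading the home URL directly.
    return ( 'load', $site_info->{home_url} );
}

my %site_info = (
    home_uri_re  => '\?fuseaction=user&',
    home_link_re => 'fuseaction=user',
    home_url     => 'http://www.myspace.com?fuseaction=user',
);

my ( $action, $url ) = plan_go_home(
    \%site_info,
    'http://www.example.com/index.cfm?fuseaction=mail.inbox',
    [
        'http://www.example.com/help',
        'http://www.example.com/index.cfm?fuseaction=user&id=1',
    ],
);

print "$action $url\n";
```

If your site doesn't fit this pattern, overriding _go_home is probably simpler than bending the REs to fit.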

make_cache_dir

Creates the cache directory in cache_dir. Only creates the top-level directory, croaks if it can't create it.

    $myspace->cache_dir("/path/to/dir");
    $myspace->make_cache_dir;

This function mainly exists for the internal login method to use, and for related sub-modules that store their cache files by default in WWW::Myspace's cache directory.

debug( message );

Use this method to turn on/off debugging output.

AUTHOR

Grant Grueninger, <grantg at cpan.org>

BUGS

Please report any bugs or feature requests to bug-www-bebo at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-Bebo. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc WWW::Bebo

ACKNOWLEDGEMENTS

COPYRIGHT & LICENSE

Copyright 2006 Grant Grueninger, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.