Evan Carroll


HTML::TreeBuilderX::ASP_NET - Scrape ASP.NET/VB.NET sites which utilize Javascript POST-backs.


        use LWP::UserAgent;
        use HTML::TreeBuilder;
        use HTML::TreeBuilderX::ASP_NET;

        my $ua = LWP::UserAgent->new;
        my $resp = $ua->get('http://uniqueUrl.com/Server.aspx');
        my $root = HTML::TreeBuilder->new_from_content( $resp->content );
        my $a = $root->look_down( _tag => 'a', id => 'nextPage' );
        my $aspnet = HTML::TreeBuilderX::ASP_NET->new({
                element   => $a
                , baseURL => $resp->request->uri ## takes into account posting redirects
        });
        $resp = $ua->request( $aspnet->httpResponse );

        ## or the easy cheating way see the SEE ALSO section for links
        $aspnet = HTML::TreeBuilderX::ASP_NET->new_with_traits( traits => ['htmlElement'] );
        ## $form is an HTML::Element you already hold for the page's form
        $form->look_down(_tag=> 'a')->httpResponse;


Scrape ASP.NET sites which utilize the framework's hidden post-back fields: __VIEWSTATE, __EVENTTARGET, __EVENTARGUMENT, __LASTFOCUS, et al. The method ->httpResponse builds, from the form, the HTTP::Request for the post-back, which you then hand to your user agent.

In this scheme many of the links on a webpage will appear to be JavaScript functions. The default JavaScript function is __doPostBack(eventTarget, eventArgument). ASP.NET has two hidden fields which record state: __VIEWSTATE and __LASTFOCUS. It abstracts each link with a method that utilizes an HTTP post-back to the server. The JavaScript behind __doPostBack simply appends __EVENTTARGET=$eventTarget&__EVENTARGUMENT=$eventArgument onto the POST request from the parent form and submits it. When the server receives this request, it decodes and decompresses the __VIEWSTATE and uses it along with the new __EVENTTARGET and __EVENTARGUMENT to perform the action, which is often no more than serializing the data back into the __VIEWSTATE.
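The post-back described above can be sketched in core Perl. This is only an illustration of what the browser ends up sending, not this module's implementation; form_escape and postback_body are hypothetical helper names, and the field values are the placeholder ones used elsewhere in this document.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
# Assumes byte strings; real form encoding may also use '+' for spaces.
sub form_escape {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $s;
}

# Hypothetical helper: merge the form's hidden fields with the two
# event fields and serialize them as a urlencoded POST body.
sub postback_body {
    my ($fields, $target, $argument) = @_;
    my %post = (
        %$fields,
        __EVENTTARGET   => $target,
        __EVENTARGUMENT => $argument,
    );
    my @pairs;
    for my $name (sort keys %post) {    # sorted only to make output stable
        push @pairs, form_escape($name) . '=' . form_escape($post{$name});
    }
    return join '&', @pairs;
}

my $body = postback_body( { __VIEWSTATE => 'encryptedXML-FOO' }, 'gotopage', '2' );
print "$body\n";
# prints __EVENTARGUMENT=2&__EVENTTARGET=gotopage&__VIEWSTATE=encryptedXML-FOO
```

The order of the fields in the body does not matter to the server; sorting is used here only so the output is predictable.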

Sometimes developers cloak the __doPostBack(target,arg) with names akin to changepage(arg), which simply call __doPostBack("target", arg). This module handles this use case as well, using an explicit eventTriggerArgument in the constructor.
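For a cloaked link, the target has to come from reading the page's script, and only the argument can be taken from the link itself. A minimal sketch, assuming a wrapper named changepage and a target named gotopage (both hypothetical), and assuming that eventTriggerArgument accepts the same { target => argument } shape described for createInputElements:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the argument out of a wrapper call such as
# changepage(3), changepage('3') or changepage("3"), and pair it with a
# target name that you looked up in the page's JavaScript by hand.
sub wrapper_event {
    my ($href, $target) = @_;
    my ($arg) = $href =~ /\bchangepage\(\s*['"]?([^'")]*?)['"]?\s*\)/
        or return;    # not a changepage() link
    my %event = ( $target => $arg );
    return \%event;
}

my $event = wrapper_event( q{javascript:changepage('3')}, 'gotopage' );
print join('=', %$event), "\n";    # prints gotopage=3

# Assumed usage with this module (not run here):
#   HTML::TreeBuilderX::ASP_NET->new({
#       eventTriggerArgument => $event,
#       form                 => $form,
#   });
```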

This flow is the bane of RESTful HTTP and makes no sense whatsoever. Thanks, Microsoft.

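      .-------------------------------------------------------------------.
      |                            HTML FORM 1                            |
      | <form action="Server.aspx" method="post">                         |
      | <input type="hidden" name="__VIEWSTATE" value="encryptedXML-FOO"> |
      | <a>1</a> |                                                        |
      | <a href="javascript:__doPostBack('gotopage','2')">2</a>           |
      | ...                                                               |
      '-------------------------------------------------------------------'
                                        |
                                        v
                       _________________________________
                       \                                \
                        ) User clicks the link named "2" )
                       /________________________________/
                                        |
                                        v
   .------------------------------------------------------------------------.
   | POST http://aspxnonsensery/Server.aspx                                 |
   | Content-Length: 2659                                                   |
   | Content-Type: application/x-www-form-urlencoded                        |
   |                                                                        |
   | __VIEWSTATE=encryptedXML-FOO&__EVENTTARGET=gotopage&__EVENTARGUMENT=2  |
   '------------------------------------------------------------------------'
                                        |
                                        v
    .----------------------------------------------------------------------.
    |                             HTML FORM 2                              |
    |                       (different __VIEWSTATE)                        |
    | <form action="Server.aspx" method="post">                            |
    | <input type="hidden" name="__VIEWSTATE" value="encryptedXML-BAR">    |
    | <a href="javascript:__doPostBack('gotopage','1')">1</a> |            |
    | <a>2</a>                                                             |
    | ...                                                                  |
    '----------------------------------------------------------------------'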



->new({ hashref })

Takes a HashRef and returns a new instance. Some of the possible key/value pairs are:

form => $htmlElement

optional: Explicitly supply the HTML::Element representing the form. If you do not, one will be deduced implicitly from $self->element, which makes element => $htmlElement a requirement.

eventTriggerArgument => $hashRef

Not needed if you supply an element. This takes a HashRef and creates HTML::Elements that mimic hidden input fields, which are then tacked onto the $self->form.

element => $htmlElement

Not needed if you send an eventTriggerArgument. Attempts to deduce the __EVENTARGUMENT and __EVENTTARGET from the 'href' attribute of the element just as if the two were supplied explicitly. It will also be used to deduce a form by looking up in the HTML tree if one is not supplied.

debug => *0|1

optional: Passes the debug flag on to HTTP::Request::Form (H:R:F). The default is off.

baseURL => $uri

optional: Sets the base of the URL for the post action


->httpResponse

Returns an HTTP::Request object for the HTTP POST.


Explicitly returns the underlying HTTP::Request::Form object. All methods fall back to it anyway, but this returns the object directly.


None of these are exported...

createInputElements( {eventTarget => eventArgument} )

Helper function that takes a single key/value pair in a HashRef. It assumes the key is the __EVENTTARGET and the value the __EVENTARGUMENT, and returns two HTML::Element pseudo-input fields with that information.
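The idea can be sketched without any CPAN modules. The stand-in below returns plain hashes where the real function returns HTML::Element objects; the name create_input_elements and the attribute layout are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Stand-in for createInputElements: plain hashes instead of HTML::Element
# objects, so the sketch runs with core Perl only (an assumption).
sub create_input_elements {
    my ($event) = @_;
    die "expected exactly one target => argument pair\n"
        unless ref($event) eq 'HASH' && keys(%$event) == 1;

    # A one-pair hash flattens to (key, value).
    my ($target, $argument) = %$event;
    return (
        { _tag => 'input', type => 'hidden', name => '__EVENTTARGET',   value => $target },
        { _tag => 'input', type => 'hidden', name => '__EVENTARGUMENT', value => $argument },
    );
}

my @inputs = create_input_elements( { gotopage => '2' } );
printf "%s=%s\n", $_->{name}, $_->{value} for @inputs;
# prints:
#   __EVENTTARGET=gotopage
#   __EVENTARGUMENT=2
```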

parseDoPostBack( $str )

Accepts a string that is often the "href" attribute of an HTML::Element. It simply parses out the call to JavaScript, using regexes, and makes the two args usable to Perl in the form of a HashRef.
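A regex along the following lines is one way to do this kind of parsing. It is a sketch, not the module's actual pattern: parse_do_post_back is a hypothetical name, and the { target => argument } return shape is an assumption chosen to match what createInputElements is described as taking.

```perl
use strict;
use warnings;

# Hypothetical parser for hrefs such as
#   javascript:__doPostBack('gotopage','2')
# Each argument may be single- or double-quoted; the backreferences \1 and
# \3 require the closing quote to match the opening one.
sub parse_do_post_back {
    my ($href) = @_;
    return unless defined $href;
    my ( undef, $target, undef, $argument ) =
        $href =~ /__doPostBack\(\s*(['"])(.*?)\1\s*,\s*(['"])(.*?)\3\s*\)/
        or return;    # no __doPostBack call found
    my %event = ( $target => $argument );
    return \%event;
}

my $event = parse_do_post_back( q{javascript:__doPostBack('gotopage','2')} );
print join('=', %$event), "\n";    # prints gotopage=2
```

Links whose href does not contain a __doPostBack call yield undef in scalar context, so the caller can skip them.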



For an easy way to glue the two together


For the object the method htmlElement returns


For a base class, to which all methods are valid


For the base class of all HTML tokens


Evan Carroll, <me at evancarroll.com>


None, though *much* more support should be added to ->element. Not everything is a simple anchor tag.


You can find documentation for this module with the perldoc command.

perldoc HTML::TreeBuilderX::ASP_NET

You can also look for information at:


Copyright 2008 Evan Carroll, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.