HTML::TreeBuilderX::ASP_NET - Scrape ASP.NET/VB.NET sites which utilize Javascript POST-backs.
    use LWP::UserAgent;
    use HTML::TreeBuilder;
    use HTML::TreeBuilderX::ASP_NET;

    my $ua   = LWP::UserAgent->new;
    my $resp = $ua->get('http://uniqueUrl.com/Server.aspx');
    my $root = HTML::TreeBuilder->new_from_content( $resp->content );

    ## $link rather than $a, which would shadow sort's special variable
    my $link = $root->look_down( _tag => 'a', id => 'nextPage' );
    my $aspnet = HTML::TreeBuilderX::ASP_NET->new({
        element   => $link
        , baseURL => $resp->request->uri ## takes into account posting redirects
    });
    $resp = $ua->request( $aspnet->httpResponse );

    ## or the easy cheating way, see the SEE ALSO section for links
    $aspnet = HTML::TreeBuilderX::ASP_NET->new_with_traits( traits => ['htmlElement'] );
    $form->look_down(_tag=> 'a')->httpResponse;
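Scrape ASP.NET sites which rely on the framework's hidden form fields: __VIEWSTATE, __EVENTTARGET, __EVENTARGUMENT, __LASTFOCUS, et al. Given a link and its form, this module builds the post-back as an HTTP::Request, which the ->httpResponse method returns. Pass that request to your user agent to obtain the HTTP::Response.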
In this scheme many of the links on a webpage will appear to be JavaScript function calls. The default JavaScript function is __doPostBack(eventTarget, eventArgument). ASP.NET has two hidden fields which record state: __VIEWSTATE and __LASTFOCUS. It abstracts each link with a method that utilizes an HTTP post-back to the server. The JavaScript behind __doPostBack simply appends __EVENTTARGET=$eventTarget&__EVENTARGUMENT=$eventArgument onto the POST request from the parent form and submits it. When the server receives this request it decodes and decompresses the __VIEWSTATE and uses it, along with the new __EVENTTARGET and __EVENTARGUMENT, to perform the action, which is often no more than serializing the data back into the __VIEWSTATE.
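The post-back described above can be sketched in plain Perl. This is only an illustration of the request body, not the module's own code: the helper names form_encode and build_postback_body are hypothetical, the field values come from the example form in this document, and the fields are sorted only to make the output predictable.

```perl
use strict;
use warnings;

# Percent-encode a byte string for an application/x-www-form-urlencoded body.
sub form_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Hypothetical helper: merge the form's hidden fields with the two values
# that __doPostBack fills in, then serialize everything as a POST body.
sub build_postback_body {
    my ($hidden_fields, $event_target, $event_argument) = @_;
    my %fields = (
        %$hidden_fields,
        __EVENTTARGET   => $event_target,
        __EVENTARGUMENT => $event_argument,
    );
    return join '&',
        map { form_encode($_) . '=' . form_encode($fields{$_}) }
        sort keys %fields;
}

my $body = build_postback_body(
    { __VIEWSTATE => 'encryptedXML-FOO' },
    'gotopage', '2',
);
print "$body\n";
```

A real form usually carries more hidden fields than __VIEWSTATE; all of them would go into the first argument.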
Sometimes developers cloak the __doPostBack(target,arg) with names akin to changepage(arg) which simply call __doPostBack("target", arg). This module handles this use case as well, by way of an explicit eventTriggerArgument in the constructor.
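For a cloaked call, the event target has to be read from the page's JavaScript by hand, and only the argument can be taken from the link. A minimal sketch of that step, assuming a hypothetical helper named trigger_from_cloaked_call and assuming the target => argument HashRef shape that this document describes for its helper function:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the single argument out of a cloaked call such
# as "javascript:changepage(3)" and pair it with an event target that the
# caller already knows from reading the page's JavaScript.
sub trigger_from_cloaked_call {
    my ($href, $function_name, $event_target) = @_;
    my ($argument) =
        $href =~ /\b\Q$function_name\E\(\s*['"]?([^'")]*?)['"]?\s*\)/
        or return;
    return { $event_target => $argument };
}

my $trigger = trigger_from_cloaked_call(
    'javascript:changepage(3)', 'changepage', 'target',
);
print "$_ => $trigger->{$_}\n" for keys %$trigger;
```

The resulting HashRef is the kind of value one would then supply as eventTriggerArgument; whether the constructor expects exactly this shape is an assumption worth checking against the module source.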
This flow is a bane on RESTLESS http and makes no sense whatsoever. Thanks Microsoft.
    .-------------------------------------------------------------------.
    |                            HTML FORM 1                            |
    | <form action="Server.aspx" method="post">                         |
    | <input type="hidden" name="__VIEWSTATE" value="encryptedXML-FOO"> |
    | <a>1</a>                                                          |
    | <a href="javascript:__doPostBack('gotopage','2')">2</a>           |
    | ...                                                               |
    '-------------------------------------------------------------------'
                                      |
                                      v
                     _________________________________
                     \                                \
                      ) User clicks the link named "2" )
                     /________________________________/
                                      |
                                      v
    .------------------------------------------------------------------------.
    | POST http://aspxnonsensery/Server.aspx                                 |
    | Content-Length: 2659                                                   |
    | Content-Type: application/x-www-form-urlencoded                        |
    |                                                                        |
    | __VIEWSTATE=encryptedXML-FOO&__EVENTTARGET=gotopage&__EVENTARGUMENT=2  |
    '------------------------------------------------------------------------'
                                      |
                                      v
    .----------------------------------------------------------------------.
    |                             HTML FORM 2                              |
    |                       (different __VIEWSTATE)                        |
    | <form action="Server.aspx" method="post">                            |
    | <input type="hidden" name="__VIEWSTATE" value="encryptedXML-BAR">    |
    | <a href="javascript:__doPostBack('gotopage','1')">1</a>              |
    | <a>2</a>                                                             |
    | ...                                                                  |
    '----------------------------------------------------------------------'
IN ADDITION TO ALL OF THE METHODS FROM HTTP::Request::Form
Takes a HashRef and returns a new instance. Some of the possible key/value pairs are:
optional: Explicitly supply the HTML::Element representing the form. If you do not, one will be deduced from $self->element, which makes element => $htmlElement a requirement.
Not needed if you supply an element. This takes a HashRef and creates HTML::Elements that mimic hidden input fields, which are then tacked onto the $self->form.
Not needed if you send an eventTriggerArgument. Attempts to deduce the __EVENTARGUMENT and __EVENTTARGET from the 'href' attribute of the element just as if the two were supplied explicitly. It will also be used to deduce a form by looking up in the HTML tree if one is not supplied.
optional: Passes the debug flag through to HTTP::Request::Form (H:R:F); default is off.
optional: Sets the base of the URL for the post action
Returns an HTTP::Request object for the HTTP POST
Explicitly returns the underlying HTTP::Request::Form object. All methods fall back to it anyway, but this returns the object directly.
None of these are exported...
Helper function that takes a HashRef with a single pair. It assumes the key is the __EVENTTARGET and the value the __EVENTARGUMENT, and returns two HTML::Element pseudo-input fields holding that information.
Accepts a string, which is often the "href" attribute of an HTML::Element. It simply parses out the call to JavaScript, using regexes, and makes the two arguments usable from Perl in the form of a HashRef.
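The parsing step can be approximated with a single regex. The sketch below is a stand-in written for illustration, not the module's implementation: the name parse_do_post_back is hypothetical, the target => argument return shape is assumed from the helper description above, and it only handles two quoted arguments with no escaped quotes inside them.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsing step; not the module's own code.
# Returns { eventTarget => eventArgument }, or undef when no call is found.
sub parse_do_post_back {
    my ($href) = @_;
    my (undef, $target, undef, $argument) = $href =~ m{
        __doPostBack \s* \(
            \s* (['"]) (.*?) \1 \s* ,    # first quoted argument: the target
            \s* (['"]) (.*?) \3 \s*      # second quoted argument
        \)
    }x or return;
    return { $target => $argument };
}

my $parsed = parse_do_post_back(q{javascript:__doPostBack('gotopage','2')});
print "$_ => $parsed->{$_}\n" for keys %$parsed;
```

Unquoted arguments, as in a call written __doPostBack("target", arg), are deliberately not matched here; that case needs the argument resolved some other way.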
For an easy way to glue the two together
For the object the method htmlElement returns
For a base class, to which all methods are valid
For the base class of all HTML tokens
Evan Carroll, <me at evancarroll.com>
None, though *much* more support should be added to ->element. Not everything is a simple anchor tag.
You can find documentation for this module with the perldoc command.
perldoc HTML::TreeBuilderX::ASP_NET
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-TreeBuilderX-ASP_NET
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/HTML-TreeBuilderX-ASP_NET
CPAN Ratings
http://cpanratings.perl.org/d/HTML-TreeBuilderX-ASP_NET
Search CPAN
http://search.cpan.org/dist/HTML-TreeBuilderX-ASP_NET
Copyright 2008 Evan Carroll, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install HTML::TreeBuilderX::ASP_NET, copy and paste the appropriate command into your terminal.

With cpanm:

    cpanm HTML::TreeBuilderX::ASP_NET

With the CPAN shell:

    perl -MCPAN -e shell
    install HTML::TreeBuilderX::ASP_NET
For more information on module installation, please visit the detailed CPAN module installation guide.