HTML::AsHTML - Return the same HTML document as was put in.


HTML::AsHTML is an HTML parser that tries to return exactly what was parsed. In the process, it applies certain fixes to the HTML (such as adding quotes to all values in start tags). As such, when it works on correct HTML, it is just a glorified way of doing a 'cat' and not much use. However, if you override some of its methods, it lets you build a stream editor that acts only on certain HTML elements.

In the above example, we pass the HTML through exactly as it was, except that whenever we detect a link, we try to rewrite it to account for the move of the base page of my climbing archive.

Returns the links found in the document as an array. Each array element is an anonymous array with the following values:

  [$tag, $attr => $url1, $attr2 => $url2,...]

Note that $p->links will always be empty if a callback routine was provided when the HTML::LinkExtor was created.
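
As a minimal sketch of how the returned structure can be walked, the following uses HTML::LinkExtor directly (the class named above) without a callback, so that the links accumulate in the parser. The sample markup and the @lines variable are made up for illustration, and no base URL is supplied, so the URLs come back as written in the document:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# No callback is given, so the links are stored inside the parser.
my $p = HTML::LinkExtor->new;

# Hypothetical sample markup, for illustration only.
$p->parse('<a href="a.html">x</a> <img src="b.gif">');
$p->eof;

# Each element is [$tag, $attr => $url, ...]; unpack it into a tag
# name and a hash of link attributes.
my @lines;
for my $link ($p->links) {
    my ($tag, %attr) = @$link;
    push @lines, "$tag: $_ => $attr{$_}" for sort keys %attr;
}

print "$_\n" for @lines;
```

Unpacking into a hash is convenient because a single tag may carry more than one link attribute.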


This is an example showing how you can extract links as a document is received using LWP:

  use LWP::UserAgent;
  use HTML::LinkExtor;
  use URI::URL;

  $url = "";  # for instance
  $ua = LWP::UserAgent->new;

  # Set up a callback that collects image links
  my @imgs = ();
  sub callback {
     my($tag, %attr) = @_;
     return if $tag ne 'img';  # we only look closer at <img ...>
     push(@imgs, values %attr);
  }

  # Make the parser.  Unfortunately, we don't know the base yet (it might
  # be different from $url)
  $p = HTML::LinkExtor->new(\&callback);

  # Request document and parse it as it arrives
  $res = $ua->request(HTTP::Request->new(GET => $url), sub {$p->parse($_[0])});

  # Expand all image URLs to absolute ones
  my $base = $res->base;
  @imgs = map { $_ = url($_, $base)->abs; } @imgs;

  # Print them out
  print join("\n", @imgs), "\n";




Gisle Aas <>