HTML::SimpleLinkExtor - Extract links from HTML
    use HTML::SimpleLinkExtor;

    my $extor = HTML::SimpleLinkExtor->new();
    $extor->parse_file($filename);
    #--or--
    $extor->parse($html);

    #extract all of the links
    @all_links   = $extor->links;

    #extract the img links
    @img_srcs    = $extor->img;

    #extract the frame links
    @frame_srcs  = $extor->frame;

    #extract the hrefs
    @area_hrefs  = $extor->area;
    @a_hrefs     = $extor->a;
    @base_hrefs  = $extor->base;
    @hrefs       = $extor->href;

    #extract the body background link
    @body_bg     = $extor->body;
    @background  = $extor->background;
This is a simple HTML link extractor designed for the person who does not want to deal with the intricacies of HTML::Parser or the de-referencing needed to get links out of HTML::LinkExtor.
You can extract all the links or some of the links (based on the HTML tag name or attribute name). If a <BASE HREF> tag is found, all of the relative URLs will be resolved according to that reference.
This module is simply a subclass of HTML::LinkExtor, so it can only parse what that module can handle. Invalid HTML or XHTML may cause problems.
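As a rough illustration of the BASE HREF behavior described above, the following sketch parses a small in-memory page and prints every link. The page content and the example.com and example.org URLs are made up for the example; only the `new`, `parse`, and `links` calls come from the synopsis.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical page: the BASE tag supplies the reference for relative URLs.
my $html = <<'HTML';
<html>
<head><base href="http://www.example.com/docs/"></head>
<body>
<a href="intro.html">Intro</a>
<img src="images/logo.png" alt="logo">
<a href="http://www.example.org/">Elsewhere</a>
</body>
</html>
HTML

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse($html);

# Relative URLs should come back resolved against the BASE HREF;
# links that are already absolute should be unchanged.
my @all_links = $extor->links;
print "$_\n" for @all_links;
```

The list may also include the BASE HREF itself, since BASE is one of the tags the module reports.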
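$extor = HTML::SimpleLinkExtor->new()
Create the link extractor object.
$extor = HTML::SimpleLinkExtor->new($base)
Create the link extractor object and resolve the relative URLs according to the supplied base URL. The supplied base URL overrides any other base URL found in the HTML.
$extor = HTML::SimpleLinkExtor->new('')
Create the link extractor object and do not resolve relative links.
$extor->parse_file($filename)
Parse the file for links.
$extor->parse($data)
Parse the HTML in $data.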
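The difference between the constructor forms can be sketched as follows. This assumes that passing the empty string is the form that turns off resolution; the base URL and the HTML fragment are hypothetical values chosen for illustration.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical fragment with two relative links and no BASE tag.
my $html = '<p><a href="guide/start.html">Start</a> '
         . '<img src="/img/logo.png" alt=""></p>' . "\n";

# Hypothetical base URL, used only for illustration.
my $base = 'http://www.example.org/docs/';

# Resolve relative links against the supplied base URL.
my $resolving = HTML::SimpleLinkExtor->new($base);
$resolving->parse($html);
my @absolute = $resolving->links;

# Assumption: an empty string asks the extractor to leave links as written.
my $raw = HTML::SimpleLinkExtor->new('');
$raw->parse($html);
my @unresolved = $raw->links;

print "resolved:   $_\n" for @absolute;
print "unresolved: $_\n" for @unresolved;
```

Using a separate object per document keeps the two result sets apart, since each extractor accumulates the links from whatever it has parsed.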
$extor->links
Return a list of the links.
$extor->img
Return a list of the links from all the SRC attributes of the IMG tags.
$extor->frame
Return a list of the links from all the SRC attributes of the FRAME tags.
$extor->src
Return a list of the links from all the SRC attributes of any tag.
$extor->a
Return a list of the links from all the HREF attributes of the A tags.
$extor->area
Return a list of the links from all the HREF attributes of the AREA tags.
$extor->base
Return a list of the links from all the HREF attributes of the BASE tags. There should only be one.
$extor->href
Return a list of the links from all the HREF attributes of any tag.
$extor->body
Return the link from the BODY tag's BACKGROUND attribute.
$extor->script
Return the link from the SCRIPT tag's SRC attribute.
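The tag-based and attribute-based accessors described above can be compared on one small page. This is a minimal sketch: the page and its URLs are invented, and the `src` method name is taken from the SRC-of-any-tag description rather than from the synopsis. The extractor is created with an empty base on the assumption that this leaves links unresolved, so the values print as written.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical page exercising several link-bearing tags.
my $html = <<'HTML';
<html>
<body background="images/paper.gif">
<a href="http://www.example.com/a.html">A</a>
<img src="http://www.example.com/pic.png" alt="">
<map name="m"><area href="http://www.example.com/area.html" alt=""></map>
<script src="http://www.example.com/app.js"></script>
</body>
</html>
HTML

my $extor = HTML::SimpleLinkExtor->new('');
$extor->parse($html);

# Tag-based accessors: links found in one kind of tag.
my @a_hrefs  = $extor->a;
my @img_srcs = $extor->img;
my @body_bg  = $extor->body;

# Attribute-based accessors: links found in that attribute on any tag.
my @hrefs = $extor->href;
my @srcs  = $extor->src;

print "a:    @a_hrefs\n";
print "img:  @img_srcs\n";
print "body: @body_bg\n";
print "href: @hrefs\n";
print "src:  @srcs\n";
```

Here `href` should cover both the A and AREA links, while `src` should cover both the IMG and SCRIPT links.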
This module doesn't handle all of the HTML tags that might have links. If someone wants those, I'll add them, or you can edit %AUTO_METHODS in the source.
Will Crain identified a problem with IMG links that had a USEMAP attribute.
This source is part of a SourceForge project which always has the latest sources in CVS, as well as all of the previous releases.
http://sourceforge.net/projects/brian-d-foy/
If, for some reason, I disappear from the world, one of the other members of the project can shepherd this module appropriately.
brian d foy, <bdfoy@cpan.org>
Copyright (c) 2004 brian d foy. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.