anarch - A script for creating offline copies of websites
Version 0.03 (alpha)
anarch -start=http://www.example.com/some_page.html \
     [ -root=http://www.example.com ] \
     [ -exclude='^http://www\.example\.com/dont-want-this/' ] \
     [ -save-as=folder\ name ] \
     [ -depth=5 ] \
     [ -run-scripts ] \
     [ -remove-scripts ] \
     [ -dom ] \
     [ -sync ]
anarch is a script for creating offline copies of websites. It downloads a website, correcting links in pages and style sheets so that they are all relative (and all links outside the root directory are absolute), and removing '<base href>' tags. It can also run scripts in pages (to find out which files the scripts use or to save pages with generated content) and remove scripts.
-start
The page to start on
-root
Only get URLs beginning with this. If this is omitted, -start is used, trimmed to the last slash.
-exclude
Regular expression for URLs to be excluded
-save-as
Where to save it. If this is omitted, the last path segment of -root is used.
-depth
How many links to follow one after the other before going back
-run-scripts
Run scripts in HTML pages. This will be used to find which files the scripts need, so that those can be fetched as well. It is not always guaranteed that this will work, as some scripts have absolute URLs hard-coded.
-remove-scripts
Remove scripts from HTML pages. This can be used in conjunction with -run-scripts to save generated content while removing the scripts that generated it. -remove-scripts implies -dom.
-dom
Save the HTML DOM, possibly modified by scripts or -remove-scripts. Without this, the only changes to the DOM that are saved are those made to links to make them relative.
-sync
Synchronise mode: Only files that have changed since the last download will be downloaded if this option is given.
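How -root, -save-as, -exclude and -depth might combine can be sketched in plain shell. This illustrates the documented behaviour and is not anarch's actual code: the function names (default_root, default_save_as, in_scope, links_of, crawl) and the toy link graph are hypothetical, keeping the trailing slash on the default root is an assumption, grep -E stands in for Perl's regular expressions, and the way the depth is counted is a guess.

```shell
#!/bin/sh
# Illustration only; none of these functions are part of anarch.

# -root default: -start trimmed to its last slash (slash kept: an assumption).
default_root() {
    printf '%s\n' "${1%/*}/"
}

# -save-as default: last path segment of the root
# (a trailing slash is ignored first: an assumption).
default_save_as() {
    trimmed=${1%/}
    printf '%s\n' "${trimmed##*/}"
}

# Succeeds if URL begins with ROOT and does not match the optional pattern.
# usage: in_scope URL ROOT [EXCLUDE_REGEX]
in_scope() {
    case "$1" in
        "$2"*) ;;              # begins with the root: go on to the exclude check
        *) return 1 ;;
    esac
    if [ -n "${3:-}" ] && printf '%s\n' "$1" | grep -Eq -- "$3"; then
        return 1
    fi
    return 0
}

# Hypothetical link graph standing in for downloaded pages.
links_of() {
    case "$1" in
        'http://www.example.com/index.html')
            printf '%s\n' 'http://www.example.com/a.html' \
                          'http://www.example.com/dont-want-this/x.html' ;;
        'http://www.example.com/a.html')
            printf '%s\n' 'http://www.example.com/b.html' ;;
        'http://www.example.com/b.html')
            printf '%s\n' 'http://www.example.com/c.html' ;;
    esac
}

# Depth-first walk: follow links one after the other until DEPTH runs out,
# then go back. usage: crawl URL DEPTH ROOT [EXCLUDE_REGEX]
crawl() {
    in_scope "$1" "$3" "${4:-}" || return 0
    printf '%s\n' "$1"
    [ "$2" -gt 0 ] || return 0
    for link in $(links_of "$1"); do
        crawl "$link" $(( $2 - 1 )) "$3" "${4:-}"
    done
}

start='http://www.example.com/some_page.html'
root=$(default_root "$start")
default_root "$start"        # prints http://www.example.com/
default_save_as "$root"      # prints www.example.com
# Visits index.html, a.html and b.html; c.html is too deep, x.html is excluded.
crawl 'http://www.example.com/index.html' 2 "$root" \
    '^http://www\.example\.com/dont-want-this/'
```

A real crawler would also need to remember which URLs it has already fetched; the sketch leaves that out to stay short.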
If you find any bugs, please e-mail the author.
This program doesn't take redirection into account.
It doesn't work with pages that have an explicit encoding in the source code.
It doesn't work with pages not in UTF-8. (It would need to apply a charset attribute to various elements or save the page in the original encoding.)
Style attributes containing URLs get mangled.
When URLs in CSS style sheets are made relative, they are not properly escaped, so quotation marks may produce invalid CSS.
The -run-scripts option causes the script to eat up all your memory.
If a script browses to another page and -dom or -remove-scripts is specified, then the wrong DOM tree is serialised.
There is no way to set the local IP address to bind to. But you can use this nutty workaround, which requires Hook::WrapSub:
perl -sS -MHook::WrapSub -MIO::Socket::INET \
     -M'less Hook::WrapSub::wrap_subs sub{push @_, LocalAddr => "10.10.10.205" }, "IO::Socket::INET::new"' \
     anarch ...
There are no tests yet.
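The style-sheet escaping bug listed above comes down to quoting. As a minimal sketch of one possible fix, assuming the rewritten URL is written out inside a double-quoted url("...") token: backslashes and double quotes are each prefixed with a backslash, as CSS string syntax requires. The helper name css_url is hypothetical and not part of anarch, and newlines in URLs are not handled.

```shell
#!/bin/sh
# Illustration only; css_url is a hypothetical helper, not part of anarch.

# Wrap a URL in url("...") with backslashes and double quotes escaped.
css_url() {
    # Escape backslashes first so the ones added for quotes are not doubled.
    escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf 'url("%s")\n' "$escaped"
}

css_url 'images/a"b.png'     # prints url("images/a\"b.png")
```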
This program requires perl 5.8.3 or higher (5.8.4 or higher recommended) and the following CPAN modules:
WWW::Scripter
CSS::DOM 0.03 or higher
URI
File::Slurp
LWP 5.815 or higher
HTML::DOM 0.025 or higher
WWW::Scripter::Plugin::Ajax is required for the -run-scripts option to work.
Copyright (C) 2009, Father Chrysostomos (sprout at, um, cpan dot org)
This program is free software; you may redistribute or modify it (or both) under the same terms as perl.