NAME

App::FetchwareX::HTMLPageSync - An App::Fetchware extension that downloads files based on an HTML page.

VERSION

version 1.016

SYNOPSIS

Example App::FetchwareX::HTMLPageSync Fetchwarefile.

    use App::FetchwareX::HTMLPageSync;

    page_name 'Cool Wallpapers';

    html_page_url 'http://some-html-page-with-cool.urls';

    destination_directory 'wallpapers';

    # pretend to be firefox
    user_agent 'Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1';

    # Customize the callbacks.
    html_treebuilder_callback sub {
        # Get one HTML::Element.
        my $h = shift;

        # Return true or false to indicate if this HTML::Element should be a
        # download link.
        if (something) {
            return 'True';
        } else {
            return undef;
        }
    };

    download_links_callback sub {
        my @download_urls = @_;

        my @wanted_download_urls;
        for my $link (@download_urls) {
            # Pick ones to keep.
            push @wanted_download_urls, $link;
        }

        return @wanted_download_urls;
    };

App::FetchwareX::HTMLPageSync App::Fetchware-like API.

    my $temp_dir = start();

    my $download_url = lookup();

    download($temp_dir, $download_url);

    verify($download_url, $package_path);

    unarchive($package_path);

    build($build_path);

    install();

    uninstall($build_path);

MOTIVATION

I want to download the wallpapers that a Web page links to, but I want software to parse the page and fetch them for me automatically. That's where this App::Fetchware extension comes in.

DESCRIPTION

App::FetchwareX::HTMLPageSync is an example App::Fetchware extension. It's not a large extension; instead, it is a simple one meant to show how easy it is to extend App::Fetchware.

App::FetchwareX::HTMLPageSync parses the Web page you specify to create a list of download links. Then it downloads those links, and installs them to your destination_directory.

To use App::FetchwareX::HTMLPageSync to mirror the download links on an HTML page, you need to create an App::FetchwareX::HTMLPageSync Fetchwarefile. You can do this easily by running fetchware new and typing in HTMLPageSync when it asks which extension the Fetchwarefile should use, or you can create a Fetchwarefile manually. Then you'll need to learn how to use that Fetchwarefile with fetchware.

App::FetchwareX::HTMLPageSync API SUBROUTINES

This is App::FetchwareX::HTMLPageSync's API that fetchware uses to execute any Fetchwarefiles that make use of App::FetchwareX::HTMLPageSync. This API is the same one that regular old App::Fetchware uses for most standard FOSS software, and this internal documentation is only needed when debugging HTMLPageSync's code or when studying it to create your own fetchware extension.

new()

    my ($program_name, $fetchwarefile) = new($term, $program_name);

    # Or in an extension, you can return whatever list of variables you want,
    # and then cmd_new() will provide them as arguments to new_install() except
    # a $term Term::ReadLine object will precede the others.
    my ($term, $program_name, $fetchwarefile, $custom_argument1, $custom_argument2)
        = new($term, $program_name);

new() is App::Fetchware's API subroutine that implements fetchware's new command. It simply uses Term::UI to ask the user some questions that determine what configuration options will be added to the generated Fetchwarefile. new() takes a $term, a Term::UI/Term::ReadLine object, and the optional name of the program, or in this case the Website, that HTMLPageSync is page syncing.

Whatever scalars (not references, just regular strings) new() returns will be shared with new()'s sister API subroutine new_install(), which cmd_new(), the subroutine that implements fetchware's new command, calls after new(). new_install() is called in the parent process, so it does have root permissions, so be sure to test it as root as well.

drop_privs() NOTES

This section notes whatever problems you might come across implementing and debugging your Fetchware extension due to fetchware's drop_privs mechanism.

See Util's drop_privs() subroutine for more info.

  • This subroutine is not run as root; instead, it is run as a regular user unless the stay_root configuration option has been set to true.

get_html_page_url()

    my $html_page_url = get_html_page_url($term);

Uses the $term argument as a Term::ReadLine/Term::UI object to interactively explain what an html_page_url is, and to ask the user to provide one and press enter.

get_destination_directory()

    my $destination_directory = get_destination_directory($term);

Uses the $term argument as a Term::ReadLine/Term::UI object to interactively explain what a destination_directory is, and to ask the user to provide one and press enter.

ask_about_keep_destination_directory()

    ask_about_keep_destination_directory($term, $fetchwarefile);

ask_about_keep_destination_directory() does just that: it asks the user if they would like to enable the keep_destination_directory configuration option to preserve their destination_directory when they uninstall the associated Fetchware package or Fetchwarefile. If they answer Y, keep_destination_directory is added to their Fetchwarefile. If not, nothing is added, because deleting their destination_directory is the default behavior whenever keep_destination_directory is absent from the Fetchwarefile.

new_install()

    my $fetchware_package_path = new_install($page_name, $fetchwarefile);

new_install() asks the user if they would like to install the previously generated Fetchwarefile that new() created. If they answer yes, then that program associated with that Fetchwarefile is installed. In our case, that means that whatever files are configured for download will be downloaded. If they answer no, then the path to the generated Fetchwarefile will be printed.

new_install() is imported by App::Fetchware::ExportAPI from App::Fetchware, and also exported by App::FetchwareX::HTMLPageSync. This is how App::FetchwareX::HTMLPageSync "subclasses" App::Fetchware.

check_syntax()

    'Syntax Ok' = check_syntax()

Configuration subroutines used:

  • none

Calls check_config_options() to check for the following syntax errors in Fetchwarefiles. Note that by the time check_syntax() has been called, parse_fetchwarefile() has already parsed the Fetchwarefile, and any syntax errors in the user's Fetchwarefile will have already been reported by Perl.

This may seem like a bug, but it's not. Do you really want to try to use regexes or something to try to parse the Fetchwarefile reliably, and then report errors to users? Or add PPI, of all insane Perl modules, as a dependency just to write syntax checking code that most of the time says the syntax is Ok anyway, and is therefore a complete waste of time and effort? I don't want to deal with any of that insanity.

Instead, check_syntax() uses config() to examine the already parsed Fetchwarefile for "higher-level" or "Fetchware-level" syntax errors, that is, syntax errors that are Fetchware syntax errors instead of just Perl syntax errors.

For your convenience and my own I created the check_config_options() helper subroutine. It's data driven, and will check Fetchwarefiles for three different types of common syntax errors that occur in App::Fetchware's Fetchwarefile syntax. These errors are more at the level of logic errors than actual syntax errors. See its POD for additional details.

The list below briefly describes what App::FetchwareX::HTMLPageSync's implementation of check_syntax() checks.

  • Mandatory configuration options

    • page_name, html_page_url, and destination_directory are required for all Fetchwarefiles.
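The mandatory-option check can be pictured with a small, self-contained sketch. It is not HTMLPageSync's actual implementation: missing_mandatory_options() is a hypothetical helper, and a plain hash stands in for the already parsed Fetchwarefile that check_syntax() would really examine through config().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of App::Fetchware): returns the names of the
# mandatory options that are absent from an already parsed configuration,
# which is represented here as a plain hash reference.
sub missing_mandatory_options {
    my ($config, @mandatory) = @_;

    return grep { !defined $config->{$_} } @mandatory;
}

# A parsed configuration that forgot destination_directory.
my %parsed_config = (
    page_name     => 'Cool Wallpapers',
    html_page_url => 'http://some-html-page-with-cool.urls',
);

my @missing_options = missing_mandatory_options(
    \%parsed_config,
    qw(page_name html_page_url destination_directory),
);

print "Missing mandatory option: $_\n" for @missing_options;
```

Keeping the list of mandatory names as data, rather than hard coding one test per option, is what makes a check like this "data driven".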

drop_privs() NOTES

This section notes whatever problems you might come across implementing and debugging your Fetchware extension due to fetchware's drop_privs mechanism.

See Util's drop_privs() subroutine for more info.

  • check_syntax() is run in the parent process before even start() has run, so no temporary directory is available for use.

start()

    my $temp_dir = start();

start() creates a temp dir, chmod 700's it, and chdir()'s to it just like the one in App::Fetchware does.

start() is imported by App::Fetchware::ExportAPI from App::Fetchware, and also exported by App::FetchwareX::HTMLPageSync. This is how App::FetchwareX::HTMLPageSync "subclasses" App::Fetchware.

lookup()

    my $download_url = lookup();

lookup() downloads the user specified html_page_url, parses it using HTML::TreeBuilder, and uses html_treebuilder_callback and download_links_callback, if specified, to manipulate the tree to determine what download urls the user wants.

This list of download urls is returned as an array reference, $download_url.

download()

    download($temp_dir, $download_url);

download() uses App::Fetchware's utility function download_http_url() to download all of the urls that lookup() returned. If the user specified a user_agent configuration option, then that option is passed along to download_http_url()'s call to HTTP::Tiny.

verify()

    verify($download_url, $package_path);

verify() simply calls App::Fetchware's :UTIL subroutine do_nothing(), which as you can tell from its name does nothing but return. The reason for the useless do_nothing() call is simply better documentation, and standardizing how to override an App::Fetchware API subroutine in order for it to do nothing at all, so that you can prevent the original App::Fetchware subroutine from doing what it normally does.

unarchive()

    unarchive();

unarchive() does nothing by calling App::Fetchware's :UTIL subroutine do_nothing(), which does nothing.

build()

    build($build_path);

build() does the same thing as verify(), and that is nothing by calling App::Fetchware's do_nothing() subroutine to better document the fact that it does nothing.

install()

    install($package_path);

install() takes the $package_path, which is really an array ref of the paths of the files that download() downloaded, and copies them to the user specified destination directory, destination_directory.
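The copy step can be sketched in a few lines of core Perl. This is only an illustration under stated assumptions: copy_to_destination() is a hypothetical name, and creating the destination directory when it is missing is an assumption of this sketch, not documented HTMLPageSync behavior.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: copies every file listed in the $package_path array
# ref into $destination_directory and returns the paths of the copies.
sub copy_to_destination {
    my ($package_path, $destination_directory) = @_;

    # Assumption of this sketch: create the directory if it is missing.
    make_path($destination_directory);

    my @installed;
    for my $file (@$package_path) {
        my $target
            = File::Spec->catfile($destination_directory, basename($file));
        copy($file, $target)
            or die "Failed to copy [$file] to [$target]: $!";
        push @installed, $target;
    }

    return @installed;
}

# Try it out on a throwaway file inside a temporary directory.
my $scratch_dir = tempdir(CLEANUP => 1);
my $downloaded  = File::Spec->catfile($scratch_dir, 'one.jpg');
open my $out, '>', $downloaded or die "Cannot create [$downloaded]: $!";
print {$out} "not really a jpeg\n";
close $out or die "Cannot close [$downloaded]: $!";

my $wallpapers = File::Spec->catdir($scratch_dir, 'wallpapers');
my @installed_files = copy_to_destination([$downloaded], $wallpapers);

print "Installed: $_\n" for @installed_files;
```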

end()

    end();

end() chdir()s back to the original directory, and cleans up the temp directory just like the one in App::Fetchware does.

end() is imported by App::Fetchware::ExportAPI from App::Fetchware, and also exported by App::FetchwareX::HTMLPageSync. This is how App::FetchwareX::HTMLPageSync "subclasses" App::Fetchware.

uninstall()

    uninstall($build_path);

Uninstalls App::FetchwareX::HTMLPageSync by recursively deleting the destination_directory where it stores the wallpapers or whatever you specified it to download for you. If you would like to keep your destination_directory, then set keep_destination_directory to true in your Fetchwarefile, and Fetchware will not delete your destination_directory when you uninstall your Fetchware package.

upgrade()

    my $upgrade = upgrade($download_path, $fetchware_package_path);

    if ($upgrade) {
        ...
    }

Configuration subroutines used:

  • none

Uses $download_path, an arrayref of URLs to download in HTMLPageSync, and compares it against the list of files that have already been downloaded, which it obtains by glob()ing destination_directory. The comparison is made on file names only.

Returns true if $download_path has any URLs that have not already been downloaded into destination_directory. Note: HTTP HEAD queries are not used to check whether any remote files are newer than the copies already in destination_directory.

Returns false if every URL in $download_path has already been downloaded into destination_directory.
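The file name comparison described above can be sketched as follows. This is a minimal illustration, not HTMLPageSync's code: needs_upgrade() and the example.com URLs are hypothetical, and bsd_glob() from File::Glob is used instead of the built-in glob() so that directory names containing spaces are not split.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Hypothetical helper: returns the number of URLs whose file name is not yet
# present in $destination_directory (so 0, which is false, means "up to date").
sub needs_upgrade {
    my ($download_urls, $destination_directory) = @_;

    # Index the names of the files that are already on disk.
    my %already_have = map { (basename($_) => 1) }
        bsd_glob("$destination_directory/*");

    # A URL counts as new if its last path component is not on disk yet.
    # Only names are compared; no timestamps or HTTP HEAD requests.
    my @new_urls = grep { !$already_have{ basename($_) } } @$download_urls;

    return scalar @new_urls;
}

# Simulate a destination_directory that already holds one.jpg.
my $destination = tempdir(CLEANUP => 1);
open my $fh, '>', "$destination/one.jpg" or die "Cannot create file: $!";
close $fh or die "Cannot close file: $!";

my @all_urls = (
    'http://example.com/walls/one.jpg',
    'http://example.com/walls/two.jpg',
);

my $new_files  = needs_upgrade(\@all_urls, $destination);
my $up_to_date = needs_upgrade([ $all_urls[0] ], $destination);

print "New files to download: $new_files\n";
```

Because only the trailing path component is compared, URLs that end in a query string or that share a file name across different directories would need extra handling in real code.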

drop_privs() NOTES

This section notes whatever problems you might come across implementing and debugging your Fetchware extension due to fetchware's drop_privs mechanism.

See Util's drop_privs() subroutine for more info.

  • upgrade() is run in the child process as nobody or user, because the child needs to know if it should actually bother running the rest of fetchware's API subroutines.

MANUALLY CREATING AN App::FetchwareX::HTMLPageSync FETCHWAREFILE

In order to use App::FetchwareX::HTMLPageSync you must first create a Fetchwarefile to use it. You can use fetchware new as explained above, or create one manually in your text editor.

1. Name it

Use your text editor to create a file with a .Fetchwarefile file extension. Use of this convention is not required, but it makes it obvious what type of file it is. Then, just copy and paste the example text below, and replace [page_name] with what you choose your page_name to be. page_name is a configuration option that simply names your Fetchwarefile. It is not actually used for anything other than to document what program or behavior this Fetchwarefile manages.

    use App::FetchwareX::HTMLPageSync;

    page_name '[page_name]';

Fetchwarefiles are actually small, well structured Perl programs that can contain arbitrary Perl code to customize fetchware's behavior, or, in most cases, simply specify a number of fetchware's or a fetchware extension's (as in this case) configuration options. Below is my filled-in example App::FetchwareX::HTMLPageSync Fetchwarefile.

    use App::FetchwareX::HTMLPageSync;

    page_name 'Cool Wallpapers';

Notice the use App::FetchwareX::HTMLPageSync; line at the top. That line is absolutely critical for this Fetchwarefile to work properly, because it is what allows fetchware to use Perl's own syntax as a nice easy to use syntax for Fetchwarefiles. If you do not use the matching use App::Fetchware...; line, then fetchware will spit out crazy errors from Perl's own compiler listing all of the syntax errors you have. If you ever receive that error, just ensure you have the correct use App::Fetchware...; line at the top of your Fetchwarefile.

2. Determine your html_page_url

At the heart of App::FetchwareX::HTMLPageSync is its html_page_url, which is the URL of the HTML page you want HTMLPageSync to download and parse for links to wallpapers or whatever else you'd like to automate downloading. To figure this out, just use your browser to find the HTML page you want to use, and then copy and paste the url between the single quotes ' as shown in the example below.

    html_page_url '';

And then after you copy the url.

    html_page_url 'http://some.url/something.html';

3. Determine your destination_directory

HTMLPageSync also needs to know your destination_directory. This is the directory that HTMLPageSync will copy your downloaded files to. This directory will also be deleted when you uninstall this HTMLPageSync fetchware package just like a standard App::Fetchware package would uninstall any installed software when it is uninstalled. Just copy and paste the example below, and fill in the space between the single quotes '.

    destination_directory '';

After pasting, it should look like this.

    destination_directory '~/wallpapers';

Furthermore, if you want to keep your destination_directory after you uninstall your HTMLPageSync fetchware package, just set the keep_destination_directory configuration option to true:

    keep_destination_directory 'True';

If this is set in your HTMLPageSync Fetchwarefile, HTMLPageSync will not delete your destination_directory when your HTMLPageSync fetchware package is uninstalled.

4. Specify other options

That's all there is to it unless you need to further customize HTMLPageSync's behavior to get just the links you need to download.

At this point you can install your new Fetchwarefile with:

    fetchware install [path to your new fetchwarefile]

Or you can further customize it as shown next.

5. Specify an optional user_agent

Many sites don't like bots downloading stuff from them and wasting their bandwidth, and will even limit what you can do based on your user agent, which is the HTTP standard's name for your browser. This option allows you to pretend to be something other than HTMLPageSync's underlying library, HTTP::Tiny. Just copy and paste the example below, and put what you want your user agent to be between the single quotes ' as before.

    user_agent '';

And after pasting.

    user_agent 'Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1';

6. Specify an optional html_treebuilder_callback

html_treebuilder_callback specifies an optional anonymous Perl subroutine reference that will replace the default one that HTMLPageSync uses. The default one limits the download to only image format links, which is flexible enough for downloading wallpapers.

If you want to download something different, then paste the example below in your Fetchwarefile.

    html_treebuilder_callback sub {
        # Get one HTML::Element.
        my $h = shift;

        # Return true or false to indicate if this HTML::Element should be a
        # download link.
        if (something) {
            return 'True';
        } else {
            return undef;
        }
    };

Then replace the placeholder condition with your own test, creating a Perl anonymous subroutine (a CODEREF) that will be executed instead of the default one. This requires knowledge of the Perl programming language. The one below limits itself to PDFs and MS Word documents.

    # Download PDFs and Word documents only.
    html_treebuilder_callback sub {
        my $tag = shift;
        my $link = $tag->attr('href');

        # Skip anchor tags that have no href attribute.
        return undef unless defined $link;

        # If the link points to a PDF or Word document...
        if ($link =~ /\.(pdf|doc|docx)$/) {
            # ...return true...
            return 'True';
        } else {
            # ...if not return false.
            return undef; #false
        }
    };

7. Specify an optional download_links_callback

download_links_callback specifies an optional anonymous Perl subroutine reference that will replace the default one that HTMLPageSync uses. The default one removes the HTML::Element skin each download link is wrapped in because of the use of HTML::TreeBuilder. It simply strips off the object-oriented crap it's wrapped in, and turns it into a simple string scalar.

If you want to post process the download link in some other way, then just copy and paste the code below into your Fetchwarefile, and add whatever other Perl code you may need. This requires knowledge of the Perl programming language.

    download_links_callback sub {
        my @download_urls = @_;

        my @wanted_download_urls;
        for my $link (@download_urls) {
            # Pick ones to keep.
            push @wanted_download_urls, $link;
        }

        return @wanted_download_urls;
    };
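As an illustration of such post processing, the sketch below turns each element into its href string and drops duplicate URLs. It is an assumption-laden example, not HTMLPageSync's default callback: My::FakeElement is a hypothetical stand-in for HTML::Element that only imitates attr(), so the code runs without HTML::TreeBuilder installed, and the example.com URLs are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Element; only the attr() accessor is
# imitated so that this sketch runs with core Perl alone.
package My::FakeElement {
    sub new  { my ($class, %attrs) = @_; return bless {%attrs}, $class }
    sub attr { my ($self, $name) = @_; return $self->{$name} }
}

# In a Fetchwarefile this anonymous sub would be passed to
# download_links_callback instead of being stored in a variable.
my $links_callback = sub {
    my @elements = @_;

    my (%seen, @wanted_download_urls);
    for my $element (@elements) {
        # Convert the element into a plain string URL.
        my $url = $element->attr('href');
        next unless defined $url;

        # Keep only the first occurrence of each URL.
        next if $seen{$url}++;

        push @wanted_download_urls, $url;
    }

    return @wanted_download_urls;
};

my @unique_urls = $links_callback->(
    My::FakeElement->new(href => 'http://example.com/a.jpg'),
    My::FakeElement->new(href => 'http://example.com/a.jpg'),
    My::FakeElement->new(href => 'http://example.com/b.jpg'),
);

print "$_\n" for @unique_urls;
```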

USING YOUR App::FetchwareX::HTMLPageSync FETCHWAREFILE WITH FETCHWARE

After you have created your Fetchwarefile as shown above you need to actually use the fetchware command line program to install, upgrade, and uninstall your App::FetchwareX::HTMLPageSync Fetchwarefile.

Take note of how fetchware's package management metaphor does not quite line up with what App::FetchwareX::HTMLPageSync does. Why would an HTML page mirroring script be installed, upgraded, or uninstalled? Well, HTMLPageSync simply adapts fetchware's package management metaphor to its own environment, performing the likely action whenever one of fetchware's behaviors is executed.

new

A fetchware new will cause HTMLPageSync to ask the user a bunch of questions, and help them create a new HTMLPageSync Fetchwarefile.

install

A fetchware install while using a HTMLPageSync Fetchwarefile causes fetchware to download your html_page_url, parse it, download any matching links, and then copy them to your destination_directory as you specify in your Fetchwarefile.

upgrade

A fetchware upgrade will redownload the html_page_url, parse it, and compare the corresponding list of files to the list of files already downloaded. If any new files have been added, then they will be downloaded. New versions of existing files are not supported, because no timestamp checking is currently implemented.

uninstall

A fetchware uninstall will cause fetchware to delete this fetchware package from its database, and to recursively delete everything inside your destination_directory along with that directory itself. So when you uninstall an HTMLPageSync fetchware package, ensure that you really want to, because it will delete whatever files it downloaded for you in the first place.

However, if you would like fetchware to preserve your destination_directory, you can set the boolean keep_destination_directory configuration option to true, like keep_destination_directory 'True';, to keep HTMLPageSync from deleting your destination directory.

HOW App::FetchwareX::HTMLPageSync OVERRIDES App::Fetchware

This section documents how App::FetchwareX::HTMLPageSync overrides App::Fetchware's API, and is only interesting if you're debugging App::FetchwareX::HTMLPageSync, or you're writing your own App::Fetchware extension. If not, you don't need to know these details.

App::Fetchware API Subroutines

new()

HTMLPageSync overrides new(), and implements its own Q&A wizard interface helping users create HTMLPageSync Fetchwarefiles.

new_install()

HTMLPageSync just inherits App::Fetchware's new_install(), which just asks the user if they would like Fetchware to install the already generated Fetchwarefile.

check_syntax()

check_syntax() is also overridden to check HTMLPageSync's own Fetchware-level syntax.

start() and end()

HTMLPageSync just imports start() and end() from App::Fetchware to take advantage of their ability to manage a temporary directory.

lookup()

lookup() is overridden, and downloads the html_page_url, which is the main configuration option that HTMLPageSync uses. Then lookup() parses that html_page_url, and determines what the download urls should be. If the html_treebuilder_callback and download_links_callback exist, then they are called to customize lookup()'s default behavior. See their descriptions below.

download()

download() downloads the array ref of download links that lookup() returns.

verify()

verify() is overridden to do nothing.

unarchive()

unarchive() is overridden to do nothing.

build()

build() is overridden to do nothing.

install()

install() takes its argument, which is an arrayref of the paths of the files that were downloaded to the tempdir created by start(), and copies them to the user's provided destination_directory.

end() and start()

HTMLPageSync just imports end() and start() from App::Fetchware to take advantage of their ability to manage a temporary directory.

uninstall()

uninstall() recursively deletes your destination_directory where it stores whatever links you choose to download unless of course the keep_destination_directory configuration option is set to true.

upgrade()

Determines if any looked up URLs have not been downloaded yet, and returns true if that is the case.

App::FetchwareX::HTMLPageSync's Configuration Subroutines

Because HTMLPageSync is an App::Fetchware extension, it cannot just use the same configuration subroutines that App::Fetchware uses. Instead, it must create its own configuration subroutines with App::Fetchware::CreateConfigOptions. These configuration subroutines are the configuration options that you use in the Fetchwarefile of App::Fetchware or of an App::Fetchware extension.

page_name [MANDATORY]

HTMLPageSync's equivalent to App::Fetchware's program_name. It's simply the name of the page or what you want to download on that page.

html_page_url [MANDATORY]

HTMLPageSync's equivalent to App::Fetchware's lookup_url, and is just as mandatory. This is the url of the HTML page that will be downloaded and processed.

destination_directory [MANDATORY]

This option is also mandatory, and it specifies the directory where the files that you want to download are downloaded to.

user_agent [OPTIONAL]

This option is optional, and it allows you to have HTTP::Tiny pretend to be a Web browser or perhaps a bot if you want to.

html_treebuilder_callback [OPTIONAL]

This optional option allows you to specify a Perl CODEREF that lookup() will execute instead of its default callback that just looks for images.

It receives one parameter, which is an HTML::Element for an a (anchor/link) tag.

It must return true, return 'True';, to indicate that the link should be included in the list of download links, or return false, return undef;, to indicate that the link should not be included in the list of download links.

download_links_callback [OPTIONAL]

This optional option specifies a callback that allows you to post process the list of download urls. This is needed because the results of the html_treebuilder_callback are still HTML::Element objects that need to be converted to plain string download urls. That is what the default download_links_callback does.

It receives a list of all of the download HTML::Elements that html_treebuilder_callback returned true on. It is called only once, and should return a list of string download links for download later by HTTP::Tiny in download().

keep_destination_directory [OPTIONAL]

This optional option is a boolean true or false configuration option that when true prevents HTMLPageSync from deleting your destination_directory when you run fetchware uninstall.

Its default is false, so by default HTMLPageSync will delete your files from your destination_directory unless you set this to true.

ERRORS

As with the rest of App::Fetchware, App::Fetchware::Config does not return any error codes; instead, all errors are die()'d if it's App::Fetchware::Config's error, or croak()'d if it's the caller's fault. These exceptions are simple strings, and are listed in the "DIAGNOSTICS" section below.

CAVEATS

Certain features of App::FetchwareX::HTMLPageSync require knowledge of the Perl programming language in order for you to make use of them. However, this is limited to optional callbacks that are not needed for most uses. These features are the html_treebuilder_callback and download_links_callback callbacks.

AUTHOR

David Yingling <deeelwy@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2016 by David Yingling.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.