NAME

Weather::GHCN::CacheURI - URI page fetch with file-based caching

VERSION

version v0.0.004

SYNOPSIS

    use Weather::GHCN::CacheURI;

    # put files cached by fetch() in $cachedir and refresh if not changed this year
    my $cache_uri = Weather::GHCN::CacheURI->new($cachedir, 'yearly');

    $cache_uri->clean_cache;    # empty the cache

    # this will cause fetch to do a network access
    my ($from_cache, $content) = $cache_uri->fetch($uri);

    # depending on the refresh option, this will either fetch the content
    # from the cache, or get a fresher copy from the network
    ($from_cache, $content) = $cache_uri->fetch($uri);

    # fetch calls these to access the cached file according to the
    # refresh rule and the state of the cached file and the web page

    $content = $cache_uri->load($uri);
    $cache_uri->store($uri, $content);

DESCRIPTION

This cache module enables callers to fetch web pages and store the content on the filesystem so that it can be retrieved subsequently without a network access.

Unlike caching performed by URI::Fetch or LWP, no ETag, Last-Modified or other metadata is stored with the content. Such metadata can be an obstacle to platform portability. Essentially, only the UTF-8 page content is stored. That should be neutral enough that the cache file can be used on another platform. This is a benefit to unit testing, because tests can be constructed to fetch pages, and the cached pages can be packaged with the tests. This allows the tests to run faster, and without network access.

The approach is simple, and geared towards accessing and caching the content of the NOAA GHCN weather repository. The files in that repository are simple ASCII files with uncomplicated names. The caching algorithm simply strips off the URI path and stores the file using the filename found in the repository; e.g. 'ghcnd-stations.txt' or 'CA006105887.dly'. All files are kept in the cache directory, since all filenames are expected to be unique.
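The filename mapping described above can be sketched as follows. This is a minimal illustration using only core modules, not the module's actual code; the helper name cache_path_for and the example.org URI are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of Weather::GHCN::CacheURI): map a URI
# to a file path in the cache directory by keeping only the last path
# segment of the URI.
sub cache_path_for {
    my ($cachedir, $uri) = @_;

    # Drop any query string or fragment before looking for the filename.
    (my $path = $uri) =~ s/[?#].*\z//s;

    # Keep only the text after the final slash.
    my ($name) = $path =~ m{([^/]+)\z}
        or die "no filename in URI: $uri\n";

    return File::Spec->catfile($cachedir, $name);
}

# Placeholder URI for illustration only.
my $file = cache_path_for('cache', 'https://example.org/ghcn/all/CA006105887.dly');
print "$file\n";
```

Because the directory part of the URI is discarded, two URIs that end in the same filename would share one cache file, which is why the approach relies on the repository's filenames being unique.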

FIELDS

cachedir

Returns the cache location defined when the object was instantiated.

refresh

Returns the refresh option that was defined when the object was instantiated. See new().

METHODS

new ($cachedir, $refresh)

New instances of this class must be given a location for the cache files upon creation ($cachedir). This directory must exist or new() will fail. Similarly, $refresh must be a valid value, one of:

refresh 'yearly'

The origin HTTP server is contacted and the page refreshed if the cached file has not been changed within the current year. The rationale for this behavior, and for making it the default, is that the GHCN data for the current year is always incomplete. Incomplete data skews statistical analysis, so the current year should normally be truncated. If the user needs the data for the current year, they should use a refresh value of 'always' or a number.

refresh 'never'

The origin HTTP server is never contacted, regardless of whether the page is in the cache. If the page is missing from the cache, the fetch method will return undef. If the page is in the cache, that page will be returned, no matter how old it is.

refresh 'always'

If a page is in the cache, the origin HTTP server is always checked for a fresher copy.

refresh <number>

The origin HTTP server is not contacted if the page is in cache and the cached page was inserted within the last <number> days. Otherwise the server is checked for a fresher page.
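The four refresh rules above can be read as a single decision: given the refresh option and the age of the cached file, should the origin server be contacted? The sketch below is one possible reading of those rules, not the module's implementation. The function name should_contact_server is hypothetical, and using the file's modification time to stand in for "changed" and "inserted" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether the origin server should be
# contacted. $mtime is the cached file's modification time in epoch
# seconds, or undef if the file is not in the cache. $now defaults to
# the current time and exists mainly to make the function testable.
sub should_contact_server {
    my ($refresh, $mtime, $now) = @_;
    $now //= time;

    return 0 if $refresh eq 'never';     # never touch the network
    return 1 if !defined $mtime;         # nothing cached yet
    return 1 if $refresh eq 'always';    # always check for a fresher copy

    if ($refresh eq 'yearly') {
        # Refresh only if the file was last changed before this year.
        my $file_year = (localtime($mtime))[5];
        my $this_year = (localtime($now))[5];
        return $file_year < $this_year ? 1 : 0;
    }

    if ($refresh =~ /\A\d+\z/) {
        # Refresh only if the file is older than $refresh days.
        my $age_days = ($now - $mtime) / 86_400;
        return $age_days > $refresh ? 1 : 0;
    }

    die "invalid refresh option: $refresh\n";
}
```

In this sketch, 'never' is checked first so that a missing page never triggers a network access, which matches the description of fetch returning undef in that case.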

clean_cache

Removes all the files in the cache, but leaves the cache directory. Returns a list of errors for any files that couldn't be removed.

clean_data_cache

Removes all the daily weather data files (*.dly) from the cache, but leaves the cache directory. Returns a list of errors for any files that couldn't be removed.

clean_station_cache

Removes the station list and station inventory files (ghcnd-*.txt) from the cache, but leaves the cache directory. Returns a list of errors for any files that couldn't be removed.
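The three clean methods share one pattern: delete the matching files, leave the directory in place, and return a list of errors. A minimal sketch of that pattern using core Perl follows. The helper name remove_matching and the regular expressions are assumptions for illustration, not the module's code.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: remove every plain file in $dir whose name
# matches $regex, leaving the directory itself in place. Returns a
# list of error strings, one per file that could not be removed.
sub remove_matching {
    my ($dir, $regex) = @_;

    opendir my $dh, $dir or die "opendir $dir: $!";
    my @names = grep { $_ =~ $regex } readdir $dh;
    closedir $dh;

    my @errors;
    for my $name (@names) {
        my $file = File::Spec->catfile($dir, $name);
        next unless -f $file;    # skip subdirectories and the like
        unlink $file or push @errors, "$file: $!";
    }
    return @errors;
}
```

Under these assumptions, the data files would be selected with a pattern such as qr/\.dly\z/ and the station files with qr/\Aghcnd-.*\.txt\z/.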

fetch ($uri, $refresh="yearly")

Fetch the web page given by the URI $uri, returning its content and caching it. If a cached entry for it exists, and is current according to the refresh option, then the cached entry is returned.

load ($uri)

Loads a previously fetched and stored $uri from the file cache and returns the content. Uses Path::Tiny->slurp_utf8, which locks the file during the operation and uses a binmode of :unix:encoding(UTF-8) for platform portability of the files.

store ($uri, $content)

Stores content obtained from a URI using fetch() into a file in the cache. The filename is derived from the tail end of the URI.

Uses Path::Tiny->spew_utf8, which writes data to the file atomically. The file is written to a temporary file in the cache directory, then renamed over the original.
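The write-then-rename technique mentioned above can be illustrated with core modules alone. This is a sketch of the general technique, not Path::Tiny's implementation. The helper name atomic_write_utf8 is hypothetical, and the :raw:encoding(UTF-8) layer is chosen here as a portable stand-in for the layers Path::Tiny uses.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Spec;

# Hypothetical helper: write $content (a string of characters) to
# "$dir/$name" as UTF-8 by writing a temporary file and renaming it
# over the target.
sub atomic_write_utf8 {
    my ($dir, $name, $content) = @_;
    my $target = File::Spec->catfile($dir, $name);

    # Create the temporary file in the same directory so that the
    # rename stays within one filesystem.
    my ($fh, $tmp) = tempfile("$name.XXXXXX", DIR => $dir);

    # :raw disables newline translation, so the bytes written are the
    # same on every platform.
    binmode $fh, ':raw:encoding(UTF-8)' or die "binmode $tmp: $!";
    print {$fh} $content or die "write $tmp: $!";
    close $fh or die "close $tmp: $!";

    rename $tmp, $target or die "rename $tmp -> $target: $!";
    return $target;
}
```

Readers of the target path see either the old file or the complete new one, never a partially written file, provided the platform's rename replaces the target in a single step.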

A binmode of :unix:encoding(UTF-8) (i.e. PerlIO::utf8_strict) is used, unless Unicode::UTF8 0.58+ is installed. In that case, the content will be encoded by Unicode::UTF8 and written using spew_raw.

The idea is to store data in a platform-neutral fashion, so cached files can be used for unit testing on multiple platforms.

remove ($uri)

Remove the cache file associated with this URI.

AUTHOR

Gary Puckering (jgpuckering@rogers.com)

LICENSE AND COPYRIGHT

Copyright 2022, Gary Puckering