David Caldwell
and 1 contributors

NAME

File::HashCache - Process and cache files based on the hash of their contents.

SYNOPSIS

  use File::HashCache;

  my $hc = File::HashCache->new(cache_dir => 'js',
                                process_js => \&JavaScript::Minifier::XS::minify,
                                process_css => \&CSS::Minifier::XS::minify);

  my $hashed_minified_path = $hc->hash("my_javascript_file.js");
  # returns "my_javascript_file-7f4539486f2f6e65ef02fe9f98e68944.js"

  # If you are using Template::Toolkit you may want something like this:
  $template->process('template.tt2', {
      script => sub {
          my $path = $hc->hash($_[0]);
          "<script src=\"js/$path\" type=\"text/javascript\"></script>\n";
      } } ) || die $template->error();

  # And in your template.tt2 file:
  #    [% script("myscript.js") %]
  # which will get replaced with something like:
  #    <script src="js/myscript-708b88f899939c4adedc271d9ab9ee66.js"
  #            type="text/javascript"></script>

DESCRIPTION

File::HashCache is an automatic versioning scheme for arbitrary files based on the hash of the contents of the files themselves. It aims to be painless for the developer and very fast.

File::HashCache solves the problem in web development where you update some Javascript, CSS, or image files on the server and the end user ends up with mismatched versions because of browser or proxy caching issues. By referencing your external files by their MD5 hash, the browser is unable to to give the end user mismatched versions no matter what the caching policy is.

HOW TO USE IT

The best place to use File::HashCache is in your HTML template code. While generating a page to serve to the user, call the hash() method for each Javascript, CSS, or image file you are including in your page. The hash() method will return the name of the newly hashed file. You should use this name in the rendered contents of the page.

This means that when the browser gets the page you serve, it will have references to specific versions of the files.

METHODS

new(%options)

Initializes a new cache object. Available options and their defaults:

cache_dir => '.hashcache'

Where to put the resulting minified files.

process_* => sub { @_[0] }

The process_* parameters are subroutine references and will be called for each file that's passed in based on the file extension. For instance, to process PNGs you would use 'process_png', javascript files would use 'process_js'. This subroutine will be called with the entire contents of the file and it should return a modified version to cache.

If the processing needs to add any dependencies so that File::HashCache can know when it needs to re-process it (other than the original file, which it automatically kept track of), the processing subroutine should return a list with the new contents as the first item and the extra dependencies as the rest of the list.

If the hash() method is called with a filename whose extension does not have a corresponding process_* subroutine, then the contents of the file are copied to the hash directory with no modifications.

cache_file => "$cache_dir/cache.json"

Where to put the cache control file.

hash($path_to_file, ...)

This method...

  1. Reads the file(s) into memory and concatenates them. This allows you to group CSS or Javascript sources together in one big bundle. If you pass in multiple files to be concatentated they must all have the same extension (otherwise processing the resulting concatenated data would be problematic).

  2. Calls the appropriate processing function to process the data.

  3. Calculates the MD5 hash of the processed data.

  4. Saves the data to a cache directory where it is named based on its hash value which makes the name globally unique (it also keeps it's original name as a prefix so debugging is sane).

  5. Keeps track of the original file names, the minified file's globally unique name, and the dependencies used to build the image. This is stored in a hash table and also saved to the disk for future runs.

  6. Returns the name of the minified file that was stored in step 4. This name does not include the cache directory path because its physical file system path does not necessarily relate to its virtual server path.

There's actually a step 0 in there too: If the original file name is found in the hash table then it quickly stats its saved dependencies to see if they are newer than the saved, already processed, file. If the processed file is up to date then steps 1 through 5 are skipped.

FURTHER DISCUSSION ABOUT THIS TECHNIQUE

It keeps the files you serve in sync

When the user refreshes the page they will either get the page from their browser cache or they will get it from our site. No matter where it came from the hashed Javascript, CSS, and image files it references are now uniquely named so that it is impossible for the files to be out of date from each other.

That is, if you get the old HTML file you will reference all the old named files and everything will be mutually consistent (even though it is out of date). If you get the new HTML file it guarantees you will have to fetch the latest files because the new HTML only references the new hashed names that aren't going to be in your browser cache.

It's fast.

Everything is cached so it only does the minification and hash calculations once per file. More importantly the cached dir can be statically served by the web server so it's exactly as fast as it would be if you served the files without any preprocessing. All this technique adds is a couple filesystem stats per page load, which isn't much (Linux can do something like a million stats per second).

It's automatic.

If you hook in through Template::Toolkit then there's no script to remember to run when you update the site. When the template generates the HTML, the File::HashCache code lazily takes care of rebuilding any files that may have gone out of date.

It's stateless.

It doesn't rely on incrementing numbers ("js/v10/script.js" or even "js/script-v10.js"). We considered this approach but decided it was actually harder to implement and had no advantages over the way we chose to do it. This may have been colored by our choice of version control systems (we love the current wave of DVCSes) where monotonically increasing version numbers have no meaning.

It allows aggressive caching.

Since the files are named by their contents' hash, you can set the cache time on your web server to be practically infinite.

It's very simple to understand.

It took less than a page of Perl code to implement the whole thing and it worked the first time with no bugs. I believe it's taken me longer to write this than it took to write the code (granted I'd been thinking about it for a long time before I started coding).

No files are deleted.

The old files are not automatically deleted (why bother, they are tiny) so people with extremely old HTML files will not have inconsistent pages when they reload. However:

The cache directory is volatile.

It's written so we can delete the entire cache dir at any point and it will just recreate what it needs to on the next request. This means there's no extra setup to do in your app.

You get a bit of history.

Do a quick ls -lrt of the directory and you can see which scripts have been updated recently and in what order they got built.

SEE ALSO

This code was adapted from the code we wrote for our site http://greenfelt.net/. Here is our original blog post talking about the technique: http://blog.greenfelt.net/2009/09/01/caching-javascript-safely/

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Copyright (C) 2009-2013 David Caldwell and Jim Radford.

AUTHOR

  • David Caldwell <david@porkrind.org>

  • Jim Radford <radford@blackbean.org>