
NAME

HTML::Persistent - Perl database aimed at storing HTML tree structures.

SYNOPSIS

  use HTML::Persistent;
  $db = HTML::Persistent->new({ dir => '/tmp/stuff' });
  $db->{flibber} = 'This is my comment';
  $db->{flobber} = 'This is not the same comment';
  $db->{Animals}[1] = 'dog';
  $db->{Animals}[2] = 'cat';
  $db->{Animals}[3] = 'bird';
  $node = $db->{special}{numbers};
  $node->[1] = 1.23456;
  $node->[2] = 2.34567;
  $node->[3] = 3.45678;
  print $node->[1] + $node->[2] + $node->[3];

  $db->sync(); # Update files on disk

DESCRIPTION

HTML::Persistent provides convenient access to persistent data with a syntax that is mostly comfortable for Perl users. It uses overload and tie to allow array and hash dereferences in arbitrary mixtures, and permits a mild language ambiguity so that the same expression can assign a value to a node as well as visit a new node.

For example, assigning a sub-node to a variable creates the necessary sub-node (if it does not already exist), but evaluating the same variable in string context reveals the data value contained in the sub-node (or undef if none exists). Evaluating in numeric context also returns undef if the node does not exist, but otherwise forces the stored value into a number (following normal Perl rules).

The database is designed to be concurrent (i.e. multiple processes can safely open the same database), but regular calls to sync() are required because locks are only released by a sync() call. How often to call sync() is a matter for the application, but it will usually be a somewhat expensive call (under the hood it uses Perl's Storable, which serialises the objects and writes at least one entire file). Some granularity factors are tunable (e.g. the largest whole file before it is split into directories and smaller files), and these may affect the optimal sync() placement. A sync() may also be seen as a transaction boundary, but the only rollback feature is throwing away the $db object and starting a new one (which is reasonably cheap to do).

The general intention is for medium to long lived server processes to call sync() while waiting for more work (e.g. waiting for a web request), and to complete whole requests atomically. The database is also intended to be faster in read-only situations (shared locks) than in read/write situations (exclusive locks).
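That intended pattern might look like the following sketch. The request loop and wait_for_request() are illustrative stand-ins, not part of the module's API; only new(), the hash access, and sync() come from this documentation.

```perl
use strict;
use warnings;
use HTML::Persistent;

my $db = HTML::Persistent->new({ dir => '/var/lib/myapp/db' });  # example path

while ( my $request = wait_for_request() ) {    # hypothetical event source
    # Atomically complete the whole request before syncing.
    $db->{sessions}{ $request->{id} } = $request->{payload};

    if ( $request->{failed} ) {
        # The only rollback mechanism: discard the handle and reopen.
        $db = HTML::Persistent->new({ dir => '/var/lib/myapp/db' });
        next;
    }

    # Idle point: flush changes to disk and release locks.
    $db->sync();
}
```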

HASH NODES

Hash nodes have a key (which can be any string) and a name (which is forced to be exactly the same as the key at all times). The hash node can also contain data (which can be any arbitrary perl scalar, so long as "store" can handle it).

Hash keys

To find the keys in a hash node, use either the normal keys function, or a special function hash_keys():

  $node->hash_keys()

or

  keys( %$node )

ARRAY NODES

Array nodes have a key (which must be a non-negative integer) and a name (which is arbitrarily settable). The entire array will be created up to the maximum key (as usual for Perl arrays), so it is typically sensible not to use excessively large integers without good reason. The Perl trick of reading arrays backwards using negative indices is not supported (but may be in future).

Arrays as scalars (count items in array)

A typical Perl way to count the items in an array reference is the scalar() function; that works here too, along with an equivalent method:

  $node->array_scalar()

or

  scalar( @$node )

DATA VALUES

Any data may be stored at a node, but it must be a scalar (using a ref for complex data is no problem). The database does not attempt to look inside a data value; it is treated as a single black-box item even if it is a complex structure. Objects may be stored as well (since most objects are a hash ref or array ref), but do not try to store nodes of the tree within other nodes; that is certainly not supported (although it might coincidentally work sometimes). Objects from unrelated packages should work fine, but note that under the hood the method of storage is Storable, which has some limitations, so be sure to read the "WARNING" section of the Storable documentation.
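For example, a nested structure travels as one opaque scalar. This sketch uses the set_val()/val() calls described below for symlinks, together with an illustrative /tmp path:

```perl
use strict;
use warnings;
use HTML::Persistent;

my $db = HTML::Persistent->new({ dir => '/tmp/stuff' });

# The whole hash ref is stored as a single black-box item; the
# database never indexes inside it.
$db->{config}->set_val({
    colours => [ 'red', 'green' ],
    depth   => 3,
});

# Reading it back returns the structure in one piece.
my $config = $db->{config}->val();
print $config->{depth}, "\n";

$db->sync();
```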

A node contains lots of ugly internal links that make it a very bad idea to attempt to store a node inside the tree. However, the concept of storing a node inside the tree is sometimes attractive, so symlinks are provided instead. A symlink extracts the internal path of a node and stores it in a safe object that can easily be put into the main tree.

Any attempt to store a node inside the tree triggers an automatic conversion to a symlink before storage (preventing real nodes from ending up in the tree). Reading a symlink back out of the tree silently converts it into a node again, so the database user should never need to handle symlinks directly. Note that using a symlink is always safe, even if the real data has been deleted or perhaps never existed, but attempting to read from such a node yields undef.

Symlinks do NOT automatically activate when a path goes past a place where the symlink exists (unlike unix file system semantics). They only activate when pulled out as a node value and used as an actual node. This example should clarify things:

   $node1 = $db->{some}{path};
   $node2 = $db->{a}{different}{path};
   $node1->set_val( $node2 ); # Silently converts to a symlink
   $node3 = $db->{some}{path}{past}{link}; # Does NOT follow symlink
   $node4 = $db->{some}{path}->val(); # Equivalent to $node2 but not same object

What this means is that if you want to follow a symlink, you must either know to expect one in a certain place, or check each step of the way to discover one. The reason for this is efficiency: creating a node does not normally visit the database at all; database opening is lazy, triggered only by the $node->val() or $node->set_val() functions. That means when a node is created we have no idea whether its path crossed any symlinks. Also, many applications keep links only in well-known places, or do not use symlinks at all.

Possibly, a slower path traversal mechanism that does follow symlinks should be supplied as standard; this is not currently available. It would need to handle a symlink pointing to another symlink, possibly along a chain, including circular chains that must be detected.
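Such chain following with cycle detection can be sketched generically. Here a plain hash of path-to-target mappings stands in for the real database, and follow_chain() is an illustrative helper, not part of the module:

```perl
use strict;
use warnings;

# Follow a chain of links to its end, dying on a circular chain.
# %$links maps a path to the path its symlink points at.
sub follow_chain {
    my ($links, $start) = @_;
    my %seen;
    my $path = $start;
    while ( exists $links->{$path} ) {
        die "circular symlink chain at $path" if $seen{$path}++;
        $path = $links->{$path};
    }
    return $path;    # first path that is not itself a link
}

my %links = (
    '/a' => '/b',
    '/b' => '/c',
);
print follow_chain( \%links, '/a' ), "\n";    # /c
```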

NODE PATH

It is often useful to convert a full node path (right back to the root) to a string, or to go the other way: take a single string and traverse all the way out from the root. There is already a SHA hash used to munge a path, but this is not typically exposed to the user (and in general the calculation is lazy, because such a hash is only occasionally required).
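The module keeps its path handling internal, but one hypothetical serialisation (not the module's actual format) would escape the separator so any key survives a round trip:

```perl
use strict;
use warnings;

# Hypothetical path serialisation: URI-style escaping keeps '/' and
# '%' inside keys unambiguous.
sub path_to_string {
    my @steps = @_;
    return join '/', map {
        my $s = $_;
        $s =~ s{([%/])}{sprintf '%%%02X', ord $1}ge;
        $s;
    } @steps;
}

sub string_to_path {
    my ($string) = @_;
    return map {
        my $s = $_;
        $s =~ s{%([0-9A-F]{2})}{chr hex $1}ge;
        $s;
    } split m{/}, $string, -1;
}

my $string = path_to_string( 'special', 'num/bers' );
print $string, "\n";    # special/num%2Fbers
```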

TODO

Items that don't work yet, but really should.

Read only mode

Open a database with a flag set such that:

 * files are always opened with strictly read permission;
 * locks on files are always read locks;
 * any attempt to set data generates a croak() error.

This is useful in situations where it is known that no writing is required for some particular application (e.g. serving HTML pages from an existing datastore). The storage system is intended to be more efficient for many rapid reads than for regular writing (the typical web delivery scenario).

Shallow write mode

This would be similar to "read only mode" except that attempting to write to the database would not cause an error; instead the written data is kept only in memory, without writing back to the filesystem. This is useful where an HTML page might require some rewriting (e.g. a template) before output to the end user, but those rewritten results are temporary and should not go back into the database for the long term.

Typical web delivery workflow would look like:

 * Collect various data model items (from this database or anywhere else)
 * Load template of web page (open in shallow write mode)
 * Inject any items into template to produce dynamic page
 * Export page to end user
 * Close database and throw away modified template

Smarter write locks

At the moment the design is "lock everything": either a full read lock or a full write lock. The effect is that only one process can be writing at any given time, and that process must queue on the lock until all the readers are done.

This is the safe but slow design; smarter designs can do better. There are several ways to handle this, including putting small writes into a logfile that can safely be appended without locking (provided atomic append can be guaranteed). Another possibility is writing versioned files (with some suffix) and pointing a symlink at the current one (a symlink can be moved in an atomic fashion). All the usual problems of ACID (Atomicity, Consistency, Isolation, Durability) come into play.
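The versioned-file idea can be sketched with plain Perl builtins. On POSIX filesystems, rename() over an existing name is atomic, so readers always see either the old version or the new one, never a half-written file (the file names here are illustrative):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );

# Write a new version beside the live one, then retarget the "data"
# symlink by renaming a temporary symlink over it in one atomic step.
sub publish {
    my ($dir, $version, $content) = @_;
    my $file = "$dir/data.$version";
    open my $fh, '>', $file or die "open: $!";
    print {$fh} $content;
    close $fh or die "close: $!";

    my $tmp = "$dir/data.tmp.$$";
    symlink "data.$version", $tmp or die "symlink: $!";
    rename $tmp, "$dir/data" or die "rename: $!";
}

publish( $dir, 1, "first\n" );
publish( $dir, 2, "second\n" );

open my $fh, '<', "$dir/data" or die "open: $!";
print scalar <$fh>;    # second
```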

SQL back end

MySQL and PostgreSQL have become the leading open source data storage and retrieval engines, so a back end coupling directly into an SQL database would be attractive. However, at first guess it would be very slow, given that tree structures do not map into SQL particularly neatly.

TDB back end

  https://tdb.samba.org/

This might be easier to implement than SQL, and possibly faster too. TDB already provides native transactions and internal locking, which might improve performance. A TDB_File module also exists (without transaction support).

SEE ALSO

perltie, Storable

AUTHOR

Telford Tendys, <ttndy@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2013 by Telford Tendys

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.