NAME

DWH_File 0.01 - data and object persistence in deep and wide hashes

SYNOPSIS

    use Fcntl; # supplies O_RDWR and O_CREAT
    use DWH_File qw/ GDBM_File standard myLog /;
    # the use arguments set the DBM module used, the file locking scheme
    # and the log file name extension

    tie( %h, DWH_File, 'myFile', O_RDWR|O_CREAT, 0644 );

    tie( %h, DWH_File, 'myFile', O_RDWR|O_CREAT, 0644, 'TAG');

    untie( %h ); # essential!

TAG being the DWH_File ID for the file

DESCRIPTION

DWH_File is used in a manner resembling NDBM_File, DB_File etc. These DBM modules are limited to storing flat scalar values. References to data such as arrays or hashes are stored as useless strings and the data in the referenced structures will be lost.
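The loss described above is easy to reproduce with a plain DBM module. The following is a minimal sketch that uses SDBM_File (which ships with Perl) purely as an example of a flat DBM; the file name and hash keys are made up for the illustration.

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# Work in a throwaway directory so nothing is left behind.
my $dir = tempdir( CLEANUP => 1 );

tie( my %flat, 'SDBM_File', "$dir/flat", O_RDWR|O_CREAT, 0644 )
    or die "Cannot tie: $!";

# Storing a reference in a flat DBM stringifies it ...
$flat{ list } = [ qw( foo bar ) ];

# ... so what comes back is text like "ARRAY(0x...)", not the array.
my $stored = $flat{ list };
print "stored value: $stored\n";
print "still a reference? ", ( ref $stored ? "yes" : "no" ), "\n";

untie %flat;
```

The string that comes back cannot be dereferenced, and the array it once pointed to is gone as soon as the program ends.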

DWH_File uses one of the DBM modules (configurable through the parameters to use()), but extends the functionality to not only save referenced data structures but even object systems.

This is why I made it. It makes it extremely simple to achieve persistence in object-oriented Perl programs, and you can skip the cumbersome interaction with a conventional database.

See "MODELS" below for the various incantations needed to make objects persistent.

DWH_File tries to make the tied hash behave as much like a standard Perl hash as possible. Besides the capability to store nested data structures DWH_File also implements exists(), delete() and undef() functionality like that of a standard hash (as opposed to all the DBM modules).
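For reference, this is the standard hash behaviour in question, shown on an ordinary untied hash. It is only a sketch of plain Perl semantics; the keys are arbitrary.

```perl
use strict;
use warnings;

my %h = ( a => undef, b => 'bee' );

# A key can exist while its value is undefined.
my $a_exists  = exists  $h{ a } ? 1 : 0;    # key is present
my $a_defined = defined $h{ a } ? 1 : 0;    # value is undef

# delete() removes the key itself.
delete $h{ a };
my $a_exists_after_delete = exists $h{ a } ? 1 : 0;

# undef() clears the value but keeps the key.
undef $h{ b };
my $b_exists  = exists  $h{ b } ? 1 : 0;
my $b_defined = defined $h{ b } ? 1 : 0;

print "a: exists=$a_exists defined=$a_defined, ",
      "after delete exists=$a_exists_after_delete\n";
print "b after undef: exists=$b_exists defined=$b_defined\n";
```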

MULTIPLE DBM FILES

It is possible to distribute, for instance, an object system over several files. This can be practical to avoid huge single files and may also make it easier to give the data a reasonable structure. If this feature is used, the same set of files should be tied every time any content that may refer across files is altered. See "MODELS".

GARBAGE COLLECTION

DWH_File uses a garbage collection scheme similar to that of Perl itself. This means that you don't actually have to worry about freeing anything (see the circular reference caveat though). Just like Perl, DWH_File will remove entries that nothing is pointing to (and that therefore no one can ever get at). If you've got a key whose value refers to an array, for instance, that array will be swept away when you assign something else to the key, unless there's a reference to the array somewhere else in the structure. This works even across different dbm files when using multiple files.

The garbage collection housekeeping is performed at untie time, so it is mandatory to call untie. If you keep any references to the tied object, undef those before calling untie. Otherwise you'll leave the object at the mercy of global destruction and garbage won't be properly collected.
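The reason for undefining such references first is a general property of tie in Perl: untie only destroys the underlying object if nothing else still refers to it. The sketch below shows this with a hypothetical stand-in class named Counting, built on Tie::StdHash from core Perl; it is not DWH_File code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tie class; it only counts destructions.
package Counting {
    require Tie::Hash;              # provides Tie::StdHash
    our @ISA       = ('Tie::StdHash');
    our $destroyed = 0;
    sub DESTROY { $destroyed++ }
}

# tie() returns the underlying object; keeping it adds a reference.
my $obj = tie my %h, 'Counting';
$h{ a } = 1;

{
    no warnings 'untie';    # silence "inner references still exist"
    untie %h;               # object survives: $obj still points to it
}
my $after_untie = $Counting::destroyed;

undef $obj;                 # only now is the object destroyed
my $after_undef = $Counting::destroyed;

print "destroyed after untie: $after_untie\n";
print "destroyed after undef: $after_undef\n";
```

Doing it the other way round (undef the reference, then untie) lets the cleanup run at untie time, which is the order recommended above.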

MUTUAL EXCLUSION

Note: The mutual exclusion schemes discussed in this section have only been tested sporadically and they may be both buggy and stupid.

Since DWH_File was originally intended to be used in CGI programming, the file would need to be locked at write time, and DWH_File supplies two ways to do this. Both use links to the file as locks (meaning that they may not work in non-UNIX environments).

The second parameter in use() (not counting 'DWH_File') decides which method is used. The options are none, standard and fork. The standard method will just go ahead and create links, which is why all scripts using DWH_File with this option must be setuid. I found this a bit disturbing, so I devised an alternative method called fork. If this is chosen, DWH_File will fork to run the DWH_Excluder.pl script (which should reside in the same directory as DWH_File.pm if the fork option is used). Now only that script needs to be setuid, which I found somehow comforting.

It is of course possible to enforce the mutual exclusion in other ways (external to DWH_File). In that case just choose the 'none' option (default).
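If you roll your own exclusion, the general idea of using a link as a lock can be sketched as follows. This is not the code DWH_File uses; the helper names acquire_lock and release_lock and the ".lock" suffix are assumptions made for the illustration. It relies on link() failing when the target name already exists.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helpers: the lock is a hard link named "<file>.lock".
sub acquire_lock {
    my ($file) = @_;
    # link() fails if "$file.lock" already exists, so only one
    # process at a time can succeed.
    return link( $file, "$file.lock" ) ? 1 : 0;
}

sub release_lock {
    my ($file) = @_;
    return unlink("$file.lock") ? 1 : 0;
}

# Demonstration on an empty file in a throwaway directory.
my $dir  = tempdir( CLEANUP => 1 );
my $data = "$dir/myFile.dbm";
open( my $fh, '>', $data ) or die "Cannot create $data: $!";
close $fh;

my $first  = acquire_lock($data);   # lock is free
my $second = acquire_lock($data);   # lock is already held
release_lock($data);
my $third  = acquire_lock($data);   # free again after release
release_lock($data);

print "first=$first second=$second third=$third\n";
```

A real scheme would also need a retry loop and a way to clear stale locks left by crashed processes.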

LOGGING

The third parameter in use() (not counting 'DWH_File') sets an extension for log files. If it is set, any editing in hashes tied to DWH_File is logged: the information is appended to a file named after the dbm file plus the extension.

The point of this feature is to make it possible to have a local DWH_File based object system to tamper with at home and then be able to upload the log and get any changes registered at a remote host, independently of the DBM module used. The eat_log method does the updating:

    use Fcntl;
    use DWH_File;
    DWH_File->eat_log( "dat.dwh.dbm.log", "dat.dwh.dbm" );
    # first param: logfile to eat, second param: dbm file to eat it

FURTHER INFORMATION

For further information visit http://aut.dk/orqwood/dwh/ - home of the DWH_File

MODELS

A typical script using DWH_File

    use Fcntl;
    use DWH_File;
    # with no extra parameters to use() DWH_File defaults to:
    # AnyDBM_File, no locking and no logging
    tie( %h, DWH_File, 'myFile.dbm', O_RDWR|O_CREAT, 0644 );
    # ties %h to whatever filename the chosen DBM package
    # converts 'myFile.dwh.dbm' to.
    # DWH_File inserts '.dwh' before the last period in the
    # supplied name.

    # use the hash ... 

    # cleanup
    # (necessary whenever reference values have been tampered with)
    untie %h;
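The renaming rule mentioned in the comments above (insert '.dwh' before the last period) can be sketched like this. The function name dwh_name is hypothetical and this is not DWH_File's own code; what happens to names without a period is not covered here, so the sketch leaves such names unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the documented naming rule.
sub dwh_name {
    my ($name) = @_;
    # Match the last period only: one followed by no further periods.
    ( my $dwh = $name ) =~ s/\.(?=[^.]*\z)/.dwh./;
    return $dwh;    # unchanged if $name contains no period (assumption)
}

print dwh_name('myFile.dbm'), "\n";    # myFile.dwh.dbm
print dwh_name('dat.v2.dbm'), "\n";    # dat.v2.dwh.dbm
```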

A script using data in three different files

The data in one file may refer to that in another and even that reference will be persistent.

    use Fcntl;
    use DWH_File;
    tie( %a, DWH_File, 'fileA', O_RDWR|O_CREAT, 0644, 'HA' );
    tie( %b, DWH_File, 'fileB', O_RDWR|O_CREAT, 0644, 'HB' );
    tie( %c, DWH_File, 'fileC', O_RDWR|O_CREAT, 0644, 'HC' );
    # the last parameter is a name tag on the file - it must be the same
    # every time you tie to that file or DWH will complain. This
    # mechanism frees you to change the filename much as you please.

    # use the hashes ...

    # like in:
    $a{ doo } = [ qw(doo bi dee), { a => "ah", u => "uh" } ];
    $b{ bi } = $a{ doo }[ 3 ];
    # this will work

    print "$b{ bi }{ a }\n";
    # prints "ah";

    $b{ bi }{ a } = "I've changed";
    print "$a{ doo }[ 3 ]{ a }\n"; # prints "I've changed"

    # note that if - in another program - you tie %g to 'fileB'
    # without also having tied some other hash variable to 'fileA'
    # then $g{ bi } will be undefined. The actual data is in 'fileA'.
    # Moreover there will be a high risk of corrupting the data.

    # and so on and so forth ...

    # cleanup
    # (necessary whenever reference values have been tampered with)
    untie %a;
    untie %b;
    untie %c;

A persistent class

If a class contains the following, then an object $obj of that class which is referenced somewhere in a hash %h tied to DWH_File (e.g. $h{ myobj } = $obj or $h{ myobs } = [ $obj, "other", "data" ]) will still be an object of that class the next time the program runs.

    package Persis;

    BEGIN
    {
        $iamDWHcapable = "Yes";
        # DWH_File will only play with classes
        # which set this variable to "Yes"
    }

    # that's it - now put in all your own methods (don't forget a
    # constructor) and you're rolling...

In case the module containing the package - against convention - has a different name from the package (plus .pm) you'll have to tell DWH_File about it as in:

    package Aix;

    BEGIN
    {
        $iamDWHcapable = "Yes";
        $mymodulename = "Provence"; # omit the .pm as ever
    }

And if your class needs more action to get moving than just the restoring of its state (the data (attributes) in whatever datatype you blessed in your constructor), you'll have to give DWH_File a reference to some subroutine to take care of that:

    package Alice;

    BEGIN
    {
        $iamDWHcapable = "Yes";
        $mymodulename = "Wonderland"; # omit the .pm as ever
        $restoresetup = \&WhatINeedToGetBy;
    }

    sub WhatINeedToGetBy
    {
        my $self = shift;

        # Whatever ...  Maybe you want to open some files or pipes or
        # socket connections - how should I know

        # You can call the routine what you want (might be a good
        # idea to make one that the constructor can use as well)
        # just make sure that $restoresetup points to it
    }
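To make the role of these package variables more concrete, here is a rough sketch of how a loader could consult them when bringing an object back to life. It is not DWH_File's actual implementation: the revive function, the name and ready attributes and the way the state is passed in are all assumptions made for the illustration.

```perl
use strict;
use warnings;

package Alice;

our ( $iamDWHcapable, $restoresetup );

BEGIN
{
    $iamDWHcapable = "Yes";
    $restoresetup  = \&WhatINeedToGetBy;
}

sub new
{
    my ( $class, $name ) = @_;
    my $self = bless { name => $name }, $class;
    $self->WhatINeedToGetBy;    # the constructor reuses the setup
    return $self;
}

sub WhatINeedToGetBy
{
    my $self = shift;
    $self->{ ready } = 1;       # stands in for opening files, sockets ...
}

package main;

# Hypothetical loader step: re-bless saved state and run the setup.
sub revive
{
    my ( $class, $state ) = @_;
    no strict 'refs';           # package variables are looked up by name
    return $state
        unless ( ${"${class}::iamDWHcapable"} // '' ) eq 'Yes';
    my $obj   = bless $state, $class;
    my $setup = ${"${class}::restoresetup"};
    $obj->$setup() if ref $setup eq 'CODE';
    return $obj;
}

my $obj = revive( 'Alice', { name => 'Alice' } );
print ref($obj), " ready: $obj->{ ready }\n";
```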

Go to http://aut.dk/orqwood/dwh/ for some examples.

NOTES

OTHER PLATFORMS

DWH_File works on UNIX-like systems. It also works without changes on other platforms if you don't need object persistence. To make it work with persistent classes on the Macintosh you'll have to change a few lines in the TranslatePackage subroutine (see comments in the code). Similar changes are probably necessary on WinDOS (tell me which and I'll include them).

The locking methods use the UNIX ability to link more than one filename to a file. This may not be possible on other platforms. You can either run without mutual exclusion - no problem on a single user system - or you can make up your own locking scheme suitable to your platform.

COMPETITION

It appears that DWH_File does much of the same stuff that the MLDBM module from CPAN does. There are substantial differences though, which means that each module outperforms the other in certain situations. DWH_File's main attractions are (a) it only has to load the data actually accessed into memory, (b) it restores all referential identity (MLDBM sometimes makes duplicates), and (c) it has an approach to setting up dynamic state elements (like sockets, pipes etc.) of objects at load time.
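Point (b) can be illustrated without either module. The sketch below serialises each top-level value separately with Storable (a core module), which is one way shared data can end up duplicated; it is an illustration of the general effect, not a description of how MLDBM or DWH_File work internally.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Two entries share one inner hash, as in the three-file model.
my $shared = { a => 'ah' };
my %mem    = ( doo => [ $shared ], bi => $shared );

# Serialise every top-level value on its own, then read them back.
my %disk = map { $_ => freeze( $mem{ $_ } ) } keys %mem;
my %back = map { $_ => thaw( $disk{ $_ } ) } keys %disk;

# In memory the change is visible through both paths ...
$mem{ bi }{ a } = "I've changed";
my $in_memory = $mem{ doo }[ 0 ]{ a };

# ... but the restored copies are independent duplicates.
$back{ bi }{ a } = "I've changed";
my $restored = $back{ doo }[ 0 ]{ a };

print "in memory: $in_memory\n";
print "restored:  $restored\n";
```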

CAVEATS

REMEMBER UNTIE

It is very important that untie be called at the end of any script that changes data referenced by the entries in the hash. Otherwise the file will be corrupted. Also remember to remove any references to the tied object before untying.

BEWARE OF DBM

Using DWH_File with NDBM_File I found that arrays wouldn't hold more than a few hundred entries. Assignment seemed to work OK, but when global destruction came along (and the data should be flushed to disk) a segmentation error occurred. It seems to me that this must be an NDBM_File related bug. I've tried it with DB_File (under LinuxPPC) - 100000 entries no problem :-)

I haven't tested DWH_File with DBM modules other than DB_File (under LinuxPPC) and NDBM_File (in MacPerl and Linux (Pentium)).

At all times be aware of the limitations to data size imposed by the DBM module you use. See AnyDBM_File(3) for specs of the various DBM modules. Also, some DBM modules may not be complete (I had trouble with the EXISTS method not existing in NDBM_File in MacPerl).

BEWARE OF CIRCULAR REFERENCES

Your data may contain circular references, which means that the reference count is above zero even though the data is unreachable. This will defeat the DWH_File garbage collection scheme and thus may cause your file to swell with useless unreachable data.

    # %h being tied to DWH_File
    $h{ a } = [ qw( foo bar ) ];
    push @{ $h{ a } }, $h{ a };
    # the anonymous array pointed
    # to by $h{ a } now contains a
    # reference to itself
    $h{ a } = "Gone with the wind";
    # so its refcount will now
    # be 1 and it won't be garbage
    # collected

To avoid the problem, break the self reference before losing touch:

    # %h being tied to DWH_File
    $h{ a } = [ qw( foo bar ) ];
    push @{ $h{ a } }, $h{ a };
    # now break the reference
    $h{ a }[ 2 ] = '';
                              
    $h{ a } = "Gone with the wind";
    # the anonymous array will be
    # garbage collected at untie time

The problem will be addressed in a future version of DWH_File so you won't have to think so much.
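The same effect can be observed in Perl's own in-memory reference counting, which the DWH_File scheme is modelled on. The sketch below uses a hypothetical class named Tracked that merely counts how many of its objects have been freed; it does not involve DWH_File.

```perl
use strict;
use warnings;

# Hypothetical class that counts how many of its objects get freed.
package Tracked {
    our $freed = 0;
    sub DESTROY { $freed++ }
}

{
    my $loop = bless [ qw( foo bar ) ], 'Tracked';
    push @$loop, $loop;         # the array now refers to itself
}
# $loop is out of scope, but the self reference keeps the array alive.
my $freed_with_cycle = $Tracked::freed;

{
    my $loop = bless [ qw( foo bar ) ], 'Tracked';
    push @$loop, $loop;
    $loop->[ 2 ] = '';          # break the self reference first
}
# This time the refcount reached zero and the array was freed.
my $freed_after_break = $Tracked::freed;

print "freed with cycle intact: $freed_with_cycle\n";
print "freed after breaking it: $freed_after_break\n";
```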

ALWAYS USE THE SAME FILES TOGETHER

If you use a set of hashes tied to a set of files and these hashes contain references to data in each other, you must always tie the same set of files to hashes when editing the content. Otherwise the data in the files may become corrupted.

LIMITATION 1

Data structures saved to disk using DWH_File must not be tied to any other class. DWH_File needs to internally tie the data to some helper classes, and Perl does not allow data to be tied to more than one class at a time. There's a (near) workaround for this which I might implement one of these days.

LIMITATION 2

You're not allowed to assign references to constants in the DWH structure as in (%h being tied to DWH_File)

    $h{ statementref } = \"I am a donut";
    # won't wash

You can't do an exact equivalent, but you can of course say

    $r = "All men are born equal";
    $h{ statementref } = \$r;

LIMITATION 3

Autovivification doesn't always work. This may depend on the DBM module used. I haven't really investigated this problem, but it seems that the problems I have experienced using DB_File arise from quirks in either DB_File or Perl itself.

This means that if you say

    %h = ();
    $h{ a }[ 3 ]{ pie } = "Apple!";

you can't be sure that the implicit anonymous array and hash "spring into existence" like they should. You'll have to make them exist first:

    %h = ( a => [ undef, undef, undef, {} ] );
    $h{ a }[ 3 ]{ pie } = "Apple!";

Strangely though I have found that often autovivification does actually work but I can't find the pattern.

I don't plan on trying to fix this right now because it appears to be quite mysterious and I can't really do anything about it on DWH_File's side.

LIMITATION 4

DWH_File hashes store straight scalars and references (blessed or not) to scalars, hashes and arrays - in other words: data. File handles and subroutine (CODE) references are not stored.
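If you want to check a value before assigning it, a test along the following lines may help. The function name storable_by_type is hypothetical and not part of DWH_File; it only looks at the top level of the value and simply mirrors the list of types given above.

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

# Hypothetical helper: is this value of a kind listed as storable?
sub storable_by_type {
    my ($value) = @_;
    # reftype() ignores blessing and reports the underlying type.
    my $type = reftype($value);
    return 1 unless defined $type;              # plain scalar
    return ( $type eq 'SCALAR' || $type eq 'ARRAY' || $type eq 'HASH' )
        ? 1 : 0;
}

my $text = "plain";
my %checks = (
    scalar   => storable_by_type($text),
    arrayref => storable_by_type( [ 1, 2 ] ),
    object   => storable_by_type( bless {}, 'Persis' ),
    coderef  => storable_by_type( sub { 1 } ),
    handle   => storable_by_type( \*STDOUT ),
);

print "$_: $checks{ $_ }\n" for sort keys %checks;
```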

These are the only known limitations. If you encounter any others please tell me.

BUGS

Please let me know if you find any.

As the version number (0.01) indicates this is a very early beta state piece of software. Please contact me if you have any comments or suggestions - also language corrections or other comments on the documentation.

COPYRIGHT

Copyright (c) Jakob Schmidt 2000. DWH_File is free software and may be used and distributed under the same terms as Perl itself.

AUTHOR(S)

Jakob Schmidt <sumus@aut.dk>