NAME

Tie::File::Indexed - fast tied array access to indexed data files

SYNOPSIS

##========================================================================
## PRELIMINARIES

use Tie::File::Indexed;

##========================================================================
## Tied Array access

##-- tie an array (uses files "data", "data.idx", and "data.hdr")
my $filename = "data";
tie(my @data, 'Tie::File::Indexed', $filename, %options) or die ...

##-- add some data
$data[42] = 'blah';      # set an item
$data[42] = 'blip';      # overwrite an item (really appends a new record)
$data[24] = 'bonk';      # out-of-order storage
print $data[42];         # retrieve & print a stored value

##-- tweak array size
$n_items = @data;        # get number of stored records
$#data -= 2;             # chop two records off the end

#... push(), pop(), shift(), unshift(), and splice() should do What You Expect

##-- get the underlying object (needed for the method calls below)
my $tied = tied(@data);

##-- file operations
$tied->unlink();               # unlink underlying files (tied access won't work after this!)
$tied->rename($newname);       # rename underlying files using CORE::rename()
$tied->move($newname);         # move underlying files using File::Copy::move()
$copy = $tied->copy($newname); # copy underlying files using File::Copy::copy()

##-- advisory locking
$tied->flock();                # get an advisory lock on $filename
$tied->funlock();              # release our lock on $filename

##-- buffering and consolidation
$tied->flush();          # flush underlying filehandles
$tied->reopen();         # close and re-open underlying filehandles
$tied->consolidate();    # remove gaps and stale values

##-- all done
undef $tied;
untie(@data);

DESCRIPTION

The Tie::File::Indexed class provides fast tied array access to raw data-files. It uses an auxiliary packed index-file to store and retrieve the offsets and lengths of the corresponding raw data strings, and an additional header-file to store administrative data, resulting in a constant and very small memory footprint. Random-access storage and retrieval should both be very fast, and even pop(), shift() and splice() operations on large arrays should be tolerably efficient, since these only need to modify the (comparatively small) index-file. No disk-space optimization is performed by default, and frequent overwrites will cause the data-file to grow monotonically: see "consolidate" for a workaround.

The Tie::File::Indexed distribution also comes with several pre-defined subclasses for transparent encoding/decoding of UTF8-encoded strings, and complex data structures encoded via the JSON or Storable modules. See "SUBCLASSES" for details.

Constructors etc.

new
$tied = CLASS->new(%opts);
$tied = CLASS->new($file,%opts);
$tied = tie(@array, CLASS, $file, %opts);

Creates and returns a new Tie::File::Indexed object; the third form also ties it to the Perl array @array. Currently accepted %opts:

file   => $file,    ##-- file basename; uses files "${file}", "${file}.idx", "${file}.hdr"
mode   => $mode,    ##-- open mode (fcntl flags or perl-style; default='rwa')
perms  => $perms,   ##-- default: 0666 & ~umask
pack_o => $pack_o,  ##-- file offset pack template (default='J')
pack_l => $pack_l,  ##-- string-length pack template (default='J')
bsize  => $bsize,   ##-- buffer size in bytes for index batch-operations (default=2**21 = 2MB)

When opening an existing file, administrative header-data is read from the header-file $file.hdr, which is written on close() if opened in write-mode. Raw data records are read/written from/to the data-file $file, and their offsets and lengths are stored as packed integers in the index-file $file.idx. The options pack_o and pack_l control the pack-templates to use for the index-file; the default pack templates use the 'J' pack format to pack offsets and lengths as Perl internal unsigned integer values. See the entry for "pack" in perlfunc for details.
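
As a rough illustration of what a single index record looks like, the following core-Perl sketch packs and unpacks an (offset, length) pair. It does not use Tie::File::Indexed itself, and the helper names pack_record() and unpack_record() are hypothetical. Because 'J' is Perl's native unsigned integer, its width and byte order depend on the local perl build; choosing a fixed-width template such as 'N' for pack_o and pack_l is shown here only as one possible alternative, not as a recommendation taken from the module.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helpers: pack/unpack one index record as (offset, length).
sub pack_record {
    my ($pack_o, $pack_l, $off, $len) = @_;
    return pack($pack_o . $pack_l, $off, $len);
}

sub unpack_record {
    my ($pack_o, $pack_l, $record) = @_;
    return unpack($pack_o . $pack_l, $record);
}

# Default templates: 'J' is a native unsigned integer (usually 8 bytes on 64-bit perls).
my $rec_j = pack_record('J', 'J', 1024, 42);
printf "J/J record: %d bytes (uvsize=%d)\n", length($rec_j), $Config{uvsize};

# Fixed-width alternative: 'N' is a 32-bit big-endian unsigned integer.
my $rec_n = pack_record('N', 'N', 1024, 42);
printf "N/N record: %d bytes\n", length($rec_n);

my ($off, $len) = unpack_record('N', 'N', $rec_n);
print "offset=$off length=$len\n";
```

Note that a 32-bit template such as 'N' limits offsets to 4GB, so it would only suit small data-files.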

defaults
%defaults = CLASS_OR_OBJECT->defaults()

Default attributes for constructor; can be overridden by subclasses.

DESTROY
undef = $tied->DESTROY();

Destructor implicitly calls close().

Subclass API: Data I/O

writeData
$bool = $tied->writeData($data);

Write item $data to $tied->{datfh} at its current position. After writing, $tied->{datfh} should be positioned to the first byte following the written item. The object is assumed to be opened in write-mode. The default implementation just writes $data as a byte-string (undef is written as the empty string). Can be overridden by subclasses to perform transparent encoding of complex data.

readData
$data_or_undef = $tied->readData($length);

Read item data of length $length from $tied->{datfh} at its current position. Can be overridden by subclasses to perform transparent decoding of complex data.
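
The following sketch shows the general shape a writeData()/readData() override might take. The class name My::HexRecords and the hex encoding are purely illustrative assumptions; to keep the example runnable without the module installed, the class does not actually inherit from Tie::File::Indexed and is exercised with a stand-in object holding only a datfh key. How the real module derives record lengths is also an assumption here (the demo uses tell() before and after the write).

```perl
use strict;
use warnings;

package My::HexRecords;   # hypothetical class name
# In real use this would inherit from the module (assumed to be installed):
#   use parent 'Tie::File::Indexed';

# Store each item hex-encoded; undef is written as the empty string.
sub writeData {
    my ($tied, $data) = @_;
    my $encoded = defined($data) ? unpack('H*', $data) : '';
    return print { $tied->{datfh} } $encoded;
}

# Read $length bytes at the current position and decode them.
sub readData {
    my ($tied, $length) = @_;
    my $n = read($tied->{datfh}, my $buf, $length);
    return undef unless defined($n) && $n == $length;
    return pack('H*', $buf);
}

package main;

# Stand-alone demonstration with a stand-in object and an anonymous temp file.
open(my $fh, '+>', undef) or die "cannot open temp file: $!";
my $obj = bless { datfh => $fh }, 'My::HexRecords';

my $start = tell($fh);
$obj->writeData("blip") or die "write failed: $!";
my $length = tell($fh) - $start;     # assumed: the caller records this length in the index

seek($fh, $start, 0) or die "seek failed: $!";
print $obj->readData($length), "\n";
```

The key point is that the two methods must be inverses of each other, and that readData() receives the encoded length, not the length of the decoded value.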

Subclass API: Index I/O

readIndex
($off,$len) = $tied->readIndex($index);
($off,$len) = $tied->readIndex(undef);

Reads an index-record from $tied->{idxfh}. If $index is undef, reads from the current position of $tied->{idxfh}; otherwise reads the index-record for the item at logical index $index, which is assumed to exist in the array. Returns the offset and length of the item data in $tied->{datfh}, or the empty list on error.

writeIndex
$tied_or_undef = $tied->writeIndex($index,$off,$len);
$tied_or_undef = $tied->writeIndex(undef, $off,$len);

Writes the index-record for a logical item to $tied->{idxfh}. If $index is undef, writes at the current position of $tied->{idxfh}; otherwise writes a record for the logical index $index, creating one if it doesn't already exist. Returns the tied object on success, undef on error.
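
To make the index layout more concrete, here is a minimal stand-alone sketch of fixed-width index records addressed by logical index. It is not the module's implementation: the functions read_index() and write_index() are hypothetical stand-ins operating on a plain filehandle, and the 'JJ' template simply mirrors the documented defaults.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $TEMPLATE = 'JJ';                             # offset + length, as in the default templates
my $RECLEN   = length(pack($TEMPLATE, 0, 0));    # bytes per index record

# Write the record for logical index $i, or at the current position if $i is undef.
sub write_index {
    my ($fh, $i, $off, $len) = @_;
    if (defined $i) { seek($fh, $i * $RECLEN, SEEK_SET) or return undef; }
    return print {$fh} pack($TEMPLATE, $off, $len);
}

# Read the record for logical index $i; returns (offset, length) or the empty list.
sub read_index {
    my ($fh, $i) = @_;
    if (defined $i) { seek($fh, $i * $RECLEN, SEEK_SET) or return; }
    my $n = read($fh, my $buf, $RECLEN);
    return unless defined($n) && $n == $RECLEN;
    return unpack($TEMPLATE, $buf);
}

# Demonstration on an anonymous temp file.
open(my $idxfh, '+>', undef) or die "cannot open temp file: $!";
binmode($idxfh);
write_index($idxfh, 2, 100, 7) or die "write failed: $!";
my ($off, $len) = read_index($idxfh, 2);
print "index 2 -> offset=$off length=$len\n";
```

Because every record has the same size, locating a record is a single multiplication and seek, which is what keeps random access cheap regardless of array size.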

shiftIndex
$tied_or_undef = $tied->shiftIndex($start,$n,$shift);

Moves $n index records starting from $start by $shift positions (may be negative). Operates directly on $tied->{idxfh}. Doesn't change unaffected values. Used by SPLICE() method.
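
The effect of such a block move can be modelled on an in-memory string of packed records. The sketch below is only an illustration under stated assumptions: shift_records() is a hypothetical helper, records are packed with 'N' for readability, and the real method works on the index filehandle (presumably in chunks of bsize bytes) instead of a string.

```perl
use strict;
use warnings;

# Move $n records starting at record $start by $shift positions (may be negative).
# Records outside the destination range keep their old values.
sub shift_records {
    my ($bufref, $start, $n, $shift, $reclen) = @_;
    my $chunk = substr($$bufref, $start * $reclen, $n * $reclen);
    my $dst   = ($start + $shift) * $reclen;
    die "cannot shift before the first record" if $dst < 0;

    # Grow the buffer with zero bytes if the destination extends past the end.
    my $need = $dst + length($chunk);
    $$bufref .= "\0" x ($need - length($$bufref)) if $need > length($$bufref);

    substr($$bufref, $dst, length($chunk)) = $chunk;
    return 1;
}

my $RECLEN = 4;                                   # one 'N' value per record
my $idx    = pack('N*', 10, 20, 30, 40);
shift_records(\$idx, 1, 3, 2, $RECLEN);           # make room for 2 records at position 1
print join(' ', unpack('N*', $idx)), "\n";        # prints "10 20 30 20 30 40"
```

After the move, positions 1 and 2 still hold their old values; a splice-like caller would then overwrite them with the newly inserted records.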

Object API: header

headerKeys
@keys = $tied->headerKeys();

Keys to save as header.

headerData
\%header = $tied->headerData();

Data to save as header.

loadHeader
$tied_or_undef = $tied->loadHeader();
$tied_or_undef = $tied->loadHeader($headerFile,%opts);

Loads header from $headerFile, by default "$tied->{file}.hdr".

saveHeader
$tied_or_undef = $tied->saveHeader();
$tied_or_undef = $tied->saveHeader($headerFile);

Saves header data to $headerFile, by default "$tied->{file}.hdr".

Object API: open/close

open
$tied_or_undef = $tied->open($file,$mode);
$tied_or_undef = $tied->open($file);
$tied_or_undef = $tied->open();

Opens underlying file(s) for use.

close
$tied_or_undef = $tied->close();

Closes any opened files, and writes the header-file if the object was opened in write-mode.

opened
$bool = $tied->opened();

Returns true iff object is opened.

reopen
$bool = $tied->reopen();

Closes and re-opens underlying filehandles. Should cause a "real" flush even on systems without a working IO::Handle::flush() method.

flush
$tied_or_undef = $tied->flush();
$tied_or_undef = $tied->flush($flushHeader);

Attempts to flush underlying filehandles using their flush() method if available, otherwise calls reopen(). If $flushHeader is specified and true, also writes the header-file.

Object API: file operations

unlink
$tied_or_undef = $tied->unlink();
$tied_or_undef = CLASS_OR_OBJECT->unlink($file);

Attempts to unlink any underlying file(s) for the data-file $file. Implicitly calls close() before unlinking.

rename
$tied_or_undef = $tied->rename($newname);

Renames underlying files using CORE::rename(). Implicitly close()s and re-open()s $tied, which must be opened in write-mode.

copy
$dst_object_or_undef = $tied_src->copy($dst_filename, %dst_opts);
$dst_object_or_undef = $tied_src->copy($dst_object);

Copies underlying files using File::Copy::copy(). Source object must be opened. Implicitly calls flush() on both source and destination objects before and after the copy operation, respectively. If a destination object is specified (2nd form), it must be opened in write-mode, otherwise a new destination object will be created and returned. You canNOT use this method to convert between incompatible file formats (e.g. Storable and JSON), but it should be faster than array assignment:

tie(my @a, 'Tie::File::Indexed::JSON',     'a.tfx');
tie(my @b, 'Tie::File::Indexed::Storable', 'b.tfx');
tied(@a)->copy(tied(@b));                                # this won't work!
@b = @a;                                                 # ... but this ought to

tie(my @a2, 'Tie::File::Indexed::JSON', 'a2.tfx');
@a2 = @a;                                                # slow element-wise copy
tied(@a)->copy(tied(@a2));                               # ... fast bulk copy

move
$tied_or_undef = $tied->move($newname);

Moves underlying files using File::Copy::move(). Implicitly close()s and re-open()s $tied, which must be opened in write-mode.

Object API: advisory locking

flock
$bool = $tied->flock();
$bool = $tied->flock($lock);

Get an advisory lock of type $lock (default=Fcntl::LOCK_EX) on $tied->{datfh}, using perl's flock() function. Implicitly calls flush() prior to locking.

funlock
$bool = $tied->funlock();
$bool = $tied->funlock($lock);

Unlock $tied->{datfh} using perl's flock() function; $lock defaults to Fcntl::LOCK_UN.
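
For readers unfamiliar with advisory locking, the sketch below shows the underlying flush-then-lock pattern on a plain filehandle using only core modules. The helper with_lock() is hypothetical and not part of the module; with a tied object, the corresponding calls would be $tied->flock() before and $tied->funlock() after the critical section.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use IO::Handle;

# Run $code while holding an advisory lock on $fh (exclusive by default).
sub with_lock {
    my ($fh, $code, $lock) = @_;
    $lock = LOCK_EX unless defined $lock;
    $fh->flush();                                  # mirrors the documented flush-before-lock behaviour
    flock($fh, $lock) or die "cannot lock: $!";
    my @result = eval { $code->() };
    my $err = $@;
    flock($fh, LOCK_UN) or die "cannot unlock: $!";  # always release, even on error
    die $err if $err;
    return @result;
}

# Demonstration on an anonymous temp file.
open(my $fh, '+>', undef) or die "cannot open temp file: $!";
with_lock($fh, sub { print {$fh} "record\n" });
print "lock released\n";
```

Advisory locks only protect against other processes that also request the lock; they do not prevent unrelated programs from writing to the files.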

Object API: buffering and consolidation

consolidate
$tied_or_undef = $tied->consolidate();
$tied_or_undef = $tied->consolidate($tmpfile);

Consolidates file data: ensures that data in $tied->{datfh} are in index-order and contain no gaps or unused blocks. The object must be opened in write-mode. Uses $tmpfile as a temporary file for consolidation (default="$tied->{file}.tmp").

If you never overwrite data in your tied arrays, you probably won't need this method. It can be useful to reduce the size of the associated data-file and/or optimize index-ordered access operations, since (over)writing any existing array item causes a new record to be appended to the data-file. Consider the following code:

tie(my @data, 'Tie::File::Indexed', "data") or die ...  ##-- tie the array; data-file is empty
$data[1] = 'bar';                                       ##-- data-file is now "bar"
$data[0] = 'foo';                                       ##-- data-file is now "barfoo"
$data[1] = 'baz';                                       ##-- data-file is now "barfoobaz"

Here, the element at index 0 ("foo") is stored "out-of-order", since its physical location in the data-file (2nd record) does not correspond to its logical location in the array (1st element). Further, the 1st record in the data-file ("bar") is unused, since it was overwritten by the value "baz" stored in the 3rd data-file record. The index-file takes care of resolving the offset and length of the logical array elements (so that e.g. $data[1] eq 'baz' rather than 'bar'), but no effort is made to re-use unreferenced material in the data-file, so the original value for $data[1] is effectively orphaned. Calling consolidate() at this point ensures that the disk-files are logically sorted and contain no unreferenced material:

tied(@data)->consolidate();                             ##-- data-file is now "foobaz"

This method is never implicitly called, so if you need it, you'll have to call it yourself.
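
The behaviour described above can be reproduced with a small in-memory toy model, where a string plays the role of the data-file and an array of [offset, length] pairs plays the role of the index-file. This is only a sketch of the idea; the function names store(), fetch() and consolidate() are hypothetical stand-ins and not the module's code.

```perl
use strict;
use warnings;

# Toy model: $data is the "data-file", @index holds [offset, length] per element.
my $data = '';
my @index;

# Every store appends a new record and repoints the index entry.
sub store {
    my ($i, $value) = @_;
    $index[$i] = [length($data), length($value)];
    $data .= $value;
}

# Resolve a logical element through the index.
sub fetch {
    my ($i) = @_;
    my $rec = $index[$i] or return undef;
    return substr($data, $rec->[0], $rec->[1]);
}

# Rewrite the data in index order, dropping unreferenced bytes.
sub consolidate {
    my $new = '';
    for my $rec (grep { defined } @index) {
        my $value = substr($data, $rec->[0], $rec->[1]);
        @$rec = (length($new), length($value));
        $new .= $value;
    }
    $data = $new;
}

store(1, 'bar');
store(0, 'foo');
store(1, 'baz');
print "before: $data\n";              # barfoobaz
consolidate();
print "after:  $data\n";              # foobaz
print "data[1] = ", fetch(1), "\n";   # baz
```

The model also makes the cost visible: consolidation has to touch every live record once, which is presumably why it is left as an explicit call.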

SUBCLASSES

The default data storage methods in Tie::File::Indexed are suitable for simple perl scalars (integers, floating-point numbers, or simple byte-strings). The Tie::File::Indexed distribution comes with several pre-defined subclasses for storing other types of data as well. Currently, the following pre-defined subclasses are supported:

Tie::File::Indexed::Utf8

Stores data records as UTF-8 encoded strings. Useful if your data strings are expected to be encoded in UTF8.

Tie::File::Indexed::JSON

Stores data records as JSON strings using the JSON module. Useful if you need to store complex data structures and simple scalars in the same tied array.

Tie::File::Indexed::Storable

Stores data records in native binary format using Storable::store_fd(). Useful if you need to store only references (bless()ed or otherwise) to be used on the local machine. Individual data records can be used directly with Storable::retrieve_fd().

Tie::File::Indexed::StorableN

Stores data records in portable "network" binary format using Storable::nstore_fd(). Useful if you need to store only references (bless()ed or otherwise) to be shared between machine architectures. Individual data records can be used directly with Storable::retrieve_fd().

Tie::File::Indexed::Freeze

Stores data records in native binary format using Storable::freeze(). Useful if you need to store only references (bless()ed or otherwise) to be used on the local machine. Data-files are slightly smaller than those produced by Tie::File::Indexed::Storable, but individual data records cannot be used directly with Storable::retrieve_fd().

Tie::File::Indexed::FreezeN

Stores data records in portable "network" binary format using Storable::nfreeze(). Useful if you need to store only references (bless()ed or otherwise) to be shared between machine architectures. Data-files are slightly smaller than those produced by Tie::File::Indexed::Storable, but individual data records cannot be used directly with Storable::retrieve_fd().
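
The difference between the native and "network" variants comes from Storable itself. The following core-only sketch, which does not use any of the subclasses above, serializes the same reference both ways; the sample record is made up for illustration.

```perl
use strict;
use warnings;
use Storable qw(freeze nfreeze thaw);

my $record = { id => 42, tags => ['a', 'b'] };   # made-up sample data

my $native  = freeze($record);    # native byte order: intended for the local machine
my $network = nfreeze($record);   # network byte order: portable across architectures

# thaw() reads both formats.
my $copy = thaw($network);
printf "native=%d bytes, network=%d bytes, id=%d\n",
    length($native), length($network), $copy->{id};
```

If the data-files never leave the machine that wrote them, the native variants avoid the byte-order conversion; otherwise the "N" variants are the safer choice.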

CAVEATS

Monotonic growth and random access

No disk-space optimization is performed by default, and frequent overwrites will cause the data-file to grow monotonically: every time a logical item is written to the array via the STORE() method, a new physical record is appended to the data-file, and the index-record for the item is updated to point to the new record. This is fine if you only insert elements in logical order (e.g. using push) and never overwrite elements which have already been stored. Otherwise, out-of-order elements may degrade performance for logical-order access (e.g. via foreach), since lots of random seek() operations often don't play nicely with perl's buffering strategy and/or the underlying filesystem cache. Overwriting elements is a bigger problem, since overwrites cause the associated data-file to grow ever larger. The "consolidate" method is provided as a workaround for these undesirable effects. Future versions of this module may perform some implicit on-the-fly disk-space optimization or consolidation, but currently none is performed: if you need to consolidate, do it yourself!

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.20.2 or, at your option, any later version of Perl 5 you may have available.