NAME
Tie::File::Indexed - fast tied array access to indexed data files
SYNOPSIS
##========================================================================
## PRELIMINARIES
use Tie::File::Indexed;
##========================================================================
## Tied Array access
##-- tie an array (uses files "data", "data.idx", and "data.hdr")
my $filename = "data";
tie(my @data, 'Tie::File::Indexed', $filename, %options) or die ...
##-- add some data
$data[42] = 'blah'; # set an item
$data[42] = 'blip'; # overwrite an item (really appends a new record)
$data[24] = 'bonk'; # out-of-order storage
print $data[42]; # retrieve & print a stored value
##-- tweak array size
$n_items = @data; # get number of stored records
$#data -= 2; # chop two records off the end
#... push(), pop(), shift(), unshift(), and splice() should do What You Expect
##-- get underlying object
my $tied = tied(@data);
##-- file operations
$tied->unlink(); # unlink underlying files (tied access won't work after this!)
$tied->rename($newname); # rename underlying files using CORE::rename()
$tied->move($newname); # move underlying files using File::Copy::move()
$copy = $tied->copy($newname); # copy underlying files using File::Copy::copy()
##-- advisory locking
$tied->flock(); # get an advisory lock on $filename
$tied->funlock(); # release our lock on $filename
##-- buffering and consolidation
$tied->flush(); # flush underlying filehandles
$tied->reopen(); # close and re-open underlying filehandles
$tied->consolidate(); # remove gaps and stale values
##-- all done
undef $tied;
untie(@data);
DESCRIPTION
The Tie::File::Indexed class provides fast tied array access to raw data-files. It uses an auxiliary packed index-file to store and retrieve the offsets and lengths of the corresponding raw data strings, and an additional header-file to store administrative data, resulting in a constant and very small memory footprint. Random-access storage and retrieval should both be very fast, and even pop(), shift() and splice() operations on large arrays should be tolerably efficient, since these only need to modify the (comparatively small) index-file. No disk-space optimization is performed by default, and frequent overwrites will cause the data-file to grow monotonically: see "consolidate" for a workaround.
The Tie::File::Indexed distribution also comes with several pre-defined subclasses for transparent encoding/decoding of UTF8-encoded strings, and complex data structures encoded via the JSON or Storable modules. See "SUBCLASSES" for details.
Constructors etc.
- new
-
$tied = CLASS->new(%opts);
$tied = CLASS->new($file,%opts);
$tied = tie(@array, CLASS, $file, %opts);
Creates and returns a new Tie::File::Indexed object, in the third form tying it to the Perl array @array. Currently accepted %options:
file => $file, ##-- file basename; uses files "${file}", "${file}.idx", "${file}.hdr"
mode => $mode, ##-- open mode (fcntl flags or perl-style; default='rwa')
perms => $perms, ##-- default: 0666 & ~umask
pack_o => $pack_o, ##-- file offset pack template (default='J')
pack_l => $pack_l, ##-- string-length pack template (default='J')
bsize => $bsize, ##-- buffer size in bytes for index batch-operations (default=2**21 = 2MB)
When opening an existing file, administrative header-data is read from the header-file $file.hdr, which is written on close() if opened in write-mode. Raw data records are read/written from/to the data-file $file, and their offsets and lengths are stored as packed integers in the index-file $file.idx. The options pack_o and pack_l control the pack-templates to use for the index-file; the default pack templates use the 'J' pack format to pack offsets and lengths as Perl internal unsigned integer values. See the entry for "pack" in perlfunc for details.
- defaults
-
%defaults = CLASS_OR_OBJECT->defaults()
Default attributes for constructor; can be overridden by subclasses.
- DESTROY
-
undef = $tied->DESTROY();
Destructor implicitly calls close().
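The index-file layout described for the pack_o and pack_l options can be illustrated with core Perl alone. The following is a minimal sketch, not the module's actual implementation: it assumes each index record is simply the packed offset followed by the packed length, and the helper index_record() and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Documented defaults for the pack_o and pack_l options.
my ($pack_o, $pack_l) = ('J', 'J');
my $template = $pack_o . $pack_l;

# Size in bytes of one packed (offset, length) index record.
my $reclen = length(pack($template, 0, 0));

# Build a toy index for three 3-byte records stored back to back.
my $index_bytes = '';
$index_bytes .= pack($template, @$_) for ([0, 3], [3, 3], [6, 3]);

# Hypothetical helper: look up the (offset, length) pair for element $i.
sub index_record {
    my ($packed, $i) = @_;
    return unpack($template, substr($packed, $i * $reclen, $reclen));
}

my ($off, $len) = index_record($index_bytes, 1);
printf "element 1: offset=%d length=%d (%d bytes per index record)\n",
    $off, $len, $reclen;
```

Because every record has the same packed size, locating element $i only requires a seek to $i * $reclen, which is why the index stays small and random access stays cheap.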
Subclass API: Data I/O
- writeData
-
$bool = $tied->writeData($data);
Write item $data to $tied->{datfh} at its current position. After writing, $tied->{datfh} should be positioned to the first byte following the written item. The object is assumed to be opened in write-mode. The default implementation just writes $data as a byte-string (undef is written as the empty string). Can be overridden by subclasses to perform transparent encoding of complex data.
- readData
-
$data_or_undef = $tied->readData($length);
Read item data of length $length from $tied->{datfh} at its current position. Can be overridden by subclasses to perform transparent decoding of complex data.
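The writeData()/readData() contract can be tried out without the module by using an in-memory filehandle. This is only a sketch under stated assumptions: the functions write_data() and read_data() and the hash $obj are hypothetical stand-ins that mimic the documented behaviour, with UTF-8 encoding chosen as the example of transparent encoding.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical stand-in for a tied object: only the {datfh} slot is modeled.
my $buffer = '';
open(my $datfh, '+<', \$buffer) or die "cannot open in-memory file: $!";
my $obj = { datfh => $datfh };

# Mimics the documented writeData() contract: write at the current position,
# leave the handle just past the written bytes, treat undef as ''.
sub write_data {
    my ($tied, $data) = @_;
    my $bytes = encode_utf8(defined $data ? $data : '');
    return print { $tied->{datfh} } $bytes;
}

# Mimics the documented readData() contract: read $length bytes from the
# current position and decode them.
sub read_data {
    my ($tied, $length) = @_;
    my $bytes = '';
    my $got = read($tied->{datfh}, $bytes, $length);
    return unless defined $got && $got == $length;
    return decode_utf8($bytes);
}

my $text  = "caf\x{e9} \x{263a}";
my $start = tell($datfh);
write_data($obj, $text) or die "write failed";
my $len = tell($datfh) - $start;   # byte length, as an index record would store it
seek($datfh, $start, 0) or die "seek failed";
my $back = read_data($obj, $len);
print $back eq $text ? "round trip ok\n" : "round trip FAILED\n";
```

Note that the stored length is a byte count, not a character count: the encoded form is what lands in the data-file.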
Subclass API: Index I/O
- readIndex
-
($off,$len) = $tied->readIndex($index);
($off,$len) = $tied->readIndex(undef);
Reads an index-record from $tied->{idxfh}. If $index is undef, reads from the current position of $tied->{idxfh}, otherwise reads the index record for item at logical index $index, which is assumed to exist in the array. Returns offset and length in $tied->{datfh} of the item data, or the empty list on error.
- writeIndex
-
$tied_or_undef = $tied->writeIndex($index,$off,$len);
$tied_or_undef = $tied->writeIndex(undef, $off,$len);
Writes index-record for a logical item to $tied->{idxfh}. If $index is undef, writes at the current position of $tied->{idxfh}, otherwise writes a record for the logical index $index, creating one if it doesn't already exist. Returns the tied object on success, undef on error.
- shiftIndex
-
$tied_or_undef = $tied->shiftIndex($start,$n,$shift);
Moves $n index records starting from $start by $shift positions (may be negative). Operates directly on $tied->{idxfh}. Doesn't change unaffected values. Used by SPLICE() method.
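The effect of shiftIndex() can be pictured on a packed string instead of a real index filehandle. The sketch below is an assumption-laden model rather than the module's code: shift_records() is a hypothetical helper, and records are assumed to be laid out as consecutive 'JJ' pairs.

```perl
use strict;
use warnings;

my $template = 'JJ';                             # default pack_o . pack_l
my $reclen   = length(pack($template, 0, 0));

# Hypothetical helper: move $n packed records starting at $start by $shift
# positions (negative shifts move towards the front). Records outside the
# destination range keep their old values.
sub shift_records {
    my ($packed, $start, $n, $shift) = @_;
    my $dst = $start + $shift;
    die "shift would move records before index 0\n" if $dst < 0;
    my $chunk = substr($packed, $start * $reclen, $n * $reclen);
    my $end   = $dst * $reclen + length($chunk);
    $packed  .= "\0" x ($end - length($packed)) if $end > length($packed);
    substr($packed, $dst * $reclen, length($chunk), $chunk);
    return $packed;
}

my $index   = join '', map { pack($template, @$_) } ([0, 3], [3, 3], [6, 3], [9, 3]);
my $shifted = shift_records($index, 1, 3, -1);
my @flat    = unpack("($template)*", $shifted);
print "@flat\n";   # 3 3 6 3 9 3 9 3
```

The last record is left untouched by the move, so a shift()-like operation would still need to shrink the array by one afterwards.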
Object API: header
- headerKeys
-
@keys = $tied->headerKeys();
Keys to save as header.
- headerData
-
\%header = $tied->headerData();
Data to save as header.
- loadHeader
-
$tied_or_undef = $tied->loadHeader();
$tied_or_undef = $tied->loadHeader($headerFile,%opts);
Loads header from $headerFile, by default "$tied->{file}.hdr".
- saveHeader
-
$tied_or_undef = $tied->saveHeader();
$tied_or_undef = $tied->saveHeader($headerFile);
Saves header data to $headerFile, by default "$tied->{file}.hdr".
Object API: open/close
- open
-
$tied_or_undef = $tied->open($file,$mode);
$tied_or_undef = $tied->open($file);
$tied_or_undef = $tied->open();
Opens underlying file(s) for use.
- close
-
$tied_or_undef = $tied->close();
Closes any opened files; writes the header if opened in write mode.
- opened
-
$bool = $tied->opened();
Returns true iff object is opened.
- reopen
-
$bool = $tied->reopen();
Closes and re-opens underlying filehandles. Should cause a "real" flush even on systems without a working IO::Handle::flush() method.
- flush
-
$tied_or_undef = $tied->flush();
$tied_or_undef = $tied->flush($flushHeader);
Attempts to flush underlying filehandles using their flush() method if available, otherwise calls "reopen". If $flushHeader is specified and true, also writes header file.
Object API: file operations
- unlink
-
$tied_or_undef = $tied->unlink();
$tied_or_undef = CLASS_OR_OBJECT->unlink($file);
Attempts to unlink any underlying file(s) for the data-file $file. Implicitly calls close() before unlinking.
- rename
-
$tied_or_undef = $tied->rename($newname);
Renames underlying files using CORE::rename(). Implicitly close()s and re-open()s $tied, which must be opened in write-mode.
- copy
-
$dst_object_or_undef = $tied_src->copy($dst_filename, %dst_opts);
$dst_object_or_undef = $tied_src->copy($dst_object);
Copies underlying files using File::Copy::copy(). Source object must be opened. Implicitly calls flush() on both source and destination objects before and after the copy operation, respectively. If a destination object is specified (2nd form), it must be opened in write-mode, otherwise a new destination object will be created and returned. You canNOT use this method to convert between incompatible file formats (e.g. Storable and JSON), but it should be faster than array assignment:
tie(my @a, 'Tie::File::Indexed::JSON', 'a.tfx');
tie(my @b, 'Tie::File::Indexed::Storable', 'b.tfx');
tied(@a)->copy(tied(@b)); # this won't work!
@b = @a; # ... but this ought to
tie(my @a2, 'Tie::File::Indexed::JSON', 'a2.tfx');
@a2 = @a; # slow element-wise copy
tied(@a)->copy(tied(@a2)); # ... fast bulk copy
- move
-
$tied_or_undef = $tied->move($newname);
Moves underlying files using File::Copy::move(). Implicitly close()s and re-open()s $tied, which must be opened in write-mode.
Object API: advisory locking
- flock
-
$bool = $tied->flock();
$bool = $tied->flock($lock);
Get an advisory lock of type $lock (default=Fcntl::LOCK_EX) on $tied->{datfh}, using perl's flock() function. Implicitly calls flush() prior to locking.
- funlock
-
$bool = $tied->funlock();
$bool = $tied->funlock($lock);
Unlock $tied->{datfh} using perl's flock() function; $lock defaults to Fcntl::LOCK_UN.
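The locking behaviour documented above (flush, then flock() with a default of LOCK_EX, and LOCK_UN to release) can be approximated on a plain filehandle with core modules. This is a rough sketch, not the module's code: lock_handle() and unlock_handle() are hypothetical helpers, and a temporary file stands in for the data-file.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Temp qw(tempfile);
use IO::Handle;

# A temporary file stands in for the data-file handle $tied->{datfh}.
my ($fh, $path) = tempfile(UNLINK => 1);

# Hypothetical helper: flush, then take an advisory lock (default LOCK_EX).
sub lock_handle {
    my ($handle, $lock) = @_;
    $lock = LOCK_EX unless defined $lock;
    $handle->flush();
    return flock($handle, $lock);
}

# Hypothetical helper: release the advisory lock (default LOCK_UN).
sub unlock_handle {
    my ($handle, $lock) = @_;
    $lock = LOCK_UN unless defined $lock;
    return flock($handle, $lock);
}

my $locked = lock_handle($fh) or die "cannot lock $path: $!";
print {$fh} "written while holding the lock\n";
my $unlocked = unlock_handle($fh) or die "cannot unlock $path: $!";
print "lock and unlock succeeded\n";
```

Advisory locks only coordinate processes that also ask for the lock; a writer that never calls flock() is not blocked.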
Object API: buffering and consolidation
- consolidate
-
$tied_or_undef = $tied->consolidate();
$tied_or_undef = $tied->consolidate($tmpfile);
Consolidates file data: ensures that data in $tied->{datfh} are in index-order and contain no gaps or unused blocks. The object must be opened in write-mode. Uses $tmpfile as a temporary file for consolidation (default="$tied->{file}.tmp").
If you never overwrite data in your tied arrays, you probably won't need this method. It can be useful to reduce the size of the associated data-file and/or optimize index-ordered access operations, since (over)writing any existing array item causes a new record to be appended to the data-file. Consider the following code:
tie(my @data, 'Tie::File::Indexed', "data") or die ... ##-- tie the array; data-file is empty
$data[1] = 'bar'; ##-- data-file is now "bar"
$data[0] = 'foo'; ##-- data-file is now "barfoo"
$data[1] = 'baz'; ##-- data-file is now "barfoobaz"
Here, the element at index 0 ("foo") is stored "out-of-order", since its physical location in the data-file (2nd record) does not correspond to its logical location in the array (1st element). Further, the 1st record in the data-file ("bar") is unused, since it was overwritten by the value "baz" stored in the 3rd data-file record. The index-file takes care of resolving the offset and length of the logical array elements (so that e.g. $data[1] eq 'baz' rather than 'bar'), but no effort is made to re-use unreferenced material in the data-file, so that the original value for $data[1] is effectively orphaned. Calling consolidate() at this point ensures that the disk-files are logically sorted and contain no unreferenced material:
tied(@data)->consolidate(); ##-- data-file is now "foobaz"
This method is never implicitly called, so if you need it, you'll have to call it yourself.
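The append-then-consolidate behaviour described for "barfoobaz" can be reproduced with an in-memory model. The following is a simplified sketch, not the module's implementation: a string plays the data-file, an array of [offset, length] pairs plays the index-file, and store() and consolidate() are hypothetical stand-ins.

```perl
use strict;
use warnings;

# In-memory stand-ins (hypothetical model, not the module's internals):
# $data plays the data-file, @index holds [offset, length] per element.
my $data = '';
my @index;

# Every store appends a new record and repoints the index entry.
sub store {
    my ($i, $value) = @_;
    $index[$i] = [length($data), length($value)];
    $data .= $value;
    return;
}

# Rewrite the data in index order, dropping unreferenced bytes.
sub consolidate {
    my $new = '';
    for my $rec (@index) {
        next unless defined $rec;
        my ($off, $len) = @$rec;
        $rec = [length($new), $len];   # loop variable aliases the @index slot
        $new .= substr($data, $off, $len);
    }
    $data = $new;
    return;
}

store(1, 'bar');
store(0, 'foo');
store(1, 'baz');
print "before: $data\n";   # before: barfoobaz
consolidate();
print "after:  $data\n";   # after:  foobaz
```

After consolidation the index entries point at the rewritten offsets, so element 1 still resolves to "baz" while the orphaned "bar" is gone.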
SUBCLASSES
The default data storage methods in Tie::File::Indexed are suitable for simple perl scalars (integers, floating-point numbers, or simple byte-strings). The Tie::File::Indexed distribution comes with several pre-defined subclasses for storing other types of data as well. Currently, the following pre-defined subclasses are supported:
- Tie::File::Indexed::Utf8
-
Stores data records as UTF-8 encoded strings. Useful if your data strings are expected to be encoded in UTF8.
- Tie::File::Indexed::JSON
-
Stores data records as JSON strings using the JSON module. Useful if you need to store complex data structures and simple scalars in the same tied array.
- Tie::File::Indexed::Storable
-
Stores data records in native binary format using Storable::store_fd(). Useful if you need to store only references (bless()ed or otherwise) to be used on the local machine. Individual data records can be used directly with Storable::retrieve_fd().
- Tie::File::Indexed::StorableN
-
Stores data records in portable "network" binary format using Storable::nstore_fd(). Useful if you need to store only references (bless()ed or otherwise) to be shared between machine architectures. Individual data records can be used directly with Storable::retrieve_fd().
- Tie::File::Indexed::Freeze
-
Stores data records in native binary format using Storable::freeze(). Useful if you need to store only references (bless()ed or otherwise) to be used on the local machine. Data-files are slightly smaller than those produced by Tie::File::Indexed::Storable, but individual data records cannot be used directly with Storable::retrieve_fd().
- Tie::File::Indexed::FreezeN
-
Stores data records in portable "network" binary format using Storable::nfreeze(). Useful if you need to store only references (bless()ed or otherwise) to be shared between machine architectures. Data-files are slightly smaller than those produced by Tie::File::Indexed::Storable, but individual data records cannot be used directly with Storable::retrieve_fd().
CAVEATS
Monotonic growth and random access
No disk-space optimization is performed by default, and frequent overwrites will cause the data-file to grow monotonically: every time a logical item is written to the array via the STORE() method, a new physical record is appended to the data-file, and the index-record for the item is updated to point to the new record. This is fine if you only insert elements in logical order (e.g. using push) and never overwrite elements which have already been stored. Otherwise, out-of-order elements may degrade performance for logical-order access (e.g. via foreach), since lots of random seek() operations often don't play nicely together with perl's buffering strategy and/or the underlying filesystem cache. Overwriting elements is a bigger problem, since overwrites cause the associated data-file to grow ever larger. The "consolidate" method is provided as a workaround for these undesirable effects. Future versions of this module may perform some implicit on-the-fly disk-space optimization or consolidation, although currently no such implicit optimization or consolidation is performed: if you need to consolidate, do it yourself!
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.20.2 or, at your option, any later version of Perl 5 you may have available.