PackRow2

PackRow2 takes a list of items and packs them (non-destructively) into a string of at most maxsize bytes. If no offset is specified, it builds the string starting with the last item in the list, prepending each preceding item until it runs out of space or the list is fully consumed. If the packer runs out of space, it returns the offset into the list where it stopped. That offset may be supplied as an argument on a later call, and the packer will pack the remainder of the list starting at the offset, working back to the beginning of the list.

The final argument to the packer is a "next pointer", a string that identifies the location of the next part of a row split into multiple pieces. Since the packer processes a list from back to front, the address of the "next" piece can be obtained before constructing the preceding piece.

If the packer can process a complete list, it returns an array containing a single packed string: a byte string consisting of a count of the number of packed items, followed by length/value pairs for each item. If the packer runs out of space, it returns an array of the packed string and the offset of the remaining items.

For example, given the list @a = qw(alpha bravo charlie delta) and maxsize=15, PackRow2 returns a packed string (something like \x01\x05delta) and the offset 3, indicating that only the last item in the list was packed and that the packer ran out of space at the third item. The packed string could be stored in a pushhash, which would return an index, e.g. "5/2", suitable for use as a next pointer. Packing the remainder of the list generates another packed string (e.g. \x02\x07charlie\x035/2) and the offset 2. The packing and storage process continues until the entire list is consumed.
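The back-to-front packing described above can be sketched in a few lines of Perl. This is a minimal illustration, not the Genezzo implementation: the function name pack_from_back and its argument order are hypothetical, one-byte count and length prefixes are assumed, and the null vector is left out. Because that overhead is missing, the sketch uses a maxsize of 14 rather than 15 to reproduce the split points from the example.

```perl
use strict;
use warnings;

# Hypothetical sketch (not Genezzo's PackRow2): pack items from the back of
# @$items into at most $maxsize bytes. Assumes byte strings shorter than
# 256 bytes and omits the null vector.
sub pack_from_back {
    my ($items, $maxsize, $offset, $next) = @_;
    $offset = scalar @$items unless defined $offset;

    # The next pointer, if any, is stored whole as the last field.
    my @fields = defined $next ? ($next) : ();
    my $size = 1;                          # one byte for the item count
    $size += 1 + length($_) for @fields;   # length byte plus value
    die "next pointer does not fit" if $size > $maxsize;

    my $start = $offset;
    while ($offset > 0) {
        my $item = $items->[$offset - 1];
        my $need = 1 + length($item);
        last if $size + $need > $maxsize;  # out of space: stop here
        unshift @fields, $item;            # prepend the preceding item
        $size += $need;
        $offset--;
    }
    die "item exceeds maxsize (column splitting is not handled here)"
        if $offset > 0 && $offset == $start;

    my $packed = pack('C', scalar @fields)
               . join('', map { pack 'C/a*', $_ } @fields);

    # A complete row returns only the string; otherwise string and offset.
    return $offset > 0 ? ($packed, $offset) : ($packed);
}

my @a = qw(alpha bravo charlie delta);
my ($piece, $offset) = pack_from_back(\@a, 14);
printf "%d bytes, offset %d\n", length($piece), $offset;   # 7 bytes, offset 3
```

Calling pack_from_back(\@a, 14, 3, "5/2") next would pack charlie together with the next pointer and return the offset 2, mirroring the second step of the example.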

advanced topics

null vector

The packed string always contains a bitstring to identify null columns, which is used by UnPackRow to correctly distinguish between nulls and zero length strings.
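As an illustration of why the bitstring is needed, the sketch below builds a null vector with Perl's built-in vec function. The helper name null_vector and the bit layout (one bit per column, set when the column is undef) are assumptions made for this example; the actual layout used by PackRow may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Genezzo::Util): one bit per column,
# set to 1 when the column is undef (null).
sub null_vector {
    my @cols = @_;
    my $vec = '';
    vec($vec, $_, 1) = defined($cols[$_]) ? 0 : 1 for 0 .. $#cols;
    return $vec;
}

my @row = ('alpha', undef, '', 'delta');
my $nullvec = null_vector(@row);

# Column 1 is null; column 2 is an empty string, so its bit stays clear.
for my $i (0 .. $#row) {
    print "col $i: ", (vec($nullvec, $i, 1) ? 'null' : 'not null'), "\n";
}
```

Without the bitstring, both the undef in column 1 and the empty string in column 2 would pack as a zero-length value and could not be told apart on unpacking.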

next pointer

Since the next pointer is used to find the next part of a split row, it must always remain whole -- if it was split, how could you find the next piece? The next pointer is a convention supported by PackRow/UnPackRow to facilitate the construction of methods that manipulate split rows. The packing function only flattens an array into a byte string or series of strings; it does not provide any intrinsic support to traverse these strings. Functions that manipulate packed rows may use additional structures to support multi-part rows, such as external metadata in the block row directory, or specialized metadata columns embedded in the row itself.
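A caller that wants to reassemble a split row has to follow the next pointers itself. The sketch below shows one way this might look. Everything in it is hypothetical: the %store hash stands in for a pushhash, the has-next flag stands in for external metadata such as a block row directory, and the piece layout (count byte, then length/value pairs, no null vector) is a simplification.

```perl
use strict;
use warnings;

# Decode one piece: a count byte followed by length/value pairs.
# (Assumed layout for this sketch; the real format also carries a null vector.)
sub unpack_fields {
    my ($packed) = @_;
    my $count = ord(substr($packed, 0, 1));
    my $pos = 1;
    my @fields;
    for (1 .. $count) {
        my $len = ord(substr($packed, $pos, 1));
        push @fields, substr($packed, $pos + 1, $len);
        $pos += 1 + $len;
    }
    return @fields;
}

# Follow next pointers from the head piece and rebuild the row.
# %$store maps a key to [ packed piece, has-next flag ].
sub fetch_row {
    my ($store, $key) = @_;
    my @row;
    while (defined $key) {
        my $piece = $store->{$key} or die "missing row piece $key";
        my ($packed, $has_next) = @$piece;
        my @fields = unpack_fields($packed);
        # The next pointer is always the last field of a piece that has one.
        $key = $has_next ? pop(@fields) : undef;
        push @row, @fields;
    }
    return @row;
}

my %store = (
    '5/5' => [ "\x02\x05alpha\x035/4",   1 ],   # head of the row
    '5/4' => [ "\x02\x05bravo\x035/3",   1 ],
    '5/3' => [ "\x02\x07charlie\x035/2", 1 ],
    '5/2' => [ "\x01\x05delta",          0 ],   # tail: no next pointer
);

print join(' ', fetch_row(\%store, '5/5')), "\n";   # alpha bravo charlie delta
```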

column splitting (fragmentation)

The packer can support rows with individual columns that exceed the maxsize. The offset can simultaneously maintain the current column position, as well as the current character offset in that column. It's wicked complicated. Generally, we say that a row is split into row pieces, and the row pieces are chained (via the next pointers), which lets us reconstruct a complete row. Individual columns that are split are said to be fragmented.
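The simplest part of fragmentation, cutting one oversized column value into pieces that each fit, can be sketched as follows. The helper name fragment_column is hypothetical, and the sketch does not attempt the combined column/character offset bookkeeping described above; it only shows that fragments joined in order give back the original value. It relies on the ()-group syntax for unpack templates, so it needs Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: cut one oversized column value into fragments of at
# most $max bytes each ($max must be a positive integer).
sub fragment_column {
    my ($value, $max) = @_;
    return unpack "(a$max)*", $value;
}

my @frags = fragment_column('abcdefghij', 4);   # ('abcd', 'efgh', 'ij')
my $whole = join '', @frags;                    # 'abcdefghij'
print scalar(@frags), " fragments: @frags\n";   # 3 fragments: abcd efgh ij
```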

future work

The packer could be extended to support more complex structures than arrays of scalars. In lieu of this ability, these structures can be flattened using Data::Dumper or YAML to large strings.
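The flattening workaround might look like the sketch below, which uses the core Data::Dumper module. The row contents are made up for illustration, and the caller is assumed to remember which columns were flattened so they can be restored later.

```perl
use strict;
use warnings;
use Data::Dumper;

# Compact, single-line, deterministic output that can be eval'ed back.
$Data::Dumper::Terse    = 1;
$Data::Dumper::Indent   = 0;
$Data::Dumper::Sortkeys = 1;

my @row = ('alpha', { tags => ['x', 'y'], n => 1 });

# Flatten any reference to a string so every column is a plain scalar.
my @flat = map { ref($_) ? Dumper($_) : $_ } @row;

# Restoring requires knowing which columns were flattened (column 1 here).
# String eval is only safe on data you produced yourself.
my $restored = eval $flat[1];
die $@ if $@;

print "second tag: $restored->{tags}[1]\n";   # second tag: y
```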

NAME

Genezzo::Util - Utility functions

TODO

Should bundle all data file utility functions, such as FileGetHeaderInfo, SetHeaderInfo, etc., under a separate Util::DataFile module.
FileGetHeaderInfo: need to handle the case of a header that exceeds a single block. Probably should keep increasing the buffer size until it finds the null terminator (within reason).
packrow: store metadata in col0 vs trailing col with next ptr
packrow: check pack format for a zero len row of zero cols. Does it need a nullvec?
packrow/unpackrow: in Perl 5.8 could use the nifty repeating templates to our advantage.
packrow: could generate skiplists as col zero metadata tracking byte position and column numbers to speed lookups
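The "repeating templates" mentioned in the list above are presumably the ()-groups that pack and unpack gained in Perl 5.8. The sketch below shows how a count followed by length/value pairs could be written with a single template. It illustrates the idea only: it is not how packrow currently works, it omits the null vector, and the one-byte prefixes limit it to 255 items of at most 255 bytes each.

```perl
use strict;
use warnings;

my @cols = qw(alpha bravo charlie);

# 'C' writes the item count; '(C/a*)*' repeats a length-prefixed string
# for every remaining argument.
my $packed = pack 'C (C/a*)*', scalar(@cols), @cols;

# The same template reads the count, then length/value pairs until the
# string is exhausted.
my ($count, @back) = unpack 'C (C/a*)*', $packed;

print "$count items: @back\n";   # 3 items: alpha bravo charlie
```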

AUTHOR

Jeffrey I. Cohen, jcohen@genezzo.com

SEE ALSO

perl(1).

Copyright (c) 2003-2007 Jeffrey I Cohen. All rights reserved.

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA

Address bug reports and comments to: jcohen@genezzo.com

For more information, please visit the Genezzo homepage at http://www.genezzo.com