NAME
Data::Bvec - a module to manipulate integer arrays as bit vectors and "compressed bit strings" (using a simple RLE).
VERSION
VERSION: 1.01
SYNOPSIS
use
Data::Bvec;
my
$bv
= Data::Bvec::->new(
nums
=>[1,2,3] );
my
$vec
=
$bv
->get_bvec();
# 01110000
my
$bstr
=
$bv
->get_bstr();
# '-134'
my
$nums
=
$bv
->get_nums();
# [1,2,3]
----
my
$vec
= num2bit( [1,2,3] );
# 01110000
set_bit(
$vec
, 4, 1 );
# 01111000
my
$bstr
= compress bit2str
$vec
;
# '-143'
my
$nums
= bit2num str2bit uncompress
$bstr
;
# [1,2,3,4]
DESCRIPTION/DISCUSSION
This module encapsulates some simple routines for manipulating Perl bit vectors (putting values in; getting values out), but its main goal is to implement a simple run-length encoding scheme for bit vectors that compresses them into relatively human-readable and flat-file-storable strings.
My use case was wanting to prototype a data indexing system, and I wanted to ease debugging by plopping the bitstrings in a flat file that I could examine directly. (Each bit in a vector represents a record in the database -- true or false whether the term is in that record in the field being indexed.) It has worked well enough that I haven't felt the need to change how the bitstrings are stored (just where they're stored).
The initial version of the module used a different set of base-62 digits. In writing Math::Int2Base, I decided to normalize all the bases from 2 to 62 to use 0-9,A-Z,a-z. It makes the numbers sort correctly (ascii-betically == numerically), and it let me say that A base-16 == A base-36 == A base-62.
So now I'm rewriting this module to use those base conversion routines.
EXPORTS
Nothing is exported by default. The following may be exported individually; all of them may be exported using the :all
tag:
- set_bit
- howmany
- bit2str
- str2bit
- bit2num
- num2bit
- compress
- uncompress
Examples:
However, if you only use the object methods, nothing would need to be exported. See below.
SUBROUTINES
set_bit( $vec, $num, $zero_or_one )
This is a shallow wrapper around Perl's vec() that simply provides the third parameter (1) to that routine that says we're working with a bit vector.
Normally returns $num, if you care.
Parameters:
$vec
A Perl bit vector stored in the scalar.
$num
The number whose bit you want to target in the bit vector.
$zero_or_one
The value you want to set the bit to: 0 or 1. If not defined, 1 is assumed.
Examples:
my
$vec
=
""
;
# empty vector
set_bit
$vec
, 1, 1;
# 01000000
set_bit
$vec
, 2, 1;
# 01100000
set_bit
$vec
, 3;
# 01110000
set_bit
$vec
, 1, 0;
# 00110000
bit2str( $vec )
This routine is a shallow wrapper around unpack() that unpacks a bit vector into a string of '0's and '1's, in preparation for compression.
Parameters:
$vec
A Perl bit vector.
Example:
my
$vec
=
""
;
set_bit
$vec
, 4, 1;
# 00001000
my
$str
= bit2str
$vec
;
# '00001000'
str2bit( $str )
This routine is a shallow wrapper around pack() that packs a string of '0's and '1's (following uncompression) into a bit vector.
Parameters:
$str
A string of '0's and '1's, e.g., "00001000".
Example:
my
$vec
= str2bit
'00001000'
;
num2bit( \@integers )
This routine accepts an array ref of integers and returns a bit vector with those integer's bits turned on.
Parameters:
\@integers
A reference to an array of integers.
Examples:
my
$vec
= num2bit [1,2,3];
# 01110000
my
$vec
= num2bit [3,2,1];
# 01110000
The second example is intended to make clear that the order of the integers in the array is not retained (for obvious reasons), and calling bit2num( $vec ) will always return the integers in ascending order (see bit2num() below).
bit2num( $vec, $beg, $cnt )
This routine accepts a bit vector and returns an array of integers represented by the 1 bits.
The parameters $beg and $cnt are to support retrieving subsets of integers from a large vector -- in essence, to support "paging" through the set.
In scalar context, returns a reference to the array.
Parameters:
$vec
A bit vector.
$beg
The first integer (where the bit is 1) to return. Unlike array subscripts, the $beg positions start with 1, not 0.
$cnt
The maximum number of integers (including the first) to return.
Examples:
# 0----+----1----+----2----+----3-
my
$vec
= str2bit
'01110011110001111100001111110001'
;
my
$set1
= bit2num
$vec
, 1, 5;
# [ 1, 2, 3, 6, 7 ]
my
$set2
= bit2num
$vec
, 6, 5;
# [ 8, 9, 13, 14, 15 ]
my
$set3
= bit2num
$vec
, 11, 5;
# [ 16, 17, 22, 23, 24 ]
my
$set4
= bit2num
$vec
, 16, 5;
# [ 25, 26, 27, 31 ]
compress( $str )
This routine takes a string of '0's and '1's and compresses it using a simple run-length encoding (RLE). It returns this "compressed bit string".
Parameters:
$str
A string of '0's and '1's, e.g., "01110".
Note: the length of the string need not be a multiple of 8.
Example:
my
$bstr
;
$bstr
= compress
'01110000'
;
# '-134'
my
$str
= (
'1'
x100).(
'0'
x30).(
'1'
x6);
$bstr
= compress
$str
;
# '+@1cU6'
Compression Scheme
The compression scheme counts the number of consecutive '0's and '1's and concatenates that count (in base-62) to the compressed bit string.
If the first bit is '0', the compressed bit string begins with '-'. If the first bit is '1', it begins with '+'. The digit following that represents that many of those bits. The next digit represents that many of the "other" bits, and so on. (A "digit" matches /[0-9A-Za-z]/.)
So in the first example, '-134' means 1 '0' bit, then 3 '1' bits, then 4 '0' bits, i.e., '01110000'.
The second example includes a 2-digit number, 1c base-62 (100 decimal, as defined by Math::Int2base).
Any multi-digit number is preceded by a non-digit:
'@'
for
a 2-digit number
'#'
for
3 digits
'$'
for
4 digits
'%'
for
5 digits, and
'^'
for
6 digits
(Mnemonic: look above the numbers on a qwerty keyboard. A 6-digit number will accommodate 32,590,299,105 consecutive bits. If you need more than that, let me know.)
So '+@1cU6' means 1c (100) '1' bits, then U (30) '0' bits, then 6 '1' bits.
uncompress( $bstr )
This routine uncompresses a compressed bit string (which would have been compressed by the compress() routine above).
It returns a string of '0's and '1's. This string will (normally) then be converted to a bit vector using str2bit() above.
Parameters:
$bstr
A compressed bit string (see compress() above).
Example:
my
$bstr
=
'-134'
;
my
$str
= uncompress
$bstr
;
# '01110000'
howmany( $vec, $zero_or_one )
This routine returns a count of the 0 or 1 bits in a bit vector.
Parameters:
$vec
A bit vector.
$zero_or_one
The value you want a count of: 0 or 1. Defaults to 1 if not given.
Examples:
my
$vec
= str2bit
'01010010'
;
my
$ones_count
= howmany
$vec
;
# 3
my
$zeros_count
= howmany
$vec
, 0;
# 5
Note that howmany( $vec, 0 ) will include trailing zero bits.
METHODS
new()
This constructs a Data::Bvec object. Each object represents a single array of integers stored either as a bit vector, a compressed bit string, or an array.
Parameters:
All parameters to new() are named.
bvec=>$bit_vector
Stores a Perl bit vector in the object.
my
$vec
= str2bit
'01110011110001111100001111110001'
;
my
$bv
= Data::Bvec::->new(
bvec
=>
$vec
);
bstr=>$compressed_bit_string
Stores a compressed bit string in the object.
my
$bstr
= compress bit2str
$vec
;
my
$bv
= Data::Bvec::->new(
bstr
=>
$bstr
);
nums=>\@integers
Stores an array of integers in the object. The order of the array is retained when stored.
my
$nums
= bit2num
$vec
;
my
$bv
= Data::Bvec::->new(
nums
=>
$nums
);
bvec2nums=>$bit_vector
Accepts a bit vector and stores it as an array of integers (as $self->{nums}).
my
$bv
= Data::Bvec::->new(
bvec2nums
=>
$vec
);
nums2bvec=>\@integers
Accepts an array of integers and stores it as a bit vector (as $self->{bvec}). The order of the array is not retained.
my
$bv
= Data::Bvec::->new(
nums2bvec
=>
$nums
);
bvec2bstr=>$bit_vector
Accepts a bit vector and stores it as a compressed bit string (as $self->{bstr}).
my
$bv
= Data::Bvec::->new(
bvec2bstr
=>
$vec
);
bstr2bvec=>$compressed_bit_string
Accepts a compressed bit string and stores it as a bit vector (as $self->{bvec}).
my
$bv
= Data::Bvec::->new(
bstr2bvec
=>
$bstr
);
bstr2nums=>$compressed_bit_string
Accepts a compressed bit string and stores it as an array of integers (as $self->{nums}).
my
$bv
= Data::Bvec::->new(
bstr2nums
=>
$bstr
);
nums2bstr=>\@integers
Accepts an array of integers and stores it as a compressed bit string (as $self->{bstr}). The order of the array is not retained.
my
$bv
= Data::Bvec::->new(
nums2bstr
=>
$nums
);
get_bvec()
This routine takes no parameters. It returns a bit vector, regardless how the integers are stored. The object is not changed.
my
$vec
=
$bv
->get_bvec();
get_bstr()
This routine takes no parameters. It returns a compressed bit string, regardless how the integers are stored. The object is not changed.
my
$bstr
=
$bv
->get_bstr();
get_nums( $beg, $cnt )
This routine returns an array of integers, regardless how the integers are stored. The object is not changed.
Note that if the integers are stored as 'nums' (an array), get_nums() will return them in the same order as the array. If they are stored another way, they will be returned in ascending order.
my
@integers
=
$bv
->get_nums();
# list returned in list context
my
$ints
=
$bv
->get_nums();
# aref returned in scalar context
Parameters:
$beg
The first integer to return. Unlike array subscripts, the $beg positions start with 1, not 0. If no $beg is given, 1 is assumed.
$cnt
The maximum number of integers (including the first) to return. If no $cnt is given, the rest of the integers are returned.