The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

File::Tabular - searching and editing flat tabular files

SYNOPSIS

  use File::Tabular;
  my $f = new File::Tabular($filename);

  my $row = $f->fetchrow;
  print $row->{field1}, $row->{field2};

  $row = $f->fetchrow(where => 'someWord');
  $row = $f->fetchrow(where => 'field1 > 4 AND field2 >= "01.01.2001"');
  $row = $f->fetchrow(where => qr/some\s+(complex\s*)?(regex|regular expression)/i);

  $f->rewind;
  my $rows = $f->fetchall(where => 'someField =~ ^[abc]+');
  print $_->{someField} foreach @$rows;

  $f->rewind;
  $rows = $f->fetchall(where => '+field1:someWord -field2:otherWord',
                       orderBy => 'field3, field6:num, field5:-alpha');

  $f->rewind;
  my $hashRows = $f->fetchall(where   => 'foo AND NOT bar',
                              key     => 'someField');
  print $hashRows->{someKey}{someOtherField};

  # open for updates, and remember the updates in a journal file
  $f = new File::Tabular("+<$filename", {journal => ">>$journalFile"});

  # updates at specific positions (line numbers)
  $f->splices(4  => 2, undef,   # delete 2 lines from position 4
              7  => 1, {f1 => $v1, f2 => $v2, ...}, # replace line 7
              9  => 0, { ...},   # insert 1 new line at position 9
              22 => 0, [{...}, {...}, ...] # insert several lines at pos. 22
              ...
              -1 => 0, [{...}, {...}, ...] # append at the end
           );

  # shorthand to add new data at the end
  $f->append({f1 => $v1, f2 => $v2, ...});
  # same thing, but use the "Hash::Type" associated to the file
  $f->append($f->ht->new($v1, $v2, ...)); 


  $f->clear;                    # removes all data (but keeps the header line)

  # updates at specific keys, corresponding to @keyFields
  $f->writeKeys({key1 => {f1 => $v1, f2 => $v2, ...}, # add or update
                 key2 => undef,                       # remove 
                 ...
                 }, @keyFields);


  # replay the updates on a backup file
  my $bck = new File::Tabular("+<$backupFile");
  $bck->playJournal($journalFile);

  # get info from associated filehandle
  printf "%d size, %d blocks", $f->stat->{size}, $f->stat->{blocks};
  my $mtime = $f->mtime;
  printf "time last modified : %02d:%02d:%02d", @{$mtime}{qw(hour min sec)};

DESCRIPTION

A tabular file is a flat text file containing data organised in rows (records) and columns (fields).

This module provides database-like functionalities for managing tabular files : retrieving, searching, writing, autonumbering, journaling. However, unlike other modules like DBD::CSV, it doesn't try to make it look like a database : rather, the API was designed specifically for work with tabular files. Instead of SQL, search queries are specified in a web-like fashion, with support for regular expressions and cross-field searches. Queries are compiled internally into perl closures before being applied to every data record, which makes it quite fast.

Write operations take a list of modifications as argument; then they apply the whole list atomically in a single rewrite of the data file.

Here are some of the reasons why you might choose to work with a tabular file rather than a regular database :

  • no need to install a database system (not even buy one)!

  • easy portability and data exchange with external tools (text editor, spreadsheet, etc.)

  • search queries immediately ready for a web application

  • good search performance, even with several thousand records

On the other hand, tabular files will probably be inappropriate if you need very large volumes of data, complex multi-table data models or frequent write operations.

METHODS

new (open1, open2, ..., {opt1 => v1, opt2 => v2, ...})

Creates a new tabular file object. The list of arguments open1, open2, ... is fed directly to "open" in perlfunc for opening the associated file. Can also be a reference to an already opened filehandle.

The final hash ref is a collection of optional parameters, taken from the following list :

fieldSep

field separator : any character except '%' ('|' by default). Escape sequences like \t are admitted.

recordSep

record separator ('\n' by default).

fieldSepRepl

string to substitute if fieldSep is met in the data. (by default, url encoding of fieldSep, i.e. '%7C' )

recordSepRepl

string to substitute if recordSep is met in the data (by default, url encoding of recordSep, i.e. '%0A' )

autoNumField

name of field for which autonumbering is turned on (none by default). This is useful to generate keys : when you write a record, the character '#' in that field will be replaced by a fresh number, incremented automatically. This number will be 1 + the largest number read so far (it is your responsability to read all records before the first write operation).

autoNum

initial value of the counter for autonumbering (1 by default).

autoNumChar

character that will be substituted by an autonumber when writing records ('#' by default).

flockMode

mode for locking the file, see "flock" in perlfunc. By default, this will be LOCK_EX if open1 contains '>' or '+<', LOCK_SH otherwise.

flockAttempts

Number of attempts to lock the file, at 1 second intervals, before returning an error. Zero by default. If nonzero, LOCK_NB is added to flockMode; if zero, a single locking attempt will be made, blocking until the lock is available.

headers

reference to an array of field names. If not present, headers will be read from the first line of the file.

printHeaders

if true, the headers will be printed to the file. If not specified, treated as 'true' if open1 contains '>'.

journal

name of journaling file, or reference to a list of arguments for "open" in perlfunc. The journaling file will log all write operations. If specified as a simple file name, it will be be opened in '>>' mode.

A journal file can then be replayed through method "playJournal" (this is useful to recover after a crash, by playing the journal on a backup copy of your data).

rxDate

Regular expression for matching a date. Default value is qr/^\d\d?\.\d\d?\.\d\d\d?\d?$/. This will be used by "compileFilter" to perform appropriate comparisons.

date2str

Ref to a function for transforming dates into strings suitable for sorting (i.e. year-month-day). Default is :

 sub {my ($d, $m, $y) = ($_[0] =~ /(\d\d?)\.(\d\d?)\.(\d\d\d?\d?)$/);
      $y += ($y > 50) ? 1900 : 2000 if defined($y) && $y < 100;
      return sprintf "%04d%02d%02d", $y, $m, $d;}
rxNum

Regular expression for matching a number. Default value is qr/^[-+]?\d+(?:\.\d*)?$/. This will be used by "compileFilter" to perform appropriate comparisons.

preMatch/postMatch

Strings to insert before or after a match when filtering rows (will only apply to search operator ':' on the whole line, i.e. query "foo OR bar" will highlight both "foo" and "bar", but query "~ 'foo' OR someField:bar" will not highlight anything; furthermore, a match-all request containing just '*' will not highlight anything either).

avoidMatchKey

If true, searches will avoid to match on the first field. So a request like $ft->fetchall(where => '123 OR 456') will not find the record with key 123, unless the word '123' appears somewhere in the other fields. This is useful when queries come from a Web application, and we don't want users to match a purely technical field.

This search behaviour will not apply to regex searches. So requests like $ft->fetchall(where => qr/\b(123|456)\b/) or $ft->fetchall(where => ' ~ 123 OR ~ 456') will actually find the record with key 123.

fetchrow(where => filter)

returns the next record matching the (optional) filter. If there is no filter, just returns the next record.

The filter is either a code reference generated by "compileFilter", or a string which will be automatically fed as argument to "compileFilter"; this string can contain just a word, a regular expression, a complex boolean query involving field names and operators, etc., as explained below.

fetchall(where => filter, orderBy => cmp)
fetchall(where => filter, key => keySpecif)

finds all next records matching the (optional) filter. If there is no filter, finds all remaining records.

The filter is either a code reference generated by "compileFilter", or a string which will be automatically fed as argument to "compileFilter".

The return value depends on context and on arguments :

  • if no key parameter is given, and we are in a scalar context, then fetchall returns a reference to an array of records.

    The optional orderBy parameter can be a field name, a ref to a list of field names, a string like "field1: -alpha, field2:-num, ...", or, more generally, a user-provided comparison function; see "cmp" in Hash::Type for a fully detailed explanation.

    Otherwise, the resulting array is in data source order.

  • if no key parameter is given, and we are in a list context, then fetchall returns a pair : the first item is a reference to an array of records as explained above ; the second item is a reference to an array of line numbers corresponding to those records (first data line has number 0). These line numbers might be useful later if you update the records through the "splices" method. No orderBy is allowed if fetchall is called in list context.

  • if a key parameter is given, then fetchall returns a reference to a hash, whose values are the retrieved records, and whose keys are built according to the keySpecif argument. This must be either a single field name (scalar), or a a list of field names (ref to an array of scalars). Values corresponding to those field names will form the key for each entry of the hash; if necessary, multiple values are joined together through $;. No orderBy argument is allowed, because hashes have no ordering.

rewind

Rewinds the file to the first data line (after the headers)

ht

Returns the instance of Hash::Type associated with the file.

headers

returns the list of field names

stat

returns a hash ref corresponding to a call of stat on the associated filehandle. Keys of the hash have names as documented in stat. Ex:

     printf "%d size, %d blocks", $f->stat->{size}, $f->stat->{blocks};
atime, mtime, ctime

each of these methods returns a hash ref corresponding to a call of localtime on the last access time, last modified time, or last inode change time of the associated filehandle (see stat for explanations). Keys of the hash have names as documented in localtime. Ex:

  my $mtime = $f->mtime;
  printf "time last modified : %02d:%02d:%02d", @{$mtime}{qw(hour min sec)};
splices
  splices(pos1 => 2, undef,           # delete 2 lines
          pos2 => 1, row,             # replace 1 line
          pos3 => 0, [row1, row2 ...] # insert lines
              ...
          -1   => 0, [row1, ...     ] # append lines
           );

           # special case : autonum if pos== -1

Updates the data, in a spirit similar to "splice" in perlfunc (hence the name of the method). The whole file is rewritten in an atomic operation, deleting, replacing or appending data lines as specified by the "splice instructions". Returns the number of "splice instructions" performed.

A splice instruction is a triple composed of :

  1. a position (line number) that specifies the place where modifications will occur. Line numbers start at 0. Position -1 means end of data.

  2. a number of lines to delete (might be zero).

  3. a ref to a hash or to a list of hashes containing new data to insert (or undef if there is no new data).

If there are several splice instructions, their positions must be sorted in increasing order (except of course position -1, meaning "end of data", which must appear last).

Positions always refer to line numbers in the original file, before any modifications. Therefore, it makes no sense to write

  splices(10 => 5, undef,
          12 => 0, $myRow)

because after deleting 5 rows at line 10, we cannot insert a new row at line 12.

The whole collection of splice instructions may also be passed as an array ref instead of a list.

If you intend to fetch rows again after a splice, you must rewind the file first.

append(row1, row2, ...)

This appends new records at the end of data, i.e. it is a shorthand for

  splices(-1 => 0, [row1, row2, ...])
clear

removes all data (but keeps the header line)

writeKeys({key1 => row1, key2 => ...}, @keyFields)

Rewrites the whole file, applying modifications as specified in the hash ref passed as first argument. Keys in this hash are compared to keys built from the original data, according to @keyFields. Therefore, row1 may replace an existing row, if the key corresponding to key1 was found ; otherwise, a new row is added. If row1 is undef, the corresponding row is deleted from the file.

@keyFields must contain the name of one or several fields that build up the primary key. For each data record, the values corresponding to those fields are taken and joined together through $;, and then compared to key1, key2, etc.

If you intend to fetch rows again after a writeKeys, you must rewind the file first.

playJournal(open1, open2, ...)

Reads a sequence of update instructions from a journal file and applies them to the current tabular file. Arguments open1, open2, ... will be passed to perl open for opening the journal file ; in most cases, just give the filename.

The journal file must contain a sequence of instructions as encoded by the automatic journaling function of this module ; to activate journaling, see the journal parameter of the "new" method.

compileFilter(query [, implicitPlus])

Compiles a query into a filter (code reference) that can be passed to "fetchrow" or "fetchall".

The query can be

  • a regular expression compiled through qr/.../. The regex will be applied to whole data lines, and therefore covers all fields at once. This is the fastest way to filter lines, because it avoids systematic splitting into data records.

  • a data structure resulting from a previous call to Search::QueryParser::parse

  • a string of shape K_E_Y : value (without any spaces before or after ':'). This will be compiled into a regex matching value in the first column. The special spelling is meant to avoid collision with a real field hypothetically named 'KEY'.

  • a string that will be analyzed through Search::QueryParser, and then compiled into a filter function. The query string can contain boolean combinators, parenthesis, comparison operators, etc., as documented in Search::QueryParser. The optional second argument implicitPlus is passed to Search::QueryParser::parse ; if true, an implicit '+' is added in front of every query item (therefore the whole query is a big AND).

    Notice that in addition to usual comparison operators, you can also use regular expressions in queries like

      +field1=~'^[abc]+' +field2!~'foobar$'

    The query compiler needs to distinguish between word and non-word characters ; therefore it is important to use locale in your scripts (see perllocale). The compiler tries to be clever about a number of details :

    looking for complete words

    Words in queries become regular expressions enclosed by \b (word boundaries) ; so a query for foo OR bar will not match foobar.

    supports * for word completion

    A '*' in a word is compiled into regular expression \w* ; so queries foo* or *bar will both match foobar.

    case insensitive, accent-insensitive

    Iso-latin-1 accented characters are translated into character classes, so for example hétaïre becomes qr/h[ée]ta[ïi]re/i. Furthermore, as shown in this example, the i flag is turned on (case-insensitive). Therefore this query will also match HETAIRE.

    numbers and dates in operators

    When compiling a subquery like fieldname >= 'value', the compiler checks the value against rxNum and rxDate (as specified in the "new" method). Depending on these tests, the subquery is translated into a string comparison, a numerical comparison, or a date comparison (more precisely, {date2str($a) cmp date2str($b)}).

    set of integers

    Operator # means comparison with a set of integers; internally this is implemented with a bit vector. So query Id#2,3,5,7,11,13,17 will return records where field Id contains one of the listed integers. The field name may be omitted if it is the first field (usually the key field).

    pre/postMatch

    Words matched by a query can be highlighted; see parameters preMatch and postMatch in the "new" method.

AUTHOR

Laurent Dami, <laurent.dami AT etat ge ch>

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

1 POD Error

The following errors were encountered while parsing the POD:

Around line 1021:

Non-ASCII character seen before =encoding in 'C<hétaïre>'. Assuming CP1252