The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Spreadsheet::ReadGnumeric - read a Gnumeric file, return Spreadsheet::Read

VERSION

Version 0.4

SYNOPSIS

    use Spreadsheet::ReadGnumeric;

    my $reader = Spreadsheet::ReadGnumeric->new
        (rc => 1, cells => 1,   # these are the defaults
         process_formulas => 0);
    my $book = $reader->parse('spreadsheet.gnumeric');
    my $n_sheets = $book->[0]{sheets};
    my $sheet_1 = $book->[1];
    my $b12 = $sheet_1->{B12};
    # or equivalently:
    my $b12 = $sheet_1->{cell}[2][12];
    say $b12;   # => "not a vitamin", e.g.

DESCRIPTION

Given a source of saved Gnumeric data, Spreadsheet::ReadGnumeric parses it and returns the result in the same format as Spreadsheet::Read. In fact, Spreadsheet::Read has a few more bells and whistles, such as transposing the resulting sheets and providing an object-oriented interface to the resulting data, so it may be easier to access this module through Spreadsheet::Read, even if you only want to parse Gnumeric data.

Note that Gnumeric only saves raw cell values, and not their formatted versions. In particular, Gnumeric saves formulas, and not formula values, never mind formatted formula values.

METHODS

After creating an instance with "new", call "parse" with a file name, a stream, or a string containing the spreadsheet data. The "attr", "cells", "convert_colors" "gzipped_p", "merge", "minimal_attributes", and "rc" slots control how Spreadsheet::ReadGnumeric parses its input. They may also be used as initializer keywords to the "new" method.

The rest of these deal with the XML parsing mechanism and are probably not of interest (unless you need to write a subclass in order to extract more information).

attr

Slot that, if true, directs the parsing methods to retain style information in the returned $sheet->{attr} array of each sheet. If the value of this slot is "keep", an arrayref of each sheet's Spreadsheet::Gnumeric::StyleRegion instances is saved in $sheet->{style_regions}.

The attr value is normally supplied to the new method as an initializer. See the Spreadsheet::Read Cell Attributes documentation for details of the resulting attributes. See Spreadsheet::Gnumeric::StyleRegion for details of how attribute conversion is performed.

See also the "merge" slot, which requests additional processing of cell merge information.

cells

Slot that contains the value of the Spreadsheet::Read "cells" option. When true, this populates the $sheet->{$cell_name} with the corresponding cell data, where $cell_name is an alphanumeric cell name such as "B17" or "AA3". Defaults to a true value.

chars

Slot that contains the text content of an XML element being parsed.

convert_colors

Slot that contains a boolean which, if defined and true, causes Gnumeric colors to be converted to the more common HTML-style format. Gnumeric specifies colors as RGB triples using hex numbers of one to four digits for each primary color, separated by colons, so that white is "FFFF:FFFF:FFFF", and cyan is "0:FC00:F800". Conversion pads each primary to four digits, takes the first two, and prepends "#", producing "#FFFFFF" for white and "#00FCF8" for cyan.

For Spreadsheet::Read compatibility, the default is true. This value is ignored unless "attr" is true.

current_attrs

Slot that contains the attribute of an element being parsed.

current_elt

Slot that contains the name of an element being parsed, including any namespace prefix (see namespaces).

element_stack

Slot that contains an arrayref that is a stack of containing element contexts. The last entry in this is the parent of the current element being parsed, and is itself an arrayref of the values of

    [ current_elt, current_attrs, chars, namespaces ]

while that containing element was being parsed.

find_style_for_cell

Given zero-based column and row indices, look in "style_regions" for a suitable Spreadsheet::Gnumeric::StyleRegion instance, and return its attribute hash.

formulas

Slot that contains an arrayref indexed by the "ExprID" attribute of formula cells. Each entry is itself an arrayref of four elements, the zero-based column and row indices where the formula is defined, the original formula as a string (complete with leading "="), and the tokenized formula; this is initially "undef" and only filled in if the formula is referenced elsewhere. The "process_formulas" slot describes how this is done.

gzipped_p

Slot that determines whether parse will test whether its argument is gzipped. There are three possibilities: Undefined, other false, and true. If undefined, then parse will open the file, make the test, and then reset the stream to the beginning (see "stream_gzipped_p"). Otherwise, parse will take the caller's word for it.

merge

Slot that requests additional processing of cell merging information. If true, (a) the value of the merged attribute for merged cells is the name of the upper left corner of the rectangle of merged cells, and (b) the contents and attributes of that upper-left cell are copied into all of the others. The default is false.

This flag is ignored unless the "attr" slot is also true.

minimal_attributes

Slot that, if true, causes attributes to be filled in only for cells that have content, i.e. actually appear as a <Cell> in the file with a nonempty string string. Otherwise, attributes are filled in for the full $sheet->{maxrow} vs $sheet->{maxcol} array. The default is false.

This flag is ignored unless the "attr" slot is also true.

namespaces

Slot that contains a hash of namespace prefix to defining URL, for all prefixes in scope. These are never modified by side effect, so that they can be copied into inner scopes. We use this to decide whether any given name is really in the schema we care about.

new

Creates a new Spreadsheet::ReadGnumeric instance. Keywords are any of the method names listed as slots.

parse

Given an input source, reads it and returns the Spreadsheet::Read-compatible data structure, constructed according to the options set up at instantiation time. If the "gzipped_p" slot is defined and true or the content starts with gzip magic (see "stream_gzipped_p"), which is the normal state of saved Gnumeric spreadsheets, it is uncompressed while being read. If gzipped_p is not defined, the stream is read to see how it starts and then rewound.

The input can be a reference (which is taken to be an open stream), a file name, or a literal string that contains the data. If given an non-seekable stream, the stream_gzipped_p slot must be defined in order to skip the test for compressed input.

process_formulas

Slot which, if true, requests that spreadsheet parsing should go to the trouble of reconstituting formulas. The default is false, which may change, since formula processing probably doesn't cost very much.

When Gnumeric copies a formula, it offsets the cell names referenced in the formula by the change in location. When the spreadsheet is saved, if neither formula is modified, the first time the equation is mentioned it is saved as a string with a unique "ExprID" and the second time the same "ExprID" is cited and the string is empty. If process_formulas is false, the second and all subsequent references to the formula are left empty, but if process_formulas is true, the correct equation is reconstituted for each cell in the same way the copies were created by Gnumeric.

rc

Slot that contains the value of the Spreadsheet::Read "rc" option. When true, this populates the $sheet->{cell}[$col][$row] arrays. Defaults to a true value.

sheet

Slot that contains the sheet currently under construction.

sheets

Read-only slot that contains the resulting Spreadsheet::Read-compatible data structure. This slot is initialized when its value is first requested; it is expected that the value is then updated by side-effect.

stream_gzipped_p

Given a stream open to the start of the data and an optional file name, determine whether we need to uncompress the data and then reset the stream back to the beginning. If the gzipped_p slot is defined, then we just return that and leave the stream untouched. The file name is only used for error messages.

style_attributes

Internal slot used to store a hashref of spreadsheet attributes during parsing.

style_regions

Slot that contains an arrayref of Spreadsheet::Gnumeric::StyleRegion instances, built by the style parsing code for each sheet and cleared when the sheet is completed.

INTERNALS

Spreadsheet::ReadGnumeric uses XML::Parser::Lite to decode the XML into its component elements, calling the _process_Foo_elt method at each </Foo> closing tag (explicit or implicit), where "Foo" is the case-sensitive element name. If no such method exists, the element is ignored, though its content may already have been processed. For instance, we don't care about <Sheets> because it exists only as a container for <Sheet> elements, which we do process.

Elements with XMLNS prefixes (using the ":" separator) must be associated with the Gnumeric schema in order to be visible to this parsing mechanism. Elements without prefixes are assumed to be part of the Gnumeric schema (but the only Gnumeric files I've seen so far use prefixes on all elements). Note that element names stored in current_elt and element_stack have already been stripped of any Gnumeric-specific prefixes.

When the _process_Foo_elt method is called, it can examine the "current_attrs" hash for its attributes and the "chars" string for its character content; all element content will already have been processed. If necessary, it can examine the "element_stack" slot to check context. For example, _process_Name_elt needs to do this because it only processes Sheet element names. The element_stack also makes the attributes of enclosing elements available.

Consequently, extending Spreadsheet::ReadGnumeric to extract more information out of Gnumeric files should be a simple matter of defining a subclass with suitable _process_*_elt methods that examine their context, extract their content, and stuff the result into the sheet and/or sheets slot values accordingly. At least that's my theory, and I'm sticking to it.

TODO

Note that the "clip", "strip", and "pivot" options of Spreadsheet::Read are handled by its _clipsheets sub, so I do not intend to do anything about them here.

BUGS

Please report any bugs or feature requests to bug-spreadsheet-readgnumeric at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Spreadsheet-ReadGnumeric.

SEE ALSO

The Gnumeric Spreadsheet

See http://www.gnumeric.org/.

Spreadsheet::Read

Frontend to multiple spreadsheet formats, with additional options for manipulating the result, plus the option to give them object capabilities. This module also describes the output data structure in depth.

The Gnumeric File Format

This is a PDF file by David Gilbert dated 2001-11-05, available at https://www.jfree.org/jworkbook/download/gnumeric-xml.pdf (GFDL). I am not aware of any more recent version.

AUTHOR

Bob Rogers <rogers at rgrjr.com>, based heavily on the Spreadsheet::Read architecture.

LICENSE AND COPYRIGHT

This software is Copyright (c) 2023 by Bob Rogers.

This is free software, licensed under:

  The Artistic License 2.0 (GPL Compatible)