The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Spreadsheet::ReadGnumeric - read a Gnumeric file, return Spreadsheet::Read

VERSION

Version 0.1

SYNOPSIS

    use Spreadsheet::ReadGnumeric;

    my $reader = Spreadsheet::ReadGnumeric->new
        (rc => 1, cells => 1);  # these are the defaults
    my $book = $reader->parse_file('spreadsheet.gnumeric');
    my $n_sheets = $book->[0]{sheets};
    my $sheet_1 = $book->[1];
    my $b12 = $sheet_1->{B12};
    # or equivalently:
    my $b12 = $sheet_1->{cell}[2][12];
    say $b12;   # => "not a vitamin", e.g.

Note that Gnumeric saves expressions in cells, and not the result of expressions, and even the expressions are sometimes encoded in a way that Spreadsheet::ReadGnumeric cannot not completely unparse, so the returned data structure will look less well populated than the spreadsheet does in Gnumeric. See the "ExprID" TO DO item below.

METHODS

After creating an instance with "new", the public entrypoint for parsing a Gnumeric file is "parse_file", and for a stream is "parse_gnumeric_stream".

The rest of these deal with the XML parsing mechanism and are probably not of interest (unless you need to write a subclass in order to extract more information).

cells

Slot that contains the value of the Spreadsheet::Read "cells" option. When true, this populates the $sheet->{$cell_name} with the corresponding cell data, where $cell_name is an alphanumeric cell name such as "B17" or "AA3". Defaults to a true value.

chars

Slot that contains the text content of an XML element being parsed.

current_attrs

Slot that contains the attribute of an element being parsed.

current_elt

Slot that contains the name of an element being parsed, including any namespace prefix (see namespaces).

element_stack

Slot that contains an arrayref that is a stack of containing element contexts. The last entry in this is the parent of the current element being parsed, and is itself an arrayref of the values of

    [ current_elt, current_attrs, chars, namespaces ]

while that containing element was being parsed.

gzipped_p

Slot that determines whether parse_file will test whether its argument is gzipped. There are three possibilities: Undefined, other false, and true. If undefined, then parse_file will open the file, make the test, and then reset the stream to the beginning (see "stream_gzipped_p"). Otherwise, parse_file will take the caller's word for it.

namespaces

Slot that contains a hash of namespace prefix to defining URL, for all prefixes in scope. These are never modified by side effect, so that they can be copied into inner scopes. We use this to decide whether any given name is really in the schema we care about.

new

Creates a new Spreadsheet::ReadGnumeric instance. Keywords are any of the method names listed as slots.

parse

Given an input source, reads it and returns the Spreadsheet::Read-compatible data structure, constructed according to the options set up at instantiation time. If the "gzipped_p" slot is defined or the content starts with gzip magic (see "stream_gzipped_p"), which is the normal state of saved Gnumeric spreadsheets, it is uncompressed while being read.

The input can be a reference (which is taken to be an open stream), a file name, or a literal string that contains the data. If given an non-seekable stream, the stream_gzipped_p slot must be defined in order to skip the test for compressed input.

rc

Slot that contains the value of the Spreadsheet::Read "rc" option. When true, this populates the $sheet->{cell}[$col][$row] arrays. Defaults to a true value.

sheet

Slot that contains the sheet currently under construction.

sheets

Read-only slot that contains the resulting Spreadsheet::Read-compatible data structure. This slot is initialized when its value is first requested; it is expected that the value is then updated by side-effect.

stream_gzipped_p

Given a stream open to the start of the data and an optional file name, determine whether we need to uncompress the data and then reset the stream back to the beginning. If the gzipped_p slot is defined, then we just return that and leave the stream untouched. The file name is only used for error messages.

INTERNALS

Spreadsheet::ReadGnumeric uses XML::Parser::Lite to decode the XML into its component elements, calling the _process_Foo_elt method at each </Foo> closing tag (explicit or implicit), where "Foo" is the case-sensitive element name. If no such method exists, the element is ignored, though its content may already have been processed. For instance, we don't care about <Sheets> because it exists only as a container for <Sheet> elements, which we do process.

Elements with XMLNS prefixes (using the ":" separator) must be associated with the Gnumeric schema in order to be visible to this parsing mechanism. Elements without prefixes are assumed to be part of the Gnumeric schema (but the only Gnumeric files I've seen so far use prefixes on all elements). Note that element names stored in current_elt and element_stack have already been stripped of any Gnumeric-specific prefixes.

When the _process_Foo_elt method is called, it can examine the "current_attrs" hash for its attributes and the "chars" string for its character content; all element content will already have been processed. If necessary, it can examine the "element_stack" slot to check context. For example, _process_Name_elt needs to do this because it only processes Sheet element names. The element_stack also makes the attributes of enclosing elements available.

Consequently, extending Spreadsheet::ReadGnumeric to extract more information out of Gnumeric files should be a simple matter of defining a subclass with suitable _process_*_elt methods that examine their context, extract their content, and stuff the result into the sheet and/or sheets slot values accordingly. At least that's my theory, and I'm sticking to it.

TODO

Note that the "clip", "strip", and "pivot" options of Spreadsheet::Read are handled by its _clipsheets sub, so I do not intend to do anything about them here.

These are not in any necessary order of importance.

  • Handle "ExprID". These are Cell attributes, e.g.:

            <gnm:Cell Row="10" Col="7" ExprID="1"/>

    Looking at the test data, they seem to indicate that the missing content is the expression in the earlier cell with the same ID after shifting by the location difference. This requires parsing equations, though.

  • Maybe implement the "attr" option (could be expensive).

BUGS

Please report any bugs or feature requests to bug-spreadsheet-readgnumeric at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Spreadsheet-ReadGnumeric.

SEE ALSO

The Gnumeric Spreadsheet

See http://www.gnumeric.org/.

Spreadsheet::Read

Frontend to multiple spreadsheet formats, with additional options for manipulating the result, plus the option to give them object capabilities. This module also describes the output data structure in depth.

The Gnumeric File Format

This is a PDF file by David Gilbert dated 2001-11-05, available at https://www.jfree.org/jworkbook/download/gnumeric-xml.pdf (GFDL). I am not aware of any more recent version.

AUTHOR

Bob Rogers <rogers at rgrjr.com>, based heavily on the Spreadsheet::Read architecture.

LICENSE AND COPYRIGHT

This software is Copyright (c) 2022 by Bob Rogers.

This is free software, licensed under:

  The Artistic License 2.0 (GPL Compatible)