NAME

Spreadsheet::XLSX::Reader::LibXML::SharedStrings - Read xlsx sharedStrings files with LibXML

SYNOPSIS

        #!/usr/bin/env perl
        $|=1;
        use Data::Dumper;
        use MooseX::ShortCut::BuildInstance qw( build_instance );
        use Spreadsheet::XLSX::Reader::LibXML::Error;
        use Spreadsheet::XLSX::Reader::LibXML::XMLReader::SharedStrings;

        my $file_instance = build_instance(
            package      => 'SharedStringsInstance',
            superclasses => ['Spreadsheet::XLSX::Reader::LibXML::XMLReader::SharedStrings'],
            file         => 'sharedStrings.xml',
            error_inst   => Spreadsheet::XLSX::Reader::LibXML::Error->new,
        );
        print Dumper( $file_instance->get_shared_string_position( 3 ) );
        print Dumper( $file_instance->get_shared_string_position( 12 ) );

        #######################################
        # SYNOPSIS Screen Output
        # 01: $VAR1 = {
        # 02:     'raw_text' => ' '
        # 03: };
        # 04: $VAR1 = {
        # 05:     'raw_text' => 'Superbowl Audibles'
        # 06: };
        #######################################

DESCRIPTION

This documentation explains how to use this module when writing your own Excel parser or extending this package. To use the general package for Excel parsing out of the box, please review the documentation for Workbooks, Worksheets, and Cells.

This general class is written to get useful data from the sub file 'sharedStrings.xml' that is a member of a zipped (.xlsx) archive. The file to be read is generally found in the xl/ sub folder of the zip archive. Sometimes it is found as a subset of a single xml tabular file. The sharedStrings.xml file contains a list of the unique strings (not numbers) used as values in the spreadsheet cells of Excel. Two strings count as the same entry only when the full cell text matches, including upper and lower case and font formatting. Strings that merely share some common part are stored as separate entries.
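As an illustration of that uniqueness rule, the following minimal sketch reads a hand-written shared strings fragment with XML::LibXML::Reader. The sample XML and the variable names are assumptions made for this example, and the loop is not the code this class uses internally. The three entries hold the same word but differ by case or by bold formatting, so each gets its own position.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hand-written sample data (an assumption for illustration only).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="4" uniqueCount="3">
<si><t>Hello</t></si>
<si><t>hello</t></si>
<si><r><rPr><b/></rPr><t>Hello</t></r></si>
</sst>
XML

my $reader = XML::LibXML::Reader->new( string => $xml );
my @strings;

# Each <si> element is one shared string position, counting from zero.
while ( $reader->nextElement( 'si' ) == 1 ) {
    my $node = $reader->copyCurrentNode( 1 );    # deep copy of the <si> subtree
    push @strings, $node->textContent;           # text only, formatting dropped
}

printf "position %d: %s\n", $_, $strings[$_] for 0 .. $#strings;
```

Positions 0 and 2 carry identical text once the formatting is dropped, yet they remain separate entries in the file because the bold run makes the third one distinct.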

This documentation is a general explanation of the SharedStrings class. The example uses a class built with an XMLReader version at the core. Documentation specific to that parser can be found in the ~XMLReader::SharedStrings documentation. To replace or augment this class you first need to understand how it is built on the fly using MooseX::ShortCut::BuildInstance. Next, fork this code on github. Then add or change the parts you want and re-point the package to use the new elements in the correct circumstance using the $parser_modules variable maintained in the Spreadsheet::XLSX::Reader::LibXML class (about line 35).

Required Method(s)

These are the primary way(s) to use this class. For additional SharedStrings options see the Attributes section. All replacement classes must provide these methods. Methods used to manipulate the attributes are listed in each attribute.

get_shared_string_position( $position )

    Definition: This will return the shared string for the requested position in a hash ref. (Counting from zero)

    Accepts: $position = an integer for the shared string position. (required)

    Returns: a hash ref with the key = 'raw_text' and the value = the stored string

Attributes

Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For more information on attributes see Moose::Manual::Attributes. It may be that these attributes migrate based on the reader type.

file

    Definition: This needs to be the full file path to the sharedStrings file or an opened file handle. When a file path is sent it is coerced to a file handle. The class then opens and reads the primary settings in the sharedStrings.xml file and keeps the file handle open for accessing specific sharedStrings position information.

    Required: Yes

    Default: none

    Range: an actual Excel 2007+ sharedStrings.xml file or open file handle (with the pointer set to the beginning of the file)

    attribute methods Methods provided to adjust this attribute

      get_file

        Definition: Returns the value (file handle) stored in the attribute

      set_file

        Definition: Sets the value (file handle) stored in the attribute. Then triggers a read of the file level unique bits.

      has_file

        Definition: predicate for the attribute

error_inst

    Definition: Currently all SharedStrings readers require an Error instance. In general the package will share an error instance reference between the workbook and all classes built during the initial workbook build.

    Required: Yes

    Default: none

    Range: The minimum list of methods to implement for your own instance is:

            error set_error clear_error set_warnings if_warn

    attribute methods Methods provided to adjust this attribute

      get_error_inst

        Definition: returns this instance

      error

        Definition: Used to get the most recently logged error

      set_error

        Definition: used to set a new error string

      clear_error

        Definition: used to clear the current error string in this attribute

      set_warnings

        Definition: used to turn on or off real time warnings when errors are set

      if_warn

        Definition: a method mostly used to extend this package and see if warnings should be emitted.
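For readers who want to supply their own error instance, here is a minimal sketch of a class that provides the five methods named in the Range above. The package name My::MinimalError, the constructor argument, and the internal hash keys are hypothetical, and the real attribute may enforce requirements beyond the method names, so treat this as an outline of the shape rather than a tested drop-in replacement.

```perl
use strict;
use warnings;

package My::MinimalError;    # hypothetical name, not part of the distribution

sub new {
    my ( $class, %args ) = @_;
    my $self = {
        error_string => '',
        should_warn  => $args{should_warn} ? 1 : 0,
    };
    return bless $self, $class;
}

# Return the most recently logged error
sub error { return $_[0]{error_string} }

# Store a new error string and optionally warn in real time
sub set_error {
    my ( $self, $message ) = @_;
    $self->{error_string} = $message;
    warn "$message\n" if $self->if_warn;
    return $message;
}

# Clear the current error string
sub clear_error { $_[0]{error_string} = ''; return }

# Turn real time warnings on (true) or off (false)
sub set_warnings { $_[0]{should_warn} = $_[1] ? 1 : 0; return }

# Report whether warnings should be emitted
sub if_warn { return $_[0]{should_warn} }

package main;

my $error_inst = My::MinimalError->new;
$error_inst->set_error( 'Requested position is past the end of the file' );
my $last_error = $error_inst->error;
print "$last_error\n";
$error_inst->clear_error;
```

Warnings are off by default in this sketch, so set_error only records the message until set_warnings( 1 ) is called.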

cache_positions

    Definition: For sheets with a lot of stored text the parser can slow way down when accessing each position. This is because an XML::LibXML Reader cannot rewind. It must start from the beginning and index through the file until it gets to the target position. This is complicated by the fact that the shared strings are not necessarily stored in a logical or cell order, especially for Excel sheets that have experienced any significant level of manual intervention prior to being read.

    By default this attribute turns on caching for shared strings so the parser only has to read through the shared strings once. When the read is complete all the way to the end it will also release the shared strings file in order to free up some space (a small win in exchange for the space taken by the cache). The tradeoff is that all intermediate shared strings are fully read before reading the target string, so early reads will be slower. For sheets that only have numbers stored, or that have very few strings, this will likely not be a large startup hit (or speed improvement).

    The risk is that the cache will impact memory. You can use this attribute to turn off caching, but a file large enough to need that will most likely make the sheet read slow way down. The tradeoff is that the parser shouldn't die. To minimize the physical size of the cache, if there is only a text string stored in the shared strings position then only the string will be stored (not the definition that only a string exists).

    Default: 1 = caching is on

    Range: 1|0

    Attribute required: yes

    attribute methods Methods provided to adjust this attribute
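The tradeoff described for cache_positions can be sketched with a stand-in for a reader that only moves forward. Everything in this example (the @file array, make_reader, get_uncached, get_cached) is hypothetical and written in plain Perl for illustration, and it is not the implementation used by this class. It counts how many entries are read to serve position 3 and then position 1, with and without a cache.

```perl
use strict;
use warnings;

my @file  = ( 'alpha', 'beta', 'gamma', 'delta' );    # stand-in for the <si> entries
my $reads = 0;                                        # entries pulled from the "file"

# Build a forward-only reader: it can advance but never rewind.
sub make_reader {
    my $next = 0;
    return sub {
        return if $next > $#file;                     # empty list at end of file
        $reads++;
        my $index = $next++;
        return ( $index, $file[$index] );
    };
}

# No cache: every request starts again from the top of the file.
sub get_uncached {
    my ( $position ) = @_;
    my $reader = make_reader();
    while ( my ( $index, $text ) = $reader->() ) {
        return { raw_text => $text } if $index == $position;
    }
    return;
}

# Cache: one reader is kept, and every string passed on the way is stored.
my @cache;
my $cached_reader = make_reader();
sub get_cached {
    my ( $position ) = @_;
    while ( $#cache < $position ) {
        my ( $index, $text ) = $cached_reader->() or last;
        $cache[$index] = $text;                       # only the bare string is kept
    }
    return defined $cache[$position] ? { raw_text => $cache[$position] } : undef;
}

get_uncached( 3 );
get_uncached( 1 );
my $uncached_reads = $reads;

$reads = 0;
get_cached( 3 );
my $second = get_cached( 1 );
my $cached_reads = $reads;

print "uncached reads: $uncached_reads, cached reads: $cached_reads\n";
```

In this sketch the first cached request pays for reading every intermediate entry, and the later request for an earlier position is served from memory without touching the reader again.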

no_formats

    Definition: Quite often the goal of reading a spreadsheet is to get at the data in the cells and not read the visible presentation of the sheet. If so reading the sharedStrings file can be sped up by skipping the stored text formatting when reading from the xml. This flag will manage that choice.

    Default: 0 = format reading is on

    Range: 0|1

    Attribute required: yes

    attribute methods Methods provided to adjust this attribute
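To show what skipping the stored text formatting means in practice, the sketch below parses one hand-written rich text entry with XML::LibXML. The sample XML, the variable names, and the layout of the @runs structure are assumptions for this example only and do not describe what this class returns. The text-only pass collects just the contents of the t elements, while the second pass also walks each run's rPr formatting children.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hand-written rich text entry (an assumption for illustration only).
my $si = XML::LibXML->load_xml( string => <<'XML' )->documentElement;
<si xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><r><rPr><b/><sz val="11"/></rPr><t>Super</t></r><r><t xml:space="preserve">bowl Audibles</t></r></si>
XML

# Text-only pass: join the <t> contents and ignore all formatting.
my $plain = join '', map { $_->textContent } $si->findnodes( './/*[local-name()="t"]' );

# Formatted pass: keep each run's text together with its formatting element names.
my @runs;
for my $run ( $si->findnodes( './*[local-name()="r"]' ) ) {
    my ( $text ) = map { $_->textContent } $run->findnodes( './*[local-name()="t"]' );
    my @formats  = map { $_->localname } $run->findnodes( './*[local-name()="rPr"]/*' );
    push @runs, { text => $text, formats => \@formats };
}

print "plain text: $plain\n";
print "run '$_->{text}' formats: @{ $_->{formats} }\n" for @runs;
```

The text-only pass touches fewer nodes per entry, which is the kind of work this flag is described as avoiding.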

SUPPORT

TODO

    1. Write a DOM version of the parser

AUTHOR

Jed Lund
jandrew@cpan.org

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

This software is copyrighted (c) 2014, 2015 by Jed Lund

DEPENDENCIES

SEE ALSO