NAME

Spreadsheet::Reader::ExcelXML - Read xlsx/xlsm/xml extention Excel files

SYNOPSIS

The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder of the package

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Spreadsheet::Reader::ExcelXML;

        my $parser   = Spreadsheet::Reader::ExcelXML->new();
        my $workbook = $parser->parse( 'TestBook.xlsx' );

        if ( !defined $workbook ) {
                die $parser->error(), "\n";
        }

        for my $worksheet ( $workbook->worksheets() ) {

                my ( $row_min, $row_max ) = $worksheet->row_range();
                my ( $col_min, $col_max ) = $worksheet->col_range();

                for my $row ( $row_min .. $row_max ) {
                        for my $col ( $col_min .. $col_max ) {

                                my $cell = $worksheet->get_cell( $row, $col );
                                next unless $cell;

                                print "Row, Col    = ($row, $col)\n";
                                print "Value       = ", $cell->value(),       "\n";
                                print "Unformatted = ", $cell->unformatted(), "\n";
                                print "\n";
                        }
                }
                last;# In order not to read all sheets
        }

        ###########################
        # SYNOPSIS Screen Output
        # 01: Row, Col    = (0, 0)
        # 02: Value       = Category
        # 03: Unformatted = Category
        # 04:
        # 05: Row, Col    = (0, 1)
        # 06: Value       = Total
        # 07: Unformatted = Total
        # 08:
        # 09: Row, Col    = (0, 2)
        # 10: Value       = Date
        # 11: Unformatted = Date
        # 12:
        # 13: Row, Col    = (1, 0)
        # 14: Value       = Red
        # 16: Unformatted = Red
        # 17:
        # 18: Row, Col    = (1, 1)
        # 19: Value       = 5
        # 20: Unformatted = 5
        # 21:
        # 22: Row, Col    = (1, 2)
        # 23: Value       = 2017-2-14 #(shows as 2/14/2017 in the sheet)
        # 24: Unformatted = 41318
        # 25:
        # More intermediate rows ...
        # 82:
        # 83: Row, Col    = (6, 2)
        # 84: Value       = 2016-2-6 #(shows as 2/6/2016 in the sheet)
        # 85: Unformatted = 40944
        ###########################

DESCRIPTION

This is an Excel spreadsheet reading package that should parse all excel files with the extentions .xlsx, .xlsm, .xml Excel 2003 xml (SpreadsheetML ) that can be opened in the Excel 2007+ applications. The quick-start example provided in the SYNOPSIS attempts to follow the example from Spreadsheet::ParseExcel (.xls binary file reader) as close as possible. There are additional methods and other approaches that can be used by this package for spreadsheet reading but the basic access to data from newer xml based Excel files can be as simple as above.

This is not the only perl package able to parse .xlsx files on METACPAN. For now it does appear to be the only package that will parse .xlsm and Excel 2003 .xml workbooks.

There is some documentation throughout this package for users who intend to extend the package but the primary documentation is intended for the person who uses the package as is. Parsing through an Excel workbook is done with three levels of classes;

Workbook level (This doc)

General attribute settings that affect parsing of the file in general
The place to set workbook level output formatting
Object methods to retreive document level metadata and worksheets

Worksheet level

Object methods to return specific cell instances/data
Access to some worksheet level format information (more access pending)
The place to customize data output formats targeting specific cell ranges

Cell level

Access to the cell contents
Access to the cell formats (more access pending)

There are some differences from the Spreadsheet::ParseExcel package. For instance in the SYNOPSIS the '$parser' and the '$workbook' are actually the same class for this package. You could therefore combine both steps by calling ->new with the 'file' attribute called out. The test for load success would then rely on the method file_opened. Afterward it is still possible to call ->error on the instance. Another difference is the data formatter and specifically date handling. This package leverages Spreadsheet::Reader::Format to allows for a simple pluggable custom output format that is very flexible as well as handling dates in the Excel file older than 1-January-1900. I leveraged coercions from Type::Tiny to do this but anything that follows that general format will work here.

The why and nitty gritty of design choices I made are in the Architecture Choices section. Some pitfalls are outlined in the Warnings section. Read the full documentation for all opportunities!

Primary Methods

These are the primary ways to use this class. They can be used to open a workbook, investigate information at the workbook level, and provide ways to access sheets in the workbook.

All methods are object methods and should be implemented on the object instance.

Example:

        my @worksheet_array = $workbook_instance->worksheets;

parse( $file_name|$file_handle, $formatter )

Definition: This is a convenience method to match "parse($filename, $formatter)" in Spreadsheet::ParseExcel. It is one way to set the file attribute [and the formatter_inst attribute].

Accepts:

        $file = see the L<file|/file> attribute for valid options (required) (required)
        [$formatter] = see the L<formatter_inst|/formatter_inst> attribute for valid options (optional)

Returns: an instance of the package (not cloned) when passing with the xlsx file successfully opened or undef for failure.

worksheets

Definition: This method will return an array (not an array reference) containing a list of references to all worksheets in the workbook as objects. This is not a reccomended method because it builds all worksheet instance and returns an array of objects. It is provided for compatibility to Spreadsheet::ParseExcel. For alternatives see the get_worksheet_names method and the worksheet methods. It also only returns the tabular worksheets in the workbook. All chart sheets are ignored!

Accepts: nothing

Returns: an array ref of Worksheet objects for all worksheets in the workbook.

worksheet( $name )

Definition: This method will return an object to read values in the identified worksheet. If no value is passed to $name then the 'next' worksheet in physical order is returned. 'next' will NOT wrap It also only iterates through the 'worksheets' in the workbook (not the 'chartsheets').

Accepts: the $name string representing the name of the worksheet object you want to open. This name is the word visible on the tab when opening the spreadsheet in Excel. (not the underlying zip member file name - which can be different. It will not accept chart tab names.)

Returns: a Worksheet object with the ability to read the worksheet of that name. It returns undef and sets the error attribute if a 'chartsheet' is requested. Or in 'next' mode it returns undef if past the last sheet.

Example: using the implied 'next' worksheet;

        while( my $worksheet = $workbook->worksheet ){
                print "Reading: " . $worksheet->name . "\n";
                # get the data needed from this worksheet
        }

file_name

Definition: If you pass a file $location/$name string to the attribute file then before the package converts it to a file handle it will store the string. You can retreive that string with this method. This is true if you pass a string to the parse method as well.

Accepts: nothing

Returns: the $location/$name file string if available.

file_opened

Definition: This method is the test for success that should be used when opening a workbook using the ->new method. This allows for the object to store the error without dying entirely.

Accepts: nothing

Returns: 1 if the workbook file was successfully opened

Example:

        use Spreadsheet::Reader::ExcelXML qw( :just_the_data );

        my $workbook  = Spreadsheet::Reader::ExcelXML->new( file => 'TestBook.xlsx' );

        if ( !$workbook->file_opened ) {
                die $workbook->error(), "\n";
        }

        for my $worksheet ( $workbook->worksheets ) {

                print "Reading worksheet named: " . $worksheet->get_name . "\n";

                while( 1 ){
                        my $cell = $worksheet->get_next_value;
                        print "Cell is: $cell\n";
                        last if $cell eq 'EOF';
                }
        }

get_sheet_names

Definition: This method returns an array ref of all the sheet names (tabs) in the workbook in order. (It includes chartsheets.)

Accepts: nothing

Returns: an array ref of strings

worksheet_name( $position )

Definition: This returns the name of the worksheet in that $position. (counting from zero) interspersed chartsheets in the file are not considered to hold a position by this accounting.

Accepts: $position (an integer)

Returns: the worksheet name

Example: To return only worksheet positions 2 through 4 without building them all at once

        for $x (2..4){
                my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
                # Read the worksheet here
        }

get_worksheet_names

Definition: This method returns an array ref of all the worksheet names (tabs) in the workbook in order. (No chartsheets.)

Accepts: nothing

Returns: an array ref of strings

Example: Another way to parse a workbook without building all the sheets at once is;

        for $sheet_name ( @{$workbook->worksheet_names} ){
                my $worksheet = $workbook->worksheet( $sheet_name );
                # Read the worksheet here
        }

worksheet_count

Definition: This returns the total number of recorded worksheets

Accepts: nothing

Returns: $total - a count of all worksheets (only)

Attributes

Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For general information on attributes see Moose::Manual::Attributes. For additional lesser used workbook options see Secondary Methods. There are several grouped default values for these attributes documented in the Flags section.

Example

        $workbook_instance = Spreadsheet::Reader::ExcelXML->new( %attributes )

note: if the file information is not included in the initial %attributes then it must be set by one of the attribute setter methods below or the parse method before the rest of the package can be used.

file

Definition: This attribute holds the file handle for the top level workbook. If a file name is passed it is coerced into an IO::File handle and stored that way. The originaly file name can be retrieved with the method file_name.

Default no default

Required: yes

Range any unencrypted xlsx|xlsm|xml file that can be opened in Microsoft Excel 2007+.

attribute methods Methods provided to adjust this attribute

set_file( $file|$file_handle )

Definition: change the file value in the attribute (this will reboot the workbook instance)

error_inst

Definition: This attribute holds an 'error' object instance. It should have several methods for managing errors. Currently no error codes or error language translation options are available but this should make implementation of that easier.

Default: a Spreadsheet::Reader::ExcelXML::Error instance with the attributes set as;

        ( should_warn => 0 )

Range: See the 'Exported methods' section below for methods required by the workbook. The error instance must also be able to extract the error string from a passed error object as well. For now the current implementation will attempt ->as_string first and then ->message if an object is passed.

attribute methods Methods provided to manage this attribute

get_error_inst

Definition: returns this instance

has_error_inst

Definition: indicates in the error instance has been set

Exported methods:

The following methods are exported (delegated) to the workbook level from the stored instance of this class. Links are provided to the default implemenation;

"error" in Spreadsheet::Reader::ExcelXML::Error

"set_error" in Spreadsheet::Reader::ExcelXML::Error

"clear_error" in Spreadsheet::Reader::ExcelXML::Error

"set_warnings" in Spreadsheet::Reader::ExcelXML::Error

"if_warn" in Spreadsheet::Reader::ExcelXML::Error

"should_spew_longmess" in Spreadsheet::Reader::ExcelXML::Error

"spewing_longmess" in Spreadsheet::Reader::ExcelXML::Error

"has_error" in Spreadsheet::Reader::ExcelXML::Error

formatter_inst

Definition: This attribute holds a 'formatter' object instance. This instance does all the heavy lifting to transform raw text into desired output. It does include a role that interprets the excel format string into a Type::Tiny coercion. The default case is actually built from a number of different elements using MooseX::ShortCut::BuildInstance on the fly so you can just call out the replacement base class or role rather than fully building the formatter prior to calling new on the workbook. However the naming of the interface |http://www.cs.utah.edu/~germain/PPS/Topics/interfaces.html> is locked and should not be tampered with since it manages the methods to be imported into the workbook;

Default An instance built with MooseX::ShortCut::BuildInstance from the following arguments (note the instance itself is not built here) { superclasses => ['Spreadsheet::Reader::ExcelXML::FmtDefault'], # base class add_roles_in_sequence =>[qw( Spreadsheet::Reader::ExcelXML::ParseExcelFormatStrings # role containing the heavy lifting methods Spreadsheet::Reader::ExcelXML::FormatInterface # the interface )], package => 'FormatInstance', # a formality more than anything }

Range: A replacement formatter instance or a set of arguments that will lead to building an acceptable formatter instance. See the 'Exported methods'section below for all methods required methods for the workbook. The FormatInterface is required by name so a replacement of that role requires the same name.

attribute methods Methods provided to manage this attribute

get_formatter_inst

Definition: returns the stored formatter instance

set_formatter_inst

Definition: sets the formatter instance

Exported methods:

Additionally the following methods are exported (delegated) to the workbook level from the stored instance of this class. Links are provided to the default implemenation;

Example: name_the_workbook_uses_to_access_the_method => Link to the default source of the method

get_formatter_region => "get_excel_region" in Spreadsheet::Reader::Format::FmtDefault

has_target_encoding => "has_target_encoding" in Spreadsheet::Reader::Format::FmtDefault

get_target_encoding => "get_target_encoding" in Spreadsheet::Reader::Format::FmtDefault

set_target_encoding => "set_target_encoding( $encoding )" in Spreadsheet::Reader::Format::FmtDefault

change_output_encoding => "change_output_encoding( $string )" in Spreadsheet::Reader::Format::FmtDefault

set_defined_excel_formats => "set_defined_excel_formats( %args )" in Spreadsheet::Reader::Format::FmtDefault

get_defined_conversion => "get_defined_conversion( $position )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings

parse_excel_format_string => "parse_excel_format_string( $string, $name )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings

set_date_behavior => "set_date_behavior( $bool )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings

set_european_first => "set_european_first( $bool )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings

set_formatter_cache_behavior => "set_cache_behavior( $bool )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings

set_workbook_for_formatter => "set_workbook_inst( $instance )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings

count_from_zero

Definition: Excel spreadsheets count from 1. Spreadsheet::ParseExcel counts from zero. This allows you to choose either way.

Default 1

Range 1 = counting from zero like Spreadsheet::ParseExcel, 0 = Counting from 1 like Excel

attribute methods Methods provided to adjust this attribute

counting_from_zero

Definition: a way to check the current attribute setting

file_boundary_flags

Definition: When you request data to the right of the last column or below the last row of the data this package can return 'EOR' or 'EOF' to indicate that state. This is especially helpful in 'while' loops. The other option is to return 'undef'. This is problematic if some cells in your table are empty which also returns undef. The determination for what constitues the last column and row is selected with the attributes empty_is_end, values_only , and spaces_are_empty.

Default 1

Range 1 = return 'EOR' or 'EOF' flags as appropriate, 0 = return undef when requesting a position that is out of bounds

attribute methods Methods provided to adjust this attribute

boundary_flag_setting

Definition: a way to check the current attribute setting

empty_is_end

Definition: The excel convention is to read the table left to right and top to bottom. Some tables have an uneven number of columns with real data from row to row. This allows the several methods that excersize a 'next' function to wrap after the last element with data rather than going to the max column. This also can combine with the attribute file_boundary_flags to trigger 'EOR' flags after the last data element and before the sheet max column when not implementing 'next' functionality. It will also return 'EOF' if the remaining rows are empty even if the max row is farther on.

Default 0

Range 0 = treat all columns short of the max column for the sheet as being in the table, 1 = treat all cells after the last cell with data as past the end of the row. This will be most visible when boundary flags are turned on or next functionality is used in the context of the attribute values_only.

attribute methods Methods provided to adjust this attribute

is_empty_the_end

Definition: a way to check the current attribute setting

values_only

Definition: Excel will store information about a cell even if it only contains formatting data. In many cases you only want to see cells that actually have values. This attribute will change the package behaviour regarding cells that have formatting stored against that cell but no actual value. If values in the cells exist as zero length strings or spaces only you can also set those to empty with the attribute spaces_are_empty.

Default 0

Range 1 = return 'undef' for cells with formatting only, 0 = return the result of empty_return_type (or cell objects) for cells that only contain formatting.

attribute methods Methods provided to adjust this attribute

get_values_only

Definition: a way to check the current attribute setting

from_the_edge

Definition: Some data tables start in the top left corner. Others do not. I don't reccomend that practice but when aquiring data in the wild it is often good to adapt. This attribute sets whether the file percieves the min_col and min_row as the top left edge of the sheeto or from the top row with data and starting from the leftmost column with data.

Default 1

Range 1 = treat the top left corner of the sheet as the beginning of rows and columns even if there is no data in the top row or leftmost column, 0 = Set the minimum row and minimum columns to be the first row and first column with data

attribute methods Methods provided to adjust this attribute

starts_at_the_edge

Definition: returns the attribute state

cache_positions

Definition: Using the standard architecture this parser would go back and read the sharedStrings and styles files sequentially from the beginning each time it had to access a sub elelement. This trade-off is generally not desired for these two files since the data is generally stored in a less than sequential fasion. The solution is to cache these files as they are read the first time so that a second pass through is not necessary to retreive an earlier element. The only time this doesn't make sence is if either of the files would overwhelm RAM if cached. The package has file size break points below which the files will cache. The thinking is that above these points the RAM is at risk of being overwhelmed and that not crashing and slow is better than a possible out-of-memory state. This attribute allows you to change those break points based on the target machine you are running on. The breaks are set on the byte size of the sub file not on the cached expansion of the sub file. In general the styles file is cached into a hash and the shared strings file is cached into an array ref. The attribute group_return_type also affects the size of the cache for the sharedStrings file since it will not cache the string formats unless the attribute is set to 'instance'. There is also a setting for caching worksheet data. Some worksheet row position settings will always be cached in order to speed up multiple reads over the same sheet or to query meta data about the rows. However, this cache level is set lower since the row caching creates much deeper data structures.

Default

        {
                shared_strings_interface => 5242880,# 5 MB
                styles_interface => 5242880,# 5 MB
                worksheet_interface => 2097152,# 2 MB
        }

attribute methods Methods provided to adjust this attribute

cache_positions

Definition: returns the full attribute settings as a hashref

get_cache_size( (shared_strings_interface|styles_interface|worksheet_interface) )

Definition: return the max file size allowed to cache for the indicated interface

set_cache_size( $target_interface = $max_file_size )>

Definition: set the $max_file_size in bytes to be cached for the indicated $target_interface

has_cache_size( $target_interface )

Definition: returns true if the $target_interface has a cache size set

show_sub_file_size

Definition: Especially for zip (xlsx and xlsm) files you may not know how big the file is and want to the package to tell you what size it thinks the file is. This attribute turns on a warning statment that prints to STDERR with information on the size of potientially cached files.

Default 0

Range 0 = don't warn the file size, 1 = send the potentially cached file sizes to STDERR for review

group_return_type

Definition: Traditionally ParseExcel returns a cell object with lots of methods to reveal information about the cell. In reality the extra information is not used very much (witness the popularity of Spreadsheet::XLSX). Because many users don't need or want the extra cell formatting information it is possible to get either the raw xml value, the raw visible cell value (seen in the Excel format bar), or the formatted cell value returned either the way the Excel file specified or the way you specify instead of a Cell instance with all the data. All empty cells return undef no matter what.

Default instance

Range instance = returns a populated Spreadsheet::Reader::ExcelXML::Cell instance, xml_value = returns the string stored in the xml file - for xml based sheets this can sometimes be different thant the visible value in the cell or formula bar. unformatted = returns just the raw visible value of the cell shown in the Excel formula bar, value = returns just the formatted value stored in the excel cell

attribute methods Methods provided to adjust this attribute

get_group_return_type

Definition: a way to check the current attribute setting

empty_return_type

Definition: Traditionally Spreadsheet::ParseExcel returns an empty string for cells with unique formatting but no stored value. It may be that the more accurate way of returning undef works better for you. This will turn that behaviour on.

Default empty_string

Range empty_string = populates the unformatted value with '' even if it is set to undef undef_string = if excel stores undef for an unformatted value it will return undef

attribute methods Methods provided to adjust this attribute

get_empty_return_type

Definition: a way to check the current attribute setting

spread_merged_values

Definition: In Excel you visibly see the value of the primary cell in a merged range displayed in all the cells. This attribute lets the code see the primary value show in each of the merged cells. There is some mandatory caching to pull this off so it will consume more memory.

Default 0 (To match the Excel formula bar, VBscript, and Spreadsheet::ParseExcel)

Range 0 = don't spread the primary value, 1 = spread the primary value

attribute methods Methods provided to adjust this attribute

spreading_merged_values

Definition: a way to check the current attribute setting

skip_hidden

Definition: Like the previous attribute this attempts to match a visual effect in Excel. Even though hidden cells still contain values you can't see them visibly. This allows you to skip hidden rows and columns (not hidden sheets). The one gotcha is Excel will place the primary value in the new primary merged cell (formula bar) if a merge range is only partially obscured to include the original primary cell. This package can't do that. Either spread the primary to all cells or none.

Default 0 (To match VBscript and Spreadsheet::ParseExcel)

Range 0 = don't skip hidden rows and columns, 1 = skip hidden rows and columns

attribute methods Methods provided to adjust this attribute

should_skip_hidden

Definition: a way to check the current attribute setting

spaces_are_empty

Definition: Some auto file generators tend to add empty strings or strings with spaces to fill empty cells. There may be some visual value in this but they can slow down parsing scripts. this attribute allows the sheet to treat spaces as empty or undef instead of cells with values.

Default 0 (To match Excel and Spreadsheet::ParseExcel)

Range 0 = cells with zero length strings and spaces are considered to have 'values", 1 = There must be something other than spaces or a zero length string for the cell to have value.

attribute methods Methods provided to adjust this attribute

are_spaces_empty

Definition: a way to check the current attribute setting

merge_data

Definition: For zip based worksheets the merge data is stored at the end of the file. In order for the parser to arrive at that point it has to read through the whole sheet first. For big worksheet files this is very slow. If you are willing to not know or implement cell merge information then turn this off and the sheet should load much faster.

Default 1 (collect merge data)

Range 1 = The merge data is parsed from the worksheet file when it is opened, 0 = No merge data is parsed. The effect is equal to the cell merges dissapearing.

attribute methods Methods provided to adjust this attribute

collecting_merge_data

Definition: a way to check the current attribute setting

FLAGS

The parameter list (Attributes) that are possible to pass to ->new is somewhat long. Therefore you may want a shortcut that aggregates some set of attribute settings that are not the defaults but wind up being boilerplate. I have provided possible alternate sets like this and am open to providing others that are suggested. The flags will have a : in front of the identifier and will be passed to the class in the 'use' statement for consumption by the import method. The flags can be stacked and where there is conflict between the flag settings the rightmost passed flag setting is used. If everything in the flag but one or two settings are desirable still use the flag and then overwrite those settings when calling new.

Example;

        use Spreadsheet::Reader::ExcelXML v0.2 qw( :alt_default :debug );

:alt_default

This is intended for a deep look at data and skip formatting cells.

Default attribute differences

values_only => 1

count_from_zero => 0

empty_is_end => 1

:just_the_data

This is intended for a shallow look at data with value formatting implemented

Default attribute differences

count_from_zero => 0

values_only => 1

empty_is_end => 1

group_return_type => 'value'

from_the_edge => 0

empty_return_type => 'undef_string'

spaces_are_empty => 1

merge_data => 0

column_formats => 0

:just_raw_data

This is intended for a shallow look at raw text and skips all formatting including number formats.

Default attribute differences

count_from_zero => 0

values_only => 1

empty_is_end => 1

group_return_type => 'xml_value'

from_the_edge => 0,

empty_return_type => 'undef_string'

spaces_are_empty => 1

merge_data => 0

column_formats => 0

:like_ParseExcel

This is a way to force some of the other groups back to instance and count from zero

Default attribute differences

count_from_zero => 1

group_return_type => 'instance'

:debug

This is a way to turn on as much reporting as possible

Default attribute differences

error_inst ->

        error_inst =>{
                superclasses => ['Spreadsheet::Reader::ExcelXML::Error'],
                package => 'ErrorInstance',
                should_warn => 1,
        }

show_sub_file_size => 1

:lots_of_ram

This opens the caching size allowances way up

Default attribute differences

cache_positions ->

        cache_positions =>{
                shared_strings_interface => 209715200,# 200 MB
                styles_interface => 209715200,# 200 MB
                worksheet_interface => 209715200,# 200 MB
        },

:less_ram

This tightens caching size allowances way down

Default attribute differences

cache_positions ->

        cache_positions =>{
                shared_strings_interface => 10240,# 10 KB
                styles_interface => 10240,# 10 KB
                worksheet_interface => 1024,# 1 KB
        },

Secondary Methods

These are additional ways to use this class. They can be used to open an .xlsx workbook. They are also ways to investigate information at the workbook level. For information on how to retrieve data from the worksheets see the Worksheet and Cell documentation. For additional workbook options see the Secondary Methods and the Attributes sections. The attributes section specifically contains all the methods used to adjust the attributes of this class.

All methods are object methods and should be implemented on the object instance.

Example:

        my @worksheet_array = $workbook_instance->worksheets;

get_epoch_year

Definition: This returns the epoch year defined by the Excel workbook. The epoch year affects the way dates are processed in the formatter Spreadsheet::Reader::Format

Accepts: nothing

Returns: 1900 = Windows Excel or 1904 = Apple Excel

has_epoch_year

Definition: This indicates if an epoch_year has been determined for the workbook (yet)

Accepts: nothing

Returns: 1 = yes there is one, 0 = nothing (yet)

get_sheet_name( $Int )

Definition: This method returns the sheet name for a given physical position in the workbook from left to right. It counts from zero even if the workbook is in 'count_from_one' mode. B(It will return chart names but chart tab names cannot currently be converted to worksheets). You may actually want worksheet_name instead of this function.

Accepts: integers

Returns: the sheet name (both worksheet and chartsheet )

sheet_count

Definition: This returns the total number of recorded sheets

Accepts: nothing

Returns: $total - a count of all sheets (including chartsheets and worksheets)

get_sheet_info( $name )

Definition: This returns any stored metadata about the sheet in a hashref

Accepts: $name

Returns: a hashref of sheet metadata ( a pretty thin list still )

get_rel_info( $relId )

Definition: This returns the sheet name for the $relId

Accepts: $relId ex; 'rId5'

Returns: The sheet $name associated with that relId

get_id_info( $Id )

Definition: This returns the sheet name for the $Id

Accepts: $Id (an integer) ex; '2'

Returns: The sheet $name associated with that Id

get_chartsheet_names

Definition: This method returns an array ref of all the chartsheet names (tabs) in the workbook in order. (No worksheets.)

Accepts: nothing

Returns: an array ref of strings

chartsheet_name( $position )

Definition: This returns the name of the chartsheet in that $position. (counting from zero) interspersed worksheets in the file are not considered to hold a position by this accounting.

Accepts: $position (an integer)

Returns: the chartsheet name

chartsheet_count

Definition: This returns the total number of recorded chartsheets

Accepts: nothing

Returns: $total - a count of all chartsheets (only)

creator

Definition: Returns the recorded creator of the file from the parsed metadata

Accepts: nothing

Returns: a string

modified_by

Definition: Returns the recorded last entity to modify the file from the parsed metadata

Accepts: nothing

Returns: a string

date_created

Definition: Returns the date that Excel recorded for the file creation

Accepts: nothing

Returns: a string (YYYY-MM-DD)

date_modified

Definition: Returns the date that Excel recorded for the last file modification

Accepts: nothing

Returns: a string (YYYY-MM-DD)

in_the_list

Definition: This is a predicate method that indicates if the 'next' worksheet function has been implemented at least once.

Accepts:nothing

Returns: true = 1, false = 0

start_at_the_beginning

Definition: This restarts the 'next' worksheet at the first worksheet. This method is only useful in the context of the worksheet function.

Accepts: nothing

Returns: nothing

Architecture Choices

This is yet another package for parsing Excel xml or 2007+ (and 2003+ xml) workbooks. There are two other options for 2007+ XLSX parsing (but not 2003 xml parsing) on CPAN. (Spreadsheet::ParseXLSX and Spreadsheet::XLSX) In general if either of them already work for you without issue then there is probably no compelling reason to switch to this package. However, the goals of this package which may provide differentiation are five fold. First, as close as possible produce the same output as is visible in an excel spreadsheet with exposure to underlying settings from Excel. Second, adhere as close as is reasonable to the Spreadsheet::ParseExcel API (where it doesn't conflict with the first objective) so that less work would be needed to integrate ParseExcel and this package. An addendum to the second goal is this package will not expose elements of the object hash for use by the consuming program. This package will either return an unblessed hash with the equivalent elements to the Spreadsheet::ParseExcel output (instead of a class instance) or it will provide methods to provide these sets of data. The third goal is to read the excel files in a 'just in time' manner without storing all the data in memory. The intent is to minimize the footprint of large file reads. Initially I did this using XML::LibXML but it eventually proved to not play well with Moose ( or perl? ) garbage collection so this package uses a pure perl xml parser. In general this means that design decisions will generally sacrifice speed to keep RAM consumption low. Since the data in the sheet is parsed just in time the information that is not contained in the primary meta-data headers will not be available for review until the sheet parses to that point. In cases where the parser has made choices that prioritize speed over RAM savings there will generally be an attribute available to turn that decision off. Fourth, Excel files get abused in the wild. In general the Microsoft (TM) Excel application handles these mangled files gracefully. The goal is to be able to read any xml based spreadsheet Excel can read from the supported extention list. Finally, this parser supports the Excel 2003 xml format. All in all this package solves many of the issues I found parsing Excel in the wild. I hope it solves some of yours as well.

Warnings

1.This package uses Archive::Zip. Not all versions of Archive::Zip work for everyone. I have tested this with Archive::Zip 1.30. Please let me know if this does not work with a sucessfully installed (read passed the full test suit) version of Archive::Zip newer than that.

2. Not all workbook sheets (tabs) are created equal! Some Excel sheet tabs are only a chart. These tabs are 'chartsheets'. The methods with 'worksheet' in the name only act on the sub set of tabs that are worksheets. Future methods with 'chartsheet' in the name will focus on the subset of sheets that are chartsheets. Methods with just 'sheet' in the name have the potential to act on both. The documentation for the chartsheet level class is found in Spreadsheet::Reader::ExcelXML::Chartsheet (still under construction). All chartsheet classes do not provide access to cells.

3. This package supports reading xlsm files (Macro enabled Excel 2007+ workbooks). xlsm files allow for binaries to be embedded that may contain malicious code. However, other than unzipping the excel file no work is done by this package with the sub-file 'vbaProject.bin' containing the binaries. This package does not provide an API to that sub-file and I have no intention of doing so. Therefore my research indicates there should be no risk of virus activation while parsing even an infected xlsm file with this package but I encourage you to use your own judgement in this area. caveat utilitor!

4. This package will read some files with 'broken' xml. In general this should be transparent but in the case of the maximum row value and the maximum column value for a worksheet it can cause some surprising problems. This includes the possibility that the maximum values are initially stored as 'undef' if the sheet does not provide them in the metadata as expected. These values are generally never available in Excel 2003 xml files. The answer to the methods "row_range" in Spreadsheet::Reader::ExcelXML::Worksheet and "col_range" in Spreadsheet::Reader::ExcelXML::Worksheet will then change as more of the sheet is parsed. You can use the attribute file_boundary_flags or the methods "get_next_value" in Spreadsheet::Reader::ExcelXML::Worksheet or "fetchrow_arrayref" in Spreadsheet::Reader::ExcelXML::Worksheet as alternates to pre-testing for boundaries when iterating. The primary cause of these broken XML elements in Excel 2007+ files are non-XML applications writing to or editing the underlying xml. If you have an example of other broken xml files readable by the Excel application that are not parsable by this package please submit them to my github repo so I can work to improve this package. If you don't want your test case included with the distribution I will use it to improve the package without publishing it.

5. I reserve the right to tweak the sub file caching breakpoints over the next few releases. The goal is to have a default that appears to be the best compromise by 2017-1-1.

6. This package provides support for SpreadsheetML (Excel 2003) .xml extention documents. These files should include the header;

        <?mso-application progid="Excel.Sheet"?>

to indicate their intended format. Please submit any cases that appear to behave differently than expected for .xml extention files that are readable by the Excel application. I am also interested in cases where an out of memory error occurs with an .xml extension file. This warning will stay till 2017-1-1.

7. This package uses two classes at the top to handle cleanup for some self referential object organization that I use. As a result the action taken on this package is (mostly) implemented in Spreadsheet::Reader::ExcelXML::Workbook code. I documented most of that code API here. If you want to look at the raw code go there.

BUILD / INSTALL from Source

0. Using cpanm is much easier than a source build!

        cpanm Spreadsheet::Reader::ExcelXML

And then if you feel kindly App::cpanminus::reporter

        cpanm-reporter

1. Download a compressed file with this package code from your favorite source

github

Meta::CPAN

CPAN

2. Extract the code from the compressed file.

If you are using tar on a .tar.gz file this should work:

        tar -zxvf Spreadsheet-Reader-ExcelXML-v0.xx.tar.gz

3. Change (cd) into the extracted directory

4. Run the following

(for Windows find what version of make was used to compile your perl)

        perl  -V:make

(then for Windows substitute the correct make function (s/make/dmake/g)? below)

        perl Makefile.PL

        make

        make test

        make install # As sudo/root

        make clean

SUPPORT

github Spreadsheet::Reader::Format/issues

TODO

1. Write a chartsheet parser and functions

2. Add a pivot table reader (Not just read the values from the sheet)

3. Add calc chain methods

4. Add more exposure to workbook/worksheet formatting values

AUTHOR

Jed Lund

jandrew@cpan.org

CONTRIBUTORS

This is the (likely incomplete) list of people who have helped make this distribution what it is, either via code contributions, patches, bug reports, help with troubleshooting, etc. A huge 'thank you' to all of them. Most were contributors to Spreadsheet::XLSX::Reader::LibXML but the contributions have (hopefully) not been lost.