NAME

Spreadsheet::Reader::ExcelXML::XMLReader - A minimal pure-perl xml reader class

SYNOPSIS

        package MyPackage;
        use MooseX::StrictConstructor;
        use MooseX::HasDefaults::RO;
        # You have to 'use' or build the Workbook here or the XMLReader won't load
        #  -> because the reader uses a regex to scrape imported methods
        use Spreadsheet::Reader::ExcelXML::Workbook;
        extends 'Spreadsheet::Reader::ExcelXML::XMLReader';

DESCRIPTION

This documentation explains how to use this module when writing your own Excel spreadsheet parser. I suppose the class could be used more generally, but that's not why I wrote it, and for now I have no intention of providing a full xml toolbox. For Excel spreadsheet parsing in general, please start with the top level documentation for Workbooks, Worksheets, and Cells.

This class is meant to be used as the base reading class for specific types of xml files. The reader for each specific file type will include roles that are useful for that file's content. When the file first loads, the class stores the available information from the header (?) nodes and moves to the first file node. At that point it checks whether any of the consuming roles have a method '_load_unique_bits'. If so, it calls that method so the role can collect additional metadata.

This class processes the xml file in a just-in-time fashion, holding only enough information to know the current level and which open nodes are not yet closed. The intent is to use as little RAM as possible and to process the file in the most computationally efficient way possible in pure perl. I welcome all suggestions for improvement.
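As a rough illustration of the just-in-time idea, the sketch below reads an in-memory xml string one tag at a time and keeps only a list of the nodes that are still open. It is not this class's implementation; the sample xml and the variable names ($xml, @open_nodes, $deepest) are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample xml held in memory in place of a real file
my $xml = '<sst><si><t>Hello</t></si><si><r val="2"/></si></sst>';
open( my $fh, '<', \$xml ) or die "Cannot open in-memory handle: $!";

my @open_nodes;      # names of nodes opened but not yet closed
my $deepest = 0;     # the deepest level seen, for demonstration only
{
    local $/ = '<';  # each read ends at the start of the next tag
    while ( defined( my $chunk = <$fh> ) ) {
        chomp $chunk;                         # drop the trailing '<'
        next unless $chunk =~ /^([^>]+)>/;    # skip chunks without a tag
        my $tag = $1;
        next if $tag =~ /^[?!]/;              # header and doctype nodes
        if ( $tag =~ m{^/} ) {                # closing tag
            pop @open_nodes;
        }
        elsif ( $tag !~ m{/$} ) {             # opening tag (not self closing)
            my ($name) = split /\s+/, $tag;
            push @open_nodes, $name;
        }
        $deepest = @open_nodes if @open_nodes > $deepest;
    }
}
close $fh;
print "deepest level: $deepest, still open: ", scalar(@open_nodes), "\n";
```

Only the names of open nodes are held at any time, so memory use follows the nesting depth rather than the file size.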

Attributes

Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For general information on attributes see Moose::Manual::Attributes. For ways to manage the instance after it is opened see the Methods.

file

Definition: This attribute holds the file handle for the file being read. If the full file name and path is passed to the attribute the class will coerce that into an IO::File file handle.

Default: no default - this must be provided to read a file

Required: yes

Range: any unencrypted xml file name and path or IO::File file handle set to read.

attribute methods: Methods provided to adjust this attribute

set_file

Definition: change the file value in the attribute (this will reboot the file instance and lock the file)

get_file

Definition: Returns the file handle of the file even if a file name was passed

has_file

Definition: this is used to see if the file loaded correctly.

clear_file

Definition: this clears (and unlocks) the file handle

Delegated Methods

close

closes the file handle

seek

allows seek commands to be passed to the file handle

getline

returns the next line of the file handle with '<' set as the input_record_separator ($/)

workbook_inst

Definition: This attribute holds a reference to the top level workbook (parser). The purpose is to use some of the methods provided there.

Default: no default

Required: not strictly for this class, but the attribute is provided to give self referential access to general workbook settings and methods for composed classes that inherit this as a base class.

Range: isa => 'Spreadsheet::Reader::ExcelXML::Workbook'

attribute methods: Methods provided to adjust this attribute

set_workbook_inst

set the attribute with a workbook instance

Delegated Methods (required): Methods delegated to this module by the attribute. All methods are delegated with the method name unchanged. Follow the link to review the provider's documentation for each method. As you can see, several are delegated through the Workbook level and don't originate there.

"get_group_return_type" in Spreadsheet::Reader::ExcelXML

"counting_from_zero" in Spreadsheet::Reader::ExcelXML

"are_spaces_empty" in Spreadsheet::Reader::ExcelXML

"has_shared_strings_interface" in Spreadsheet::Reader::ExcelXML

"should_skip_hidden" in Spreadsheet::Reader::ExcelXML

"spreading_merged_values" in Spreadsheet::Reader::ExcelXML

"starts_at_the_edge" in Spreadsheet::Reader::ExcelXML

"get_empty_return_type" in Spreadsheet::Reader::ExcelXML

"get_values_only" in Spreadsheet::Reader::ExcelXML

"get_epoch_year" in Spreadsheet::Reader::ExcelXML

"get_error_inst" in Spreadsheet::Reader::ExcelXML

"has_styles_interface" in Spreadsheet::Reader::ExcelXML

"boundary_flag_setting" in Spreadsheet::Reader::ExcelXML

"is_empty_the_end" in Spreadsheet::Reader::ExcelXML

"get_rel_info" in Spreadsheet::Reader::ExcelXML

"get_sheet_info" in Spreadsheet::Reader::ExcelXML

"get_sheet_names" in Spreadsheet::Reader::ExcelXML

"collecting_merge_data" in Spreadsheet::Reader::ExcelXML

"collecting_column_formats" in Spreadsheet::Reader::ExcelXML

"set_error( $error_string )" in Spreadsheet::Reader::ExcelXML::Error

"get_defined_conversion( $position )" in Spreadsheet::Reader::Format

"set_defined_excel_formats( %args )" in Spreadsheet::Reader::Format

"parse_excel_format_string( $string, $name )" in Spreadsheet::Reader::Format

"change_output_encoding( $string )" in Spreadsheet::Reader::Format

"get_shared_string( $positive_int|$name )" in Spreadsheet::Reader::ExcelXML::SharedStrings

"get_format( ($position|$name), [$header], [$exclude_header] )" in Spreadsheet::Reader::ExcelXML::Styles

xml_version

Definition: This stores the xml version read from the xml header. It is read when the file handle is first set in this sheet.

Default: no default - this is auto read from the header

Required: no

Range: xml versions

attribute methods: Methods provided to adjust this attribute

version

get the stored xml version

xml_encoding

Definition: This stores the data encoding of the xml file from the xml header. It is read when the file handle is first set in this sheet.

Default: no default - this is auto read from the header

Required: no

Range: valid xml file encoding

attribute methods: Methods provided to adjust this attribute

encoding

get the attribute value

has_encoding

predicate for the attribute value

xml_progid

Definition: This is an attribute found in a secondary xml header that is associated with Excel 2003 xml based files. The value can be tested to see if the file was intended to be compliant with that format.

Default: no default - this is auto read from the header

Required: no

Range: a string

attribute methods: Methods provided to adjust this attribute

progid

get the attribute value

has_progid

predicate for the attribute value

xml_header

Definition: This stores the primary xml header string from the xml file. It is read when the file handle is first set in this sheet. It contains both the version and the encoding where available and is used when building subsets of the file as standalone xml.

Default: no default - this is auto read from the header

Required: no

Range: valid xml file header

attribute methods: Methods provided to adjust this attribute

get_header

get the attribute value

_set_xml_header

set the attribute value

xml_doctype

Definition: This stores the DOCTYPE indicated in the XML header !DOCTYPE

Default: no default - this is auto read from the header

Required: no

Range: whatever it finds

attribute methods: Methods provided to adjust this attribute

doctype

get the attribute value

has_doctype

predicate for the attribute

position_index

Definition: This attribute is available to facilitate other consuming roles and classes. Of this attribute's methods, only 'clear_location' is used in this class, during the start_the_file_over method. It can be used for tracking positions with the same node name.

Default: no default - this is mostly managed by the role or child class

Required: no

Range: Integer

attribute methods: Methods provided to adjust this attribute

where_am_i

get the attribute value

i_am_here

set the attribute value

clear_location

clear the attribute value

has_position

predicate for the attribute value

file_type

Definition: This is a static attribute that shows the file type

Default: xml

attribute methods: Methods provided to adjust this attribute

get_file_type

get the attribute value

stacking

Definition: A pure perl xml parser will in general be slower than the C equivalent. To reach a target destination faster you can turn off the stack trace, which skips building and storing the trace elements. This breaks things, so don't do it without a solid understanding of what is happening. For instance, if you turn this off and then call parse_element, that method will have to turn the stack trace back on by itself to build the element tree. The issue is that the most recent element at the base of the tree won't be available to build from. You will need to build it manually and push it to the stack. See the methods initial_node_build and add_node_to_stack to implement this.

Default: 1 = the stack trace is on

attribute methods: Methods provided to adjust this attribute

should_be_stacking

get the attribute value

change_stack_storage_to( $Bool )

Turn the stack trace(r) state to $Bool (1 = on)

Methods

These are the methods provided by this class.

start_the_file_over

Definition: Clears the position_index, the old stack trace, and kick starts stack trace tracking again. It then uses seek(0, 0) to reset the file handle to the beginning. Finally, it reads the file until it gets to the first non-xml header node.

Accepts: nothing

Returns: nothing

good_load( $state )

Definition: a setter method to indicate if the file loaded correctly. This generally should be set by consuming roles in the load_unique_bits phase.

Accepts: (1|0)

Returns: nothing

loaded_correctly

Definition: a getter method to understand if the file loaded correctly. This is generally used by consumers of the instance to see if there was any trouble during the initial build.

Accepts: nothing

Returns: 1 = good build, 0 = bad build

parse_element( [$depth] )

Definition: This will read and store the full node from the current position down to an optional $depth. When the parse is complete the parser will be positioned at the beginning of the next node. The node does not include the top name but will include attributes.

Accepts: $depth = optional

Returns: A perl hash reference where all nodes at a level are listed using three hashref keys: list_keys, list, and attributes. The 'attributes' key points to a hash reference containing that node's attributes. The 'list_keys' key points to an array reference with the node names for each node at the next level down. The 'list' key points to an array reference of nodes or node values matching the positions in list_keys. There are two special case exceptions to this. First, for text values the node is listed as { raw_text => 'text node content' }. Second, if the attributes only include a 'val' key, the node stores this under the 'val' key rather than under the 'attributes' key with a sub key 'val'.
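To make the layout described above concrete, here is a hand-built hash laid out according to those rules, followed by a loop that walks it. The xml in the comment and the structure are written by hand for illustration; they are not captured output from parse_element, and the exact keys present in real output may differ.

```perl
use strict;
use warnings;

# Hand-built stand-in for what parse_element might return for
#   <row r="1"><c r="A1"><v>42</v></c><c r="B1"><v>7</v></c></row>
# (hypothetical xml; the top node name 'row' is not part of the result)
my $node = {
    attributes => { r => '1' },
    list_keys  => [ 'c', 'c' ],
    list       => [
        {
            attributes => { r => 'A1' },
            list_keys  => ['v'],
            list       => [ { raw_text => '42' } ],
        },
        {
            attributes => { r => 'B1' },
            list_keys  => ['v'],
            list       => [ { raw_text => '7' } ],
        },
    ],
};

# list_keys and list are parallel arrays, so walk them by index
my @cells;
for my $i ( 0 .. $#{ $node->{list_keys} } ) {
    my $name  = $node->{list_keys}[$i];
    my $child = $node->{list}[$i];
    push @cells, "$name $child->{attributes}{r} = $child->{list}[0]{raw_text}";
}
print "$_\n" for @cells;
```

Keeping names in 'list_keys' and content in 'list' preserves both the order of the child nodes and any repeated names, which a plain hash keyed on node name could not do.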

advance_element_position( $element, [$iterations] )

Definition: This will move the xml file reader forward until it finds the identified named $element. If the reader is already at an element of that name it will index forward until it finds the next $element of that name. If the optional positive $iterations integer is passed it will index to the named $element - $iterations times.

Accepts: $element = a case sensitive xml node name found forward of the current position in the file. [$iterations] = optional a positive integer indicating how many times to index forward to the named $element.

Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )

$success = a boolean value indicating whether the desired goal was met

$node_name = the actual node name for the final position (should match $element if $success)

$node_level = the level of the final named node in the stack (not the sub text node)

$return_node_ref = when the stacking attribute is on, this returns the last elements in the stack that were displaced by the traverse of the xml tree. When stacking is off, this returns an array ref of values used as the second argument in initial_node_build.

next_sibling

Definition: This will move the xml file reader forward until it finds the next node at the same level as the current node within the same supernode. If this method finds a higher node prior to finding a node at the same level, it will return failure and stop reading.

Accepts: nothing

Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )

skip_siblings

Definition: This will move the xml file reader forward until it finds the next higher node. It will not stop on end nodes, so it will continue to pass all closed nodes until it comes to the first open or self contained node above the current node.

Accepts: nothing

Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )

current_named_node

Definition: When processing xml files in a just-in-time fashion there will be some ambiguity surrounding text nodes:

        <t>sometext</t>
        <s>
           <r val="2"/>

In the 't' node example the content between the '>' and '<' characters is intentional and valuable to the data set. In the 's' and 'r' node example the space between those characters is only intended for human readability. This parser cannot tell the value of the content after the 's' node '>' character until the 'r' node is read. At that point the 's' node will no longer be the 'current' position. To resolve this, all content other than '' between '>' and '<' is treated as a node until the next node is read. Because these nodes are ambiguous, the idea of a 'named node' is valuable, and knowing what the most recent named node is can be useful. This method returns either the last read node or, if the last node is a raw text node, the second to last node. In the first example it would return the 't' node and in the second example it would return the 's' node.

Accepts: nothing

Returns: a hash ref of information about the node containing the following keys;

        level => counting from 0 at the start of the file and moving up
        type => regular = xml named node|#text = node built from the contents between the > and < characters
        name => the xml node name (for #text nodes this is 'raw_text')
        closed => (closed|open) depending on the current tag state
        initial_string => The string inside the < > quotes prior to parsing
        [attributes] => all attributes and values will be stored under the attribute name
        [val] => special case storage of one attribute
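The selection rule described for this method can be sketched as follows. The @stack contents and the helper name last_named_node are hypothetical, and the node hashes carry only the 'type' and 'name' keys from the list above; the real node stack is internal to the class and may be organized differently.

```perl
use strict;
use warnings;

# Hypothetical node stack: the last entry is a raw text node
my @stack = (
    { type => 'regular', name => 'si' },
    { type => 'regular', name => 't' },
    { type => '#text',   name => 'raw_text' },
);

# Return the last node, or the one before it when the last is raw text
sub last_named_node {
    my @nodes = @_;
    return unless @nodes;
    my $last = $nodes[-1];
    return $last unless $last->{type} eq '#text';
    return @nodes > 1 ? $nodes[-2] : undef;
}

my $named = last_named_node(@stack);
print "current named node: $named->{name}\n";
```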

squash_node( $node )

Definition: This takes a $node from the parse_element output and turns it into a more perl-like reference. It checks the list_keys, and if there are any duplicates it takes the list values and uses them as elements of an array ref assigned to a hash key called list. If there are no duplicates in the list_keys, it turns the list_keys into hash keys with the list elements assigned as values. It then takes the attributes and mingles them into the hashref with the prior results. There are two special cases for node reorganization. For nodes with a 'val' in the 'list_keys', the element in the same position of the 'list' is returned as the whole ref. If there is a raw_text node, it is returned as a hashref with one key 'raw_text' with the text itself as the value. This is all done recursively, so lower layers are assigned to upper layers using the rules above.

Accepts: the output of a parse_element call

Returns: a perl data structure with the xml organization removed
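The rules in the definition can be approximated with a short recursive function. This is a simplified re-statement of the described behavior, not the module's code: the function name squash_sketch and the sample input are assumptions, and the 'val' special case is left out.

```perl
use strict;
use warnings;

# Simplified sketch of the squash rules (the 'val' special case is omitted)
sub squash_sketch {
    my ($node) = @_;
    return $node unless ref($node) eq 'HASH';

    # raw text nodes come back as a one key hashref
    return { raw_text => $node->{raw_text} } if exists $node->{raw_text};

    my @keys = @{ $node->{list_keys} || [] };
    my @list = map { squash_sketch($_) } @{ $node->{list} || [] };

    my ( %seen, %out );
    my $has_duplicates = grep { $seen{$_}++ } @keys;
    if ($has_duplicates) {
        $out{list} = \@list;      # repeated names: keep an ordered array
    }
    else {
        @out{@keys} = @list;      # unique names: use them as hash keys
    }

    # mingle the attributes into the same hashref
    my %attributes = %{ $node->{attributes} || {} };
    @out{ keys %attributes } = values %attributes;
    return \%out;
}

# Hand-built input laid out like parse_element output (hypothetical data)
my $parsed = {
    attributes => { count => 2 },
    list_keys  => [ 'si', 'si' ],
    list       => [
        { list_keys => ['t'], list => [ { raw_text => 'Hello' } ] },
        { list_keys => ['t'], list => [ { raw_text => 'World' } ] },
    ],
};
my $squashed = squash_sketch($parsed);
print "second string: $squashed->{list}[1]{t}{raw_text}\n";
```

In this sketch an attribute whose name matches a child node name would overwrite the child's entry; how the module handles such a collision is not covered here.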

extract_file( @node_list )

Definition: This will build an xml file and load it to an IO::Handle->new_tmpfile object. The xml is built from whole extracted xml strings defined by @node_list. If none of the node list elements is found in the parsed file, then the first listed element from the node list will be used to create an empty self closing node.

Accepts: @node_list = Node list items can either be xml node name strings or array refs composed of two elements, first the node name and second the iterated position. Ex.

        @node_list_example = ( 'r', [ 'si', 3 ] );

In this example the extracted file would contain the first 'r' node and the 3rd 'si' node. There is one exception case, where you want the whole file passed. To do this, pass 'ALL_FILE' as the first element of the @node_list and a complete copy of the file handle in read mode will be passed back.

Returns: a File::Temp file handle loaded with an xml header and the listed nodes.
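One way to read the two accepted item forms is to normalize them to name and position pairs, with a bare name meaning the first node of that name as in the example above. The sketch below shows only that interpretation; the variable @wanted is hypothetical and nothing here calls extract_file itself.

```perl
use strict;
use warnings;

my @node_list = ( 'r', [ 'si', 3 ] );

# A bare name is treated as position 1; an array ref carries its own position
my @wanted =
    map { ref($_) eq 'ARRAY' ? [ $_->[0], $_->[1] ] : [ $_, 1 ] } @node_list;

printf "%s node number %d\n", @$_ for @wanted;
```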

current_node_parsed

Definition: When nodes are read they are not completely processed to save cycles. If you want a fully processed result from the current node position including any embedded text then this is the method for you.

Accepts: Nothing

Returns: a perl ref equivalent to the squash_node call. This only returns the fully processed current_named_node and any sub text nodes.

close_the_file

Definition: The file (handle) may not be needed during the whole workbook parse. If so, you can use this method to close (and clear / release) an open file handle as appropriate.

Accepts: Nothing

Returns: Nothing (the file handle is closed and cleared)

not_end_of_file

Definition: This is a poor man's End Of File test (EOF). The reader builds a node stack to keep track of where it is in the xml parse, and when it runs out of nodes it means you are back at the top of the stack.

Accepts: Nothing

Returns: a count of the nodes in the node stack (header nodes are processed early on and are read and removed as part of startup)

initial_node_build( $node_name, $attribute_list_ref )

Definition: Generally this is an internal method and should not be used. However, in order to provide a faster forward ability the node stack trace(ing) can be turned off. When you want to turn it back on you have to manually build the top node using this method and store it to the node stack using add_node_to_stack. This method will build the essentials for adding to the node stack. Please note that it will not necessarily get the node level right. If you need that to be correct then don't turn off the stack trace. It will not build raw_text nodes correctly.

Accepts: $node_name = a string without spaces for the name of the node, $attribute_list_ref = This is basically everything else in the xml tag except the name split on /\s+/. Any self closing '/' should be removed prior to the split.

Returns: a node ref that can be added to the node stack to kickstart stack tracing
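The two arguments can be prepared from the text of a tag as described under Accepts. The sketch below covers only that preparation step; the $tag value is a hypothetical example, and the call to initial_node_build is shown as a comment because no reader instance is built here.

```perl
use strict;
use warnings;

# Hypothetical tag content, i.e. the text between '<' and '>'
my $tag = 'c r="A1" s="3" t="s"/';

( my $trimmed = $tag ) =~ s{\s*/\s*$}{};    # drop the self closing '/'
my ( $node_name, @attribute_list ) = split /\s+/, $trimmed;

print "name: $node_name\n";
print "attribute: $_\n" for @attribute_list;

# With a real reader instance (not built here) these would be passed as
#   $reader->initial_node_build( $node_name, \@attribute_list );
```

A plain split on whitespace will also break apart any attribute value that itself contains spaces, so this preparation is only safe for tags without such values.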

add_node_to_stack( $node_ref )

Definition: Generally this is an internal method and should not be used. However, in order to provide a faster forward ability the node stack trace(ing) can be turned off. When you want to turn it back on you have to manually build the top node and store it to the node stack using this method. Adding a node after the stack trace has been turned off will create a discontinuity where the new node is added. Stack trace operations above this node will generally fail and stop the script.

Accepts: $node_ref = a top node to push on the node stack for traceability

Returns: nothing

SUPPORT

github Spreadsheet::Reader::ExcelXML/issues

TODO

1. Nothing currently

AUTHOR

Jed Lund
jandrew@cpan.org

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

DEPENDENCIES

Spreadsheet::Reader::ExcelXML - the package
