Spreadsheet::Reader::ExcelXML::XMLReader - A minimal pure-perl xml reader class
    package MyPackage;
    use Moose;    # provides 'extends'
    use MooseX::StrictConstructor;
    use MooseX::HasDefaults::RO;
    # You have to 'use' or build the Workbook here or the XMLReader won't load
    # -> because the reader uses a regex to scrape imported methods
    use Spreadsheet::Reader::ExcelXML::Workbook;
    extends 'Spreadsheet::Reader::ExcelXML::XMLReader';
This documentation is written to explain ways to use this module when writing your own Excel spreadsheet parser. I suppose the class could be used more generally, but that's not why I wrote it, and for now I have no intention of providing a full xml toolbox. For Excel spreadsheet parsing generally, please start at the top level documentation: Workbooks, Worksheets, and Cells.
This class is meant to be used as the base reading class for specific types of xml files. The reader for those specific files will include roles that are useful for that file's content. When the file first loads it will store some available information from the header (?) nodes and move to the first file node. At that point it will check whether any of the consuming roles have a method '_load_unique_bits'. If so, it will call that method for additional metadata collection by that role.
This class will process the xml file in a just in time fashion, holding enough information to know the level and the open nodes not yet closed but nothing else. The intent is to use as little RAM as possible and process the file in the most (pure perl) computationally efficient way possible. I welcome all suggestions for improvement.
Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For general information on attributes see Moose::Manual::Attributes. For ways to manage the instance after it is opened see the Methods.
Definition: This attribute holds the file handle for the file being read. If the full file name and path is passed to the attribute the class will coerce that into an IO::File file handle.
Default: no default - this must be provided to read a file
Required: yes
Range: any unencrypted xml file name and path or IO::File file handle set to read.
attribute methods: methods provided to adjust this attribute
set_file
Definition: change the file value in the attribute (this will reboot the file instance and lock the file)
get_file
Definition: Returns the file handle of the file even if a file name was passed
has_file
Definition: this is used to see if the file loaded correctly.
clear_file
Definition: this clears (and unlocks) the file handle
Delegated Methods
close
closes the file handle
seek
allows seek commands to be passed to the file handle
getline
returns the next line of the file handle with '<' set as the input_record_separator ($/)
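The effect of reading with '<' as the input record separator can be tried out in plain Perl. The following is a minimal sketch, not this class's code: it assumes an in-memory file handle in place of the IO::File handle, and the sample xml string and the @chunks array are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: an in-memory handle stands in for the IO::File handle
my $xml = '<?xml version="1.0"?><sst><si><t>Hello</t></si></sst>';
open( my $fh, '<', \$xml ) or die "Cannot open in-memory handle: $!";

my @chunks;
{
    local $/ = '<';    # every read now stops at the next '<'
    while ( defined( my $chunk = <$fh> ) ) {
        chomp $chunk;              # chomp strips the trailing '<' because $/ is '<'
        next if $chunk eq '';      # nothing precedes the very first '<'
        push @chunks, $chunk;
    }
}
close $fh;

print "$_\n" for @chunks;
```

Each chunk holds one tag plus whatever text follows its '>', so a chunk such as 't>Hello' carries both the node name and its text content.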
Definition: This attribute holds a reference to the top level workbook (parser). The purpose is to use some of the methods provided there.
Default: no default
Required: not strictly for this class, but the attribute is provided to give self referential access to general workbook settings and methods for composed classes that inherit this as a base class.
Range: isa => 'Spreadsheet::Reader::ExcelXML::Workbook'
set_workbook_inst
set the attribute with a workbook instance
Delegated Methods (required): methods delegated to this module by the attribute. All methods are delegated with the method name unchanged. Follow the link to review documentation of the provider for each method. As you can see, several are delegated through the Workbook level and don't originate there.
"get_group_return_type" in Spreadsheet::Reader::ExcelXML
"counting_from_zero" in Spreadsheet::Reader::ExcelXML
"are_spaces_empty" in Spreadsheet::Reader::ExcelXML
"has_shared_strings_interface" in Spreadsheet::Reader::ExcelXML
"should_skip_hidden" in Spreadsheet::Reader::ExcelXML
"spreading_merged_values" in Spreadsheet::Reader::ExcelXML
"starts_at_the_edge" in Spreadsheet::Reader::ExcelXML
"get_empty_return_type" in Spreadsheet::Reader::ExcelXML
"get_values_only" in Spreadsheet::Reader::ExcelXML
"get_epoch_year" in Spreadsheet::Reader::ExcelXML
"get_error_inst" in Spreadsheet::Reader::ExcelXML
"has_styles_interface" in Spreadsheet::Reader::ExcelXML
"boundary_flag_setting" in Spreadsheet::Reader::ExcelXML
"is_empty_the_end" in Spreadsheet::Reader::ExcelXML
"get_rel_info" in Spreadsheet::Reader::ExcelXML
"get_sheet_info" in Spreadsheet::Reader::ExcelXML
"get_sheet_names" in Spreadsheet::Reader::ExcelXML
"collecting_merge_data" in Spreadsheet::Reader::ExcelXML
"collecting_column_formats" in Spreadsheet::Reader::ExcelXML
"set_error( $error_string )" in Spreadsheet::Reader::ExcelXML::Error
"get_defined_conversion( $position )" in Spreadsheet::Reader::Format
"set_defined_excel_formats( %args )" in Spreadsheet::Reader::Format
"parse_excel_format_string( $string, $name )" in Spreadsheet::Reader::Format
"change_output_encoding( $string )" in Spreadsheet::Reader::Format
"get_shared_string( $positive_int|$name )" in Spreadsheet::Reader::ExcelXML::SharedStrings
"get_format( ($position|$name), [$header], [$exclude_header] )" in Spreadsheet::Reader::ExcelXML::Styles
Definition: This stores the xml version read from the xml header. It is read when the file handle is first set in this sheet.
Default: no default - this is auto read from the header
Required: no
Range: xml versions
version
get the stored xml version
Definition: This stores the data encoding of the xml file from the xml header. It is read when the file handle is first set in this sheet.
Range: valid xml file encoding
encoding
get the attribute value
has_encoding
predicate for the attribute value
Definition: This is an attribute found in a secondary xml header that is associated with Excel 2003 xml based files. The value can be tested to see if the file was intended to be compliant with that format.
Range: a string
progid
has_progid
Definition: This stores the primary xml header string from the xml file. It is read when the file handle is first set in this sheet. It contains both the version and the encoding where available and is used when building subsets of the file as standalone xml.
Range: valid xml file header
get_header
_set_xml_header
set the attribute value
Definition: This stores the DOCTYPE indicated in the XML header !DOCTYPE
Range: whatever it finds
doctype
has_doctype
predicate for the attribute
Definition: This attribute is available to facilitate other consuming roles and classes. Of this attribute's methods, only the 'clear_location' method is used in this class, during the start_the_file_over method. It can be used for tracking positions with the same node name.
Default: no default - this is mostly managed by the role or child class
Range: Integer
where_am_i
i_am_here
clear_location
clear the attribute value
has_position
Definition: This is a static attribute that shows the file type
Default: xml
get_file_type
Definition: A pure perl xml parser will in general be slower than the C equivalent. To provide some acceleration on the way to a target destination you can turn off the stack trace, which includes the building and storing of the trace elements. This breaks things, so don't do it without a solid understanding of what is happening. For instance, if you turn this off and then call the method parse_element, the parse_element method will have to turn the stack trace back on on its own to build the element tree. The issue is that the most recent element at the base of the tree won't be available to build from. You will need to manually build it and push it to the stack. See the methods initial_node_build and add_node_to_stack to implement this.
Default: 1 = the stack trace is on
should_be_stacking
change_stack_storage_to( $Bool )
Turn the stack trace(r) state to $Bool (1 = on)
These are the methods provided by this class.
Definition: Clears the position_index, the old stack trace, and kick starts stack trace tracking again. It then uses seek(0, 0) to reset the file handle to the beginning. Finally, it reads the file until it gets to the first non-xml header node.
Accepts: nothing
Returns: nothing
Definition: a setter method to indicate if the file loaded correctly. This generally should be set by consuming roles in the load_unique_bits phase.
Accepts: (1|0)
Definition: a getter method to understand if the file loaded correctly. This is generally used by consumers of the instance to see if there was any trouble during the initial build.
Returns: 1 = good build, 0 = bad build
Definition: This will read and store the full node from the current position down to an optional $depth. When the parse is complete the parser will be positioned at the beginning of the next node. The node does not include the top name but will include attributes.
Accepts: $depth = optional
Returns: A perl hash reference where all nodes at a level are listed using three hashref keys: list_keys, list, and attributes. The 'attributes' key points to a hash reference containing that node's attributes. The 'list_keys' key points to an array reference with all the node names for each node at the next level down. The 'list' key points to an array reference of nodes or node values matching the position of the list_keys. There are two special case exceptions to this. First, for text values the node is listed as { raw_text => 'text node content' }. Second, if the attributes only include a 'val' key the node stores this under the 'val' key rather than the 'attributes' key with a sub key 'val'.
Definition: This will move the xml file reader forward until it finds the identified named $element. If the reader is already at an element of that name it will index forward until it finds the next $element of that name. If the optional positive $iterations integer is passed it will index to the named $element - $iterations times.
Accepts: $element = a case sensitive xml node name found forward of the current position in the file. [$iterations] = an optional positive integer indicating how many times to index forward to the named $element.
Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )
$success = a boolean value indicating whether the desired goal was met
$node_name = the actual node name for the final position (should match $element if $success)
$node_level = the level of the final named node in the stack (not the sub text node)
$return_node_ref = when the stacking attribute is on, this returns the last elements in the stack displaced by the traverse of the xml tree. When stacking is off, this returns an array ref of values used as the second argument in initial_node_build.
Definition: This will move the xml file reader forward until it finds the next node at the same level as the current node within the same supernode. If this method finds a higher node prior to finding a node at the same level it will return failure and stop reading.
Definition: This will move the xml file reader forward until it finds the next higher node. It will not stop on end nodes, so it will continue to pass all closed nodes until it comes to the first open or self contained node above the current node.
Definition: When processing xml files in a just in time fashion there will be some ambiguity surrounding text nodes:

    <t>sometext</t>

    <s>
        <r val="2"/>

In the 't' node example the content between the '>' character and the '<' character is intentional and valuable to the data set. In the 's' and 'r' node example the space between those characters is only intended for human readability. This parser will not be able to tell the value of the content after the 's' node '>' character until the 'r' node is read. At that point the 's' node will no longer be the 'current' position. To resolve this, all content other than '' between '>' and '<' is treated as a node until the next node is read. Because these nodes are ambiguous, the idea of a 'named node' is valuable and knowing what the most recent named node is can be useful. This method either returns the last read node or the second to last node if the last node is a raw text node. In the first example it would return the 't' node and in the second example it would return the 's' node.
Returns: a hash ref of information about the node containing the following keys;
level => counting from 0 at the start of the file and moving up
type => regular = xml named node | #text = node built from the contents between the > and < characters
name => the xml node name (for #text nodes this is 'raw_text')
closed => (closed|open) depending on the current tag state
initial_string => The string inside the < > quotes prior to parsing
[attributes] => all attributes and values will be stored under the attribute name
[val] => special case storage of one attribute
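The named node rule can be illustrated with a small stand-alone sketch. It is not the class's implementation: the helpers nodes_from_chunk and named_node are hypothetical, and the chunks are assumed to be what a '<'-delimited read would return for the two examples above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one '<'-delimited chunk into a named node and,
# when any content follows the '>', a provisional raw_text node.
sub nodes_from_chunk {
    my ($chunk) = @_;
    my ( $tag, $text ) = split />/, $chunk, 2;
    my @nodes = ( { type => 'regular', name => ( split /\s+/, $tag )[0] } );
    push @nodes, { type => '#text', name => 'raw_text', text => $text }
        if defined $text and length $text;
    return @nodes;
}

# Hypothetical helper: the last node read, or the one before it when the
# last node is a raw text node.
sub named_node {
    my (@seen) = @_;
    return $seen[-1]{type} eq '#text' ? $seen[-2] : $seen[-1];
}

my @first  = nodes_from_chunk('t>sometext');    # the read stops before '</t>'
my @second = nodes_from_chunk("s>\n    ");      # the read stops before '<r'

printf "%s\n", named_node(@first)->{name};      # the 't' node
printf "%s\n", named_node(@second)->{name};     # the 's' node
```

In both cases a provisional text node sits on top of the named node, and only the next read reveals whether the text was data or layout whitespace.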
Definition: This takes a $node from the parse_element output and turns it into a more perl-like reference. It checks the list_keys and if there are any duplicates it takes the list values and uses them as elements of an array ref assigned to a hash key called list. If there are no duplicates in the list_keys it turns the list_keys into hash keys with the list elements assigned as values. It then takes the attributes and mingles them in the hashref with the prior results. There are two special cases for a node reorganization. For nodes with a 'val' in the 'list_keys', the element in the same position of the 'list' is returned as the whole ref. If there is a raw_text node it is returned as a hashref with one key 'raw_text' with the text itself as the value. This is all done recursively, so lower layers are assigned to upper layers using the rules above.
Accepts: the output of a parse_element call
Returns: a perl data structure with the xml organization removed
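The two main reorganization rules can be sketched in plain Perl. This is an illustration under stated assumptions rather than the class's own code: squash_sketch is a hypothetical function, the 'val' special case is left out, and the $parsed structure is a hand-built guess at what a parse might return for a node holding two 't' children and one attribute.

```perl
use strict;
use warnings;

# Hypothetical sketch of the duplicate and no-duplicate rules only
sub squash_sketch {
    my ($node) = @_;
    return $node unless ref $node eq 'HASH';
    return $node if exists $node->{raw_text};    # raw text stays a one-key hashref

    my @keys   = @{ $node->{list_keys} || [] };
    my @values = map { squash_sketch($_) } @{ $node->{list} || [] };

    # Count how often each node name appears at this level
    my %count;
    $count{$_}++ for @keys;
    my $has_duplicates = grep { $_ > 1 } values %count;

    my %result;
    if ($has_duplicates) {
        $result{list} = \@values;     # repeated names collapse into one array ref
    }
    else {
        @result{@keys} = @values;     # unique names become hash keys
    }

    # Attributes are mixed into the same hashref
    %result = ( %result, %{ $node->{attributes} || {} } );
    return \%result;
}

# Assumed input shape, built by hand for the illustration
my $parsed = {
    attributes => { count => '2' },
    list_keys  => [ 't', 't' ],
    list       => [ { raw_text => 'Hello' }, { raw_text => 'World' } ],
};
my $squashed = squash_sketch($parsed);
```

Because 't' appears twice in list_keys, $squashed ends up with a 'list' key holding both text nodes and a 'count' key taken from the attributes.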
Definition: This will build an xml file and load it to an IO::Handle->new_tmpfile object. The xml is built on whole extracted xml strings defined by @node_list. If none of the node list elements is found in the parsed file then the first listed element from the node list will be used to create an empty self closing node.
Accepts: @node_list = Node list items can either be xml node name strings or array refs composed of two elements, first the node name and second the iterated position. Ex.
@node_list_example = ( 'r', [ 'si', 3 ] );
In this example the extracted file would contain the first 'r' node and the 3rd 'si' node. There is the exception case where you just want the whole file passed. The way out here is to pass 'ALL_FILE' as the first element of the @node_list and a complete copy of the file_handle in read mode will be passed.
Returns: a File::Temp file handle loaded with an xml header and the listed nodes.
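The two accepted forms of a node list item can be reduced to one shape before use. The snippet below is a hypothetical normalization step, not part of this class; the @wanted array and its name/position keys are invented for the illustration, and a bare name is taken to mean the first node of that name, as in the example above.

```perl
use strict;
use warnings;

my @node_list = ( 'r', [ 'si', 3 ] );

# Every item becomes a name/position pair; a bare name stands for position 1
my @wanted = map {
    ref($_) eq 'ARRAY'
        ? +{ name => $_->[0], position => $_->[1] }
        : +{ name => $_,      position => 1 }
} @node_list;

printf "%s node number %d\n", $_->{name}, $_->{position} for @wanted;
```

For the list above this reports the first 'r' node and the third 'si' node.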
Definition: When nodes are read they are not completely processed to save cycles. If you want a fully processed result from the current node position including any embedded text then this is the method for you.
Accepts: Nothing
Returns: a perl ref equivalent to the squash_node call. This only returns the fully processed current_named_node and any sub text nodes.
Definition: The file (handle) may not be needed during the whole workbook parse. If so, you can use this method to close (and clear / release) an open file handle as appropriate.
Returns: Nothing (the file handle is closed and cleared)
Definition: This is a poor man's End Of File test (EOF). The reader builds a node stack to keep track of where it is in the xml parse and when it runs out of nodes it means you are back at the top of the stack.
Returns: a count of the nodes in the node stack (header nodes are processed early on and are read and removed as part of startup)
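The idea of using the stack depth as an end of file test can be tried in isolation. This is a minimal sketch and not the class's implementation: it assumes an in-memory handle, a made-up xml string, and a simplified tag test that ignores comments and CDATA sections.

```perl
use strict;
use warnings;

my $xml = '<?xml version="1.0"?><sst><si><t>Hello</t></si><si/></sst>';
open( my $fh, '<', \$xml ) or die "Cannot open in-memory handle: $!";

my ( @stack, @depth_log );
{
    local $/ = '<';
    while ( defined( my $chunk = <$fh> ) ) {
        chomp $chunk;
        my ($tag) = split />/, $chunk, 2;
        next unless defined $tag and length $tag;
        next if $tag =~ /^[?!]/;       # header nodes are not stacked
        if ( $tag =~ m{^/} ) {         # closing tag: drop the open node
            pop @stack;
        }
        elsif ( $tag !~ m{/$} ) {      # opening tag: remember its name
            push @stack, ( split /\s+/, $tag )[0];
        }                              # self closing tags leave the stack alone
        push @depth_log, scalar @stack;
    }
}
close $fh;

print "depth after each node: @depth_log\n";
print "end of file reached\n" unless @stack;
```

Only the names of the open nodes are held at any time, and an empty stack after the last closing tag signals that the reader is back at the top.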
Definition: Generally this is an internal method and should not be used. However, in order to provide a faster forward ability the node stack trace(ing) can be turned off. When you want to turn it back on you have to manually build the top node using this method and store it to the node stack using add_node_to_stack. This method will build the essentials for adding to the node stack. Please note that it will not necessarily get the node level right. If you need that to be correct then don't turn off the stack trace. It will not build raw_text nodes correctly.
Accepts: $node_name = a string without spaces for the name of the node, $attribute_list_ref = This is basically everything else in the xml tag except the name split on /\s+/. Any self closing '/' should be removed prior to the split.
Returns: a node ref that can be added to the node stack to kickstart stack tracing
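Preparing the two arguments from a raw tag string follows directly from the description above. The sketch below assumes a made-up tag string and only shows the clean up and the split; it does not call the class.

```perl
use strict;
use warnings;

# Assumed input: the text found between '<' and '>' for one tag
my $raw_tag = 'row r="2" spans="1:3"/';

( my $clean = $raw_tag ) =~ s{/\s*$}{};    # remove any self closing '/' first
my ( $node_name, @attribute_list ) = split /\s+/, $clean;
my $attribute_list_ref = \@attribute_list;

print "name: $node_name\n";
print "attribute: $_\n" for @$attribute_list_ref;
```

Note that a plain split on /\s+/ would also break apart an attribute value that contains spaces, so this only suits tags whose values are free of whitespace.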
Definition: Generally this is an internal method and should not be used. However, in order to provide a faster forward ability the node stack trace(ing) can be turned off. When you want to turn it back on you have to manually build the top node and store it to the node stack using this method. Adding a node after the stack trace has been turned off will create a discontinuity where the new node is added. Stack trace operations above this node will generally fail and stop the script.
Accepts: $node_ref = a top node to push on the node stack for traceability
github Spreadsheet::Reader::ExcelXML/issues
1. Nothing currently
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
This software is copyrighted (c) 2016 by Jed Lund
Spreadsheet::Reader::ExcelXML - the package
Spreadsheet::Read - generic Spreadsheet reader
Spreadsheet::ParseExcel - Excel binary version 2003 and earlier (.xls files)
Spreadsheet::XLSX - Excel version 2007 and later
Spreadsheet::ParseXLSX - Excel version 2007 and later
Log::Shiras
All lines in this package that use Log::Shiras are commented out