NAME

Spreadsheet::Reader::ExcelXML::XMLReader - A minimal pure-perl xml reader class

SYNOPSIS

        package MyPackage;
        use MooseX::StrictConstructor;
        use MooseX::HasDefaults::RO;
        # You have to 'use' or build the Workbook here or the XMLReader won't load
        #  -> because the reader uses a regex to scrape imported methods
        use Spreadsheet::Reader::ExcelXML::Workbook;
        extends 'Spreadsheet::Reader::ExcelXML::XMLReader';

DESCRIPTION

This documentation is written to explain ways to use this module when writing your own Excel spreadsheet parser. I suppose the class could be used more generally, but that's not why I wrote it, and for now I have no intention of providing a full xml toolbox. For Excel spreadsheet parsing in general, please start at the top level documentation for Workbooks, Worksheets, and Cells.

This class is meant to be used as the base reading class for specific types of xml files. The reader for those specific files will include roles that are useful for that file's content. When the file first loads it will store some available information from the header (?) nodes and move to the first file node. At that point it will check whether any of the consuming roles have a method '_load_unique_bits'. If so, it will call that method so the role can collect additional metadata.

This class will process the xml file in a just in time fashion, holding enough information to know the level and the open nodes not yet closed, but nothing else. The intent is to use as little RAM as possible and to process the file in the most computationally efficient way possible (for pure perl). I welcome all suggestions for improvement.

Attributes

Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For general information on attributes see Moose::Manual::Attributes. For ways to manage the instance after it is opened see the Methods.

file

    Definition: This attribute holds the file handle for the file being read. If the full file name and path is passed to the attribute the class will coerce that into an IO::File file handle.

    Default: no default - this must be provided to read a file

    Required: yes

    Range: any unencrypted xml file name and path or IO::File file handle set to read.

    attribute methods Methods provided to adjust this attribute

      set_file

        Definition: change the file value in the attribute (this will reboot the file instance and lock the file)

      get_file

        Definition: Returns the file handle of the file even if a file name was passed

      has_file

        Definition: this is used to see if the file loaded correctly.

      clear_file

        Definition: this clears (and unlocks) the file handle

    Delegated Methods

workbook_inst

xml_version

    Definition: This stores the xml version read from the xml header. It is read when the file handle is first set in this sheet.

    Default: no default - this is auto read from the header

    Required: no

    Range: xml versions

    attribute methods Methods provided to adjust this attribute

      version

        get the stored xml version

xml_encoding

    Definition: This stores the data encoding of the xml file from the xml header. It is read when the file handle is first set in this sheet.

    Default: no default - this is auto read from the header

    Required: no

    Range: valid xml file encoding

    attribute methods Methods provided to adjust this attribute

      encoding

        get the attribute value

      has_encoding

        predicate for the attribute value

xml_progid

    Definition: This is an attribute found in a secondary xml header that is associated with Excel 2003 xml based files. The value can be tested to see if the file was intended to be compliant with that format.

    Default: no default - this is auto read from the header

    Required: no

    Range: a string

    attribute methods Methods provided to adjust this attribute

      progid

        get the attribute value

      has_progid

        predicate for the attribute value

xml_header

    Definition: This stores the primary xml header string from the xml file. It is read when the file handle is first set in this sheet. It contains both the version and the encoding where available and is used when building subsets of the file as standalone xml.

    Default: no default - this is auto read from the header

    Required: no

    Range: valid xml file header

    attribute methods Methods provided to adjust this attribute

      get_header

        get the attribute value

      _set_xml_header

        set the attribute value

xml_doctype

    Definition: This stores the DOCTYPE indicated in the XML header !DOCTYPE

    Default: no default - this is auto read from the header

    Required: no

    Range: whatever it finds

    attribute methods Methods provided to adjust this attribute

      doctype

        get the attribute value

      has_doctype

        predicate for the attribute

position_index

    Definition: This attribute is available to facilitate other consuming roles and classes. Of this attribute's methods, only the 'clear_location' method is used in this class, during the start_the_file_over method. It can be used for tracking positions with the same node name.

    Default: no default - this is mostly managed by the role or child class

    Required: no

    Range: Integer

    attribute methods Methods provided to adjust this attribute

      where_am_i

        get the attribute value

      i_am_here

        set the attribute value

      clear_location

        clear the attribute value

      has_position

        predicate for the attribute value

file_type

    Definition: This is a static attribute that shows the file type

    Default: xml

    attribute methods Methods provided to adjust this attribute

      get_file_type

        get the attribute value

stacking

    Definition: A pure perl xml parser will in general be slower than the C equivalent. To speed up arrival at a target destination you can turn off the stack trace, which skips building and storing the trace elements. This breaks things, so don't do it without a solid understanding of what is happening. For instance, if you turn this off and then call the method parse_element, the parse_element method will have to turn the stack trace back on by itself to build the element tree. The issue is that the most recent element at the base of the tree won't be available to build from. You will need to manually build it and push it to the stack. See the methods initial_node_build and add_node_to_stack to implement this.

    Default: 1 = the stack trace is on

    attribute methods Methods provided to adjust this attribute

      should_be_stacking

        get the attribute value

      change_stack_storage_to( $Bool )

        Turn the stack trace(r) state to $Bool (1 = on)

Methods

These are the methods provided by this class.

start_the_file_over

    Definition: Clears the position_index, the old stack trace, and kick starts stack trace tracking again. It then uses seek(0, 0) to reset the file handle to the beginning. Finally, it reads the file until it gets to the first non-xml header node.

    Accepts: nothing

    Returns: nothing

good_load( $state )

    Definition: a setter method to indicate whether the file loaded correctly. This generally should be set by consuming roles in the load_unique_bits phase.

    Accepts: (1|0)

    Returns: nothing

loaded_correctly

    Definition: a getter method to understand if the file loaded correctly. This is generally used by consumers of the instance to see if there was any trouble during the initial build.

    Accepts: nothing

    Returns: 1 = good build, 0 = bad_build

parse_element( [$depth] )

    Definition: This will read and store the full node from the current position down to an optional $depth. When the parse is complete the parser will be positioned at the beginning of the next node. The node does not include the top name but will include attributes.

    Accepts: $depth = optional

    Returns: A perl hash reference where all nodes at a level are listed using three hashref keys: list_keys, list, and attributes. The 'attributes' key points to a hash reference containing that node's attributes. The 'list_keys' key points to an array reference with the node names for each node at the next level down. The 'list' key points to an array reference of nodes or node values matching the positions of the list_keys. There are two special case exceptions to this. First, for text values the node is listed as { raw_text => 'text node content' }. Second, if the attributes only include a 'val' key, the node stores this under the 'val' key rather than under the 'attributes' key with a sub key 'val'.
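
    As a rough illustration of the return shape described above, the sketch below hand-builds the kind of hash reference parse_element might return for a small fragment and walks it. The xml fragment, the variable names ($node, $summary), and the exact nesting of the child nodes are assumptions made for illustration; the structure actually produced by the module may differ in detail.

```perl
use strict;
use warnings;

# Hand-built illustration only (not produced by the module). Assumed input:
#   <si count="1"><t>Hello</t><r val="2"/></si>
# The top node name ('si') is not part of the result, but its attributes are.
my $node = {
    attributes => { count => 1 },
    list_keys  => [ 't', 'r' ],
    list       => [
        { raw_text => 'Hello' },    # special case 1: text value
        { val      => 2 },          # special case 2: lone 'val' attribute
    ],
};

# list_keys and list are parallel arrays, so they are walked by index
my @pairs;
for my $i ( 0 .. $#{ $node->{list_keys} } ) {
    my $child = $node->{list}[$i];
    my $value = exists $child->{raw_text} ? $child->{raw_text} : $child->{val};
    push @pairs, "$node->{list_keys}[$i]=$value";
}
my $summary = join ', ', @pairs;
print "$summary\n";
```

    Keeping names and values in two parallel arrays, rather than one hash, allows repeated child names at the same level to be kept in document order.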

advance_element_position( $element, [$iterations] )

    Definition: This will move the xml file reader forward until it finds the identified named $element. If the reader is already at an element of that name it will index forward until it finds the next $element of that name. If the optional positive $iterations integer is passed it will index to the named $element - $iterations times.

    Accepts: $element = a case sensitive xml node name found forward of the current position in the file. [$iterations] = optional a positive integer indicating how many times to index forward to the named $element.

    Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )

    $success = a boolean value indicating whether the desired goal was met, $node_name = the actual node name for the final position (should match $element if $success), $node_level = the level of the final named node in the stack (not the sub text node), $return_node_ref = when the stacking attribute is on, this returns the last elements displaced from the stack by the traverse of the xml tree. When stacking is off, this returns an array ref of values used as the second argument in initial_node_build.
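
    The four value return list can be used to drive a simple loop. The following is a minimal sketch, assuming $reader is an already built instance of a class that extends this one; the helper name count_elements is hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many $element nodes lie forward of the
# reader's current position. Only the first returned value ($success) is
# needed to decide when to stop.
sub count_elements {
    my ( $reader, $element ) = @_;
    my $count = 0;
    while ( 1 ) {
        my ( $success, $node_name, $node_level, $return_node_ref ) =
            $reader->advance_element_position( $element );
        last if !$success;
        $count++;
    }
    return $count;
}
```

    Because the method indexes forward when the reader already sits on a node of that name, a node the reader is positioned on when the helper is called is not counted.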

next_sibling

    Definition: This will move the xml file reader forward until it finds the next node at the same level as the current node within the same supernode. If this method finds a higher node prior to finding a node at the same level, it will return failure and stop reading.

    Accepts: nothing

    Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )

    $success = a boolean value indicating whether the desired goal was met, $node_name = the actual node name for the final position (should match $element if $success), $node_level = the level of the final named node in the stack (not the sub text node), $return_node_ref = when the stacking attribute is on, this returns the last elements displaced from the stack by the traverse of the xml tree. When stacking is off, this returns an array ref of values used as the second argument in initial_node_build.

skip_siblings

    Definition: This will move the xml file reader forward until it finds the next node at a higher level. It will not stop on end nodes, so it will continue to pass all closed nodes until it comes to the first open or self contained node above the current node.

    Accepts: nothing

    Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )

    $success = a boolean value indicating whether the desired goal was met, $node_name = the actual node name for the final position (should match $element if $success), $node_level = the level of the final named node in the stack (not the sub text node), $return_node_ref = when the stacking attribute is on, this returns the last elements displaced from the stack by the traverse of the xml tree. When stacking is off, this returns an array ref of values used as the second argument in initial_node_build.

current_named_node

    Definition: when processing xml files in a just in time fashion there will be some ambiguity surrounding text nodes;

            <t>sometext</t>
            <s>
               <r val="2"/>

    In the 't' node example the content between the '>' character and the '<' character is intentional and valuable to the data set. In the 's' and 'r' node example the space between those characters is only intended for human readability. This parser will not be able to tell the value of the content after the 's' node '>' character until the 'r' node is read. At that point the 's' node will no longer be the 'current' position. To resolve this, all content other than '' between '>' and '<' is treated as a node until the next node is read. Because these nodes are ambiguous, the idea of a 'named node' is valuable, and knowing what the most recent named node is can be useful. This method returns either the last read node or, if the last node is a raw text node, the second to last node. In the first example it would return the 't' node and in the second example it would return the 's' node.

    Accepts: nothing

    Returns: a hash ref of information about the node containing the following keys;

            level => counting from 0 at the start of the file and moving up
            type => regular = xml named node|#text = node built from the contents between the > and < characters
            name => the xml node name (for #text nodes this is 'raw_text')
            closed => (closed|open) depending on the current tag state
            initial_string => The string inside the < > quotes prior to parsing
            [attributes] => all attributes and values will be stored under the attribute name
            [val] => special case storage of one attribute
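
    A minimal sketch of reading the returned hash ref, assuming $reader is an already built instance of a class that extends this one. The helper name describe_position is hypothetical; only the documented 'name', 'level', and 'closed' keys are used.

```perl
use strict;
use warnings;

# Hypothetical helper: build a one line description of the most recent
# named node from the documented keys of the returned hash ref.
sub describe_position {
    my ( $reader ) = @_;
    my $node = $reader->current_named_node;
    return sprintf '%s (level %d, %s)',
        $node->{name}, $node->{level}, $node->{closed};
}
```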

squash_node( $node )

    Definition: This takes a $node from the parse_element output and turns it into a more perl like reference. It checks the list_keys and, if there are any duplicates, it takes the list values and uses them as elements of an array ref assigned to a hash key called list. If there are no duplicates in the list_keys, it turns the list_keys into hash keys with the list elements assigned as values. It then takes the attributes and merges them into the hashref with the prior results. There are two special cases for node reorganization. For nodes with a 'val' in the 'list_keys', the element in the same position of the 'list' is returned as the whole ref. If there is a raw_text node, it is returned as a hashref with one key 'raw_text' with the text itself as the value. This is all done recursively, so lower layers are assigned to upper layers using the rules above.

    Accepts: the output of a parse_element call

    Returns: a perl data structure with the xml organization removed
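
    The general rules above can be approximated in a few lines of perl. The function below is a simplified sketch written for this explanation, not the module's implementation: the name squash_sketch is hypothetical, it keeps raw_text and lone 'val' nodes as they are, and it deliberately leaves out the special case for a 'val' entry inside the 'list_keys'.

```perl
use strict;
use warnings;

# Simplified, hypothetical approximation of the squash rules.
sub squash_sketch {
    my ( $node ) = @_;
    return $node if ref( $node ) ne 'HASH';      # plain values pass through
    return $node if exists $node->{raw_text};    # raw text stays a one key hashref

    my @keys = @{ $node->{list_keys} || [] };
    my @list = map { squash_sketch( $_ ) } @{ $node->{list} || [] };    # recurse

    my ( %seen, %result );
    my $has_duplicates = grep { $seen{$_}++ } @keys;
    if ( $has_duplicates ) {
        $result{list} = \@list;       # duplicate names: keep an ordered array
    }
    else {
        @result{@keys} = @list;       # unique names: node names become hash keys
    }
    $result{val} = $node->{val} if exists $node->{val};
    %result = ( %result, %{ $node->{attributes} || {} } );    # merge attributes in
    return \%result;
}
```

    In this sketch an attribute with the same name as a child node overwrites the child; how the module itself resolves such a collision is not stated here.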

extract_file( @node_list )

    Definition: This will build an xml file and load it to an IO::Handle->new_tmpfile object. The xml is built from whole extracted xml strings defined by @node_list. If none of the node list elements is found in the parsed file, then the first listed element from the node list will be used to create an empty self closing node.

    Accepts: @node_list = Node list items can either be xml node name strings or array refs composed of two elements, first the node name and second the iterated position. Ex.

            @node_list_example = ( 'r', [ 'si', 3 ] );

    In this example the extracted file would contain the first 'r' node and the 3rd 'si' node. There is an exception case where you just want the whole file passed. To do this, pass 'ALL_FILE' as the first element of the @node_list and a complete copy of the file_handle in read mode will be passed.

    Returns: a File::Temp file handle loaded with an xml header and the listed nodes.
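
    A minimal usage sketch, assuming $reader is an already built instance of a class that extends this one. The helper name extract_as_string is hypothetical, and the explicit seek is a precaution, since whether the returned handle is already positioned at the start is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the listed nodes and return them as one string.
sub extract_as_string {
    my ( $reader, @node_list ) = @_;
    my $fh = $reader->extract_file( @node_list );
    seek $fh, 0, 0;    # precaution (assumption): make sure reading starts at the top
    local $/;          # slurp mode for the read below
    my $xml = <$fh>;
    return $xml;
}

# Example call using the node list shown above:
#   my $xml = extract_as_string( $reader, 'r', [ 'si', 3 ] );
```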

current_node_parsed

    Definition: When nodes are read they are not completely processed to save cycles. If you want a fully processed result from the current node position including any embedded text then this is the method for you.

    Accepts: Nothing

    Returns: a perl ref equivalent to the squash_node call. This only returns the fully processed current_named_node and any sub text nodes.

close_the_file

    Definition: The file (handle) may not be needed during the whole workbook parse. If so, you can use this method to close (and clear / release) an open file handle as appropriate.

    Accepts: Nothing

    Returns: Nothing (the file handle is closed and cleared)

not_end_of_file

    Definition: This is a poor man's End Of File test (EOF). The reader builds a node stack to keep track of where it is in the xml parse, and when it runs out of nodes it means you are back at the top of the stack.

    Accepts: Nothing

    Returns: a count of the nodes in the node stack (header nodes are processed early on and are read and removed as part of startup)
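
    Since the return value is a count, it can serve directly as a loop condition. The following is a minimal sketch that pairs it with next_sibling, assuming $reader is an already built instance of a class that extends this one; the helper name sibling_names is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: gather the names of the remaining sibling nodes,
# stopping at the end of the file or when next_sibling reports failure.
sub sibling_names {
    my ( $reader ) = @_;
    my @names;
    while ( $reader->not_end_of_file ) {
        my ( $success, $node_name ) = $reader->next_sibling;
        last if !$success;
        push @names, $node_name;
    }
    return \@names;
}
```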

initial_node_build( $node_name, $attribute_list_ref )

    Definition: Generally this is an internal method and should not be used. However, in order to provide a faster forward ability, the node stack trace(ing) can be turned off. When you want to turn it back on you have to manually build the top node using this method and store it to the node stack using add_node_to_stack. This method will build the essentials for adding to the node stack. Please note that it will not necessarily get the node level right. If you need that to be correct then don't turn off the stack trace. It will not build raw_text nodes correctly.

    Accepts: $node_name = a string without spaces for the name of the node, $attribute_list_ref = This is basically everything else in the xml tag except the name split on /\s+/. Any self closing '/' should be removed prior to the split.

    Returns: a node ref that can be added to the node stack to kickstart stack tracing

add_node_to_stack( $node_ref )

    Definition: Generally this is an internal method and should not be used. However, in order to provide a faster forward ability the node stack trace(ing) can be turned off. When you want to turn it back on you have to manually build the top node and store it to the node stack using this method. Adding a node after the stack trace has been turned off will create a discontinuity where the new node is added. Stack trace operations above this node will generally fail and stop the script.

    Accepts: $node_ref = a top node to push on the node stack for traceability

    Returns: nothing
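
    The pieces described under stacking, initial_node_build, and add_node_to_stack can be combined roughly as follows. This is a sketch under stated assumptions: $reader is an already built instance of a class that extends this one, the helper name fast_forward_to is hypothetical, and the order chosen here (stacking switched back on before the node is rebuilt and pushed) is a guess that may need adjusting.

```perl
use strict;
use warnings;

# Hypothetical helper: jump to $element with the stack trace off, then
# rebuild the top node so stack tracing can continue from there.
sub fast_forward_to {
    my ( $reader, $element ) = @_;
    $reader->change_stack_storage_to( 0 );    # stop building the stack trace
    my ( $success, $node_name, $node_level, $return_node_ref ) =
        $reader->advance_element_position( $element );
    $reader->change_stack_storage_to( 1 );    # resume stack tracing (assumed order)
    return 0 if !$success;

    # With stacking off, $return_node_ref holds the values documented as the
    # second argument of initial_node_build
    my $node_ref = $reader->initial_node_build( $node_name, $return_node_ref );
    $reader->add_node_to_stack( $node_ref );
    return 1;
}
```

    As noted above, the rebuilt node may carry the wrong level, and stack operations above it will generally fail, so this approach only suits cases where nothing above the target node is needed afterwards.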

SUPPORT

TODO

    1. Nothing currently

AUTHOR

Jed Lund
jandrew@cpan.org

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

This software is copyrighted (c) 2016 by Jed Lund

DEPENDENCIES

SEE ALSO