XML::Xerces::DOMParse - A Perl module for parsing DOMs.


        # Here's an example that reads in an XML file named on the
        # command line, removes all formatting, re-adds formatting,
        # and then prints the DOM to standard output.

        use XML::Xerces;
        use XML::Xerces::DOMParse;

        my $parser = XML::Xerces::DOMParser->new();
        $parser->parse ($ARGV[0]);
        my $doc = $parser->getDocument ();

        XML::Xerces::DOMParse::unformat ($doc);
        XML::Xerces::DOMParse::format ($doc);
        XML::Xerces::DOMParse::print (\*STDOUT, $doc);


Use this module in conjunction with XML::Xerces. Once you have read an XML file into a DOM tree in memory, this module provides routines for recursive descent parsing of that tree. It also provides three concrete functions to format, unformat, and print DOM trees, all of which are built on the more general parsing functions.


DOMParse::unformat ($node)

Processes $node and its children recursively and removes all white space text nodes. It is often difficult to modify a DOM tree that contains formatting white space and still end up with reasonable formatting. Use unformat to remove the formatting, process the unformatted DOM, and then use format to add back formatting that is reasonable for the new tree.

DOMParse::format ($node)

Processes $node and its children recursively and introduces white space text nodes to create a DOM tree that will print with reasonable indents and newlines. Only call format on a DOM tree that has no formatting white space in it; otherwise the results will be incorrect. Call unformat first to remove formatting white space.

You can optionally set the string variable $INDENT to the indent characters you want to use. By default it is a single tab.

DOMParse::print ($file_handle, $node)

Processes $node and its children recursively and prints the DOM tree to $file_handle as a standard XML file. You can override printing behavior by supplying any of several "printer" functions.


Some of these printers call other printers. For example, $NODE_PRINTER determines the node type and calls the corresponding printer for that type, e.g. $ELEMENT_NODE_PRINTER. So if you replace a printer for a node that has children, you are responsible for calling the child node printers yourself.

All printers take two parameters, a file handle and the node. See DOMParse::parse_nodes and DOMParse::parse_child_nodes for details.

A replacement printer can add its own behavior and then call the default processing, as follows.

        my $original_text_node_printer = $TEXT_NODE_PRINTER;
        $TEXT_NODE_PRINTER = \&my_text_node_printer;

        sub my_text_node_printer {
          my ($fh, $node) = @_;
          # look at the text node and do something extra,
          # then delegate to the printer that was saved above
          return $original_text_node_printer->($fh, $node);
        }

The $ESCAPE variable (integer) controls whether special XML characters such as the ampersand "&" are escaped, e.g. as "&amp;". Set $ESCAPE to 1 (default) to escape special characters, or to 0 to print characters literally.

Call print_string whenever you need to expand special characters (& < > ") to their escape sequence equivalents. print_string is used extensively by the default implementation of DOMParse::print. When you replace node printers, be careful to use it to print node and attribute names and values (but probably not anything else).

The print function respects the global $ESCAPE flag. By default it is set to true (1) and escape conversion is performed. Set it to false (0) when you don't want escape conversion.
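As a rough illustration of the escape conversion described above, here is a stand-alone sketch that maps the four special characters to their entity equivalents. The name escape_xml and its $escape argument are hypothetical; this is not the module's print_string, only an approximation of the conversion for those four characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the escape conversion; not part of
# XML::Xerces::DOMParse.
my %ENTITY = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub escape_xml {
    my ($text, $escape) = @_;
    return $text unless $escape;          # 0: leave characters literal
    $text =~ s/([&<>"])/$ENTITY{$1}/g;    # 1: replace each special character
    return $text;
}

print escape_xml('a < b & "c"', 1), "\n";
print escape_xml('a < b & "c"', 0), "\n";
```

The substitution runs in a single pass, so the ampersands it introduces are not escaped a second time.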

parse_nodes ($node, $process_node, $data)

Call parse_nodes to parse $node and all of its children recursively. Each node will be visited and your parsing function, $process_node, will be called. Optional data $data will be passed through if provided.

Your parsing function must have the following signature.

        process_node ($node, $data)

If it returns 1 then children of $node will also be parsed. If it returns 0 then they won't. It is common to use one parsing function to get to a certain level in the DOM tree, then to return 0 and to call parse_child_nodes to parse nodes under that level with a different processing function.

parse_child_nodes ($node, $process_node, $data)

Call parse_child_nodes to parse the children of $node recursively. This is just like parse_nodes except that $node itself is not parsed.
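The return-value contract and the two-phase pattern described above can be sketched without a real DOM. Everything in this example is hypothetical: the hash-based tree stands in for DOM nodes, and walk_nodes and walk_child_nodes are illustrative stand-ins that mimic the documented behavior of parse_nodes and parse_child_nodes, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical hash-based tree standing in for a DOM; not the Xerces API.
my $tree = {
    name => 'doc',
    kids => [
        { name => 'head', kids => [ { name => 'title', kids => [] } ] },
        { name => 'body', kids => [ { name => 'p',     kids => [] } ] },
    ],
};

my @visited;

# Visit $node, then descend only if the callback returns true.
sub walk_nodes {
    my ($node, $process_node, $data) = @_;
    walk_child_nodes($node, $process_node, $data)
        if $process_node->($node, $data);
}

# Same walk, but $node itself is not passed to the callback.
sub walk_child_nodes {
    my ($node, $process_node, $data) = @_;
    walk_nodes($_, $process_node, $data) for @{ $node->{kids} };
}

# Second-phase callback: record every node below 'body'.
sub in_body {
    my ($node, $data) = @_;
    push @$data, "body/$node->{name}";
    return 1;
}

# First-phase callback: stop at 'body' and hand its children to in_body.
sub top_level {
    my ($node, $data) = @_;
    push @$data, $node->{name};
    if ($node->{name} eq 'body') {
        walk_child_nodes($node, \&in_body, $data);
        return 0;    # children were already handled by in_body
    }
    return 1;        # keep descending
}

walk_nodes($tree, \&top_level, \@visited);
print join(' ', @visited), "\n";
```

With the real module, the same shape would apply with parse_nodes and parse_child_nodes in place of the stand-ins and DOM nodes in place of the hashes.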

doc ($node)

Walks up the DOM tree from the given $node until it reaches the document node associated with it, then returns that document node.

depth ($node)

Returns the depth of the specified $node in the DOM document. The document has depth 0, the root element has depth 1, and so on.
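The depth convention, and the upward walk that doc performs, can be pictured with a small stand-alone sketch. The nodes here are hypothetical hashes linked by a parent reference, and find_doc and node_depth are illustrative names, not the module's functions.

```perl
use strict;
use warnings;

# Hypothetical nodes linked only by a 'parent' reference; not the Xerces API.
my $document = { name => '#document', parent => undef };
my $root     = { name => 'root',      parent => $document };
my $child    = { name => 'item',      parent => $root };

# Follow parent links until there is no parent: that node is the document.
sub find_doc {
    my ($node) = @_;
    $node = $node->{parent} while defined $node->{parent};
    return $node;
}

# Count the parent links between $node and the document.
sub node_depth {
    my ($node) = @_;
    my $depth = 0;
    while (defined $node->{parent}) {
        $node = $node->{parent};
        $depth++;
    }
    return $depth;
}

printf "%s has depth %d\n", $_->{name}, node_depth($_)
    for $document, $root, $child;
```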

element_text ($node)

It is common practice to have an element node that encloses a single text node. If you know you have such a node, you can call element_text to directly access the enclosed text as a string. This is faster than accessing the enclosed text node and then getting the value of it.

insert_before ($ref_node, $new_node)

Inserts $new_node in the DOM tree immediately before and as a sibling of $ref_node. It is safe to call insert_before while in the middle of parsing a DOM tree if $ref_node is the current node being parsed. The newly inserted node will not be parsed.

insert_after ($ref_node, $new_node)

Inserts $new_node in the DOM tree immediately after and as a sibling of $ref_node. It is safe to call insert_after while in the middle of parsing a DOM tree if $ref_node is the current node being parsed. The newly inserted node will not be parsed.

remove ($node)

Removes $node from the DOM tree. It is safe to call remove while in the middle of parsing a DOM tree if $node is the current node being parsed. The next node to be parsed will be the same one that would have been parsed had $node not been removed, e.g. $node's next sibling.
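How the module achieves this guarantee is not stated here. One common approach, sketched below under that assumption, is to decide which siblings to visit before the current node is handed to the callback, so that removing the current node cannot disturb the iteration. The hash-based parent and the remove_kid helper are hypothetical and are not part of the Xerces API.

```perl
use strict;
use warnings;

# Hypothetical parent with an ordered child list; not the Xerces API.
my $parent = { kids => [ map { +{ name => $_ } } qw(a b c d) ] };

# Detach $node from $parent's child list.
sub remove_kid {
    my ($parent, $node) = @_;
    @{ $parent->{kids} } = grep { $_ != $node } @{ $parent->{kids} };
}

my @seen;

# Iterate over a snapshot so that removing the current node
# cannot cause its next sibling to be skipped.
my @snapshot = @{ $parent->{kids} };
for my $node (@snapshot) {
    push @seen, $node->{name};
    remove_kid($parent, $node) if $node->{name} eq 'b';
}

print "visited:   @seen\n";
print "remaining: @{[ map { $_->{name} } @{ $parent->{kids} } ]}\n";
```

Every original sibling is still visited once, and only the removed node is missing from the child list afterwards.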


Tom Watson wrote version 1.0 and submitted it to the XML Apache project, where you can contribute to future versions and where the corresponding C++ and Java parsers are also developed as open source projects.

Jason Stewart adapted it to the Xerces-1.3 API.


Any comments or questions about this module can be addressed to the development list.