NAME

XML::Rules - parse XML & process tags by rules starting from leaves

VERSION

Version 0.09

SYNOPSIS

    use XML::Rules;

	$xml = <<'*END*';
	<doc>
	 <person>
	  <fname>...</fname>
	  <lname>...</lname>
	  <email>...</email>
	  <address>
	   <street>...</street>
	   <city>...</city>
	   <country>...</country>
	   <bogus>...</bogus>
	  </address>
	  <phones>
	   <phone type="home">123-456-7890</phone>
	   <phone type="office">663-486-7890</phone>
	   <phone type="fax">663-486-7000</phone>
	  </phones>
	 </person>
	 <person>
	  <fname>...</fname>
	  <lname>...</lname>
	  <email>...</email>
	  <address>
	   <street>...</street>
	   <city>...</city>
	   <country>...</country>
	   <bogus>...</bogus>
	  </address>
	  <phones>
	   <phone type="office">663-486-7891</phone>
	  </phones>
	 </person>
	</doc>
	*END*

	@rules = (
		_default => sub {$_[0] => $_[1]->{_content}},
			# by default I'm only interested in the content of the tag, not the attributes
		bogus => undef,
			# let's ignore this tag and all inner ones as well
		address => sub {address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})"},
			# merge the address into a single string
		phone => sub {$_[1]->{type} => $_[1]->{_content}},
			# let's use the "type" attribute as the key and the content as the value
		phones => sub {delete $_[1]->{_content}; %{$_[1]}},
			# remove the text content and pass along the type => content from the child nodes
		person => sub { # let's print the values, all the data is readily available in the attributes
			print "$_[1]->{lname}, $_[1]->{fname} <$_[1]->{email}>\n";
			print "Home phone: $_[1]->{home}\n" if $_[1]->{home};
			print "Office phone: $_[1]->{office}\n" if $_[1]->{office};
			print "Fax: $_[1]->{fax}\n" if $_[1]->{fax};
			print "$_[1]->{address}\n\n";
			return; # the <person> tag is processed, no need to remember what it contained
		},
	);
	$parser = XML::Rules->new(rules => \@rules);
	$parser->parse( $xml);

CONSTRUCTOR

my $parser = XML::Rules->new(
	rules => \@rules,
	[ start_rules => \@start_rules, ]
	[ style => 'parser' / 'filter', ]
	# and optionally parameters passed to XML::Parser::Expat
);

Options passed to XML::Parser::Expat: ProtocolEncoding, Namespaces, NoExpand, Stream_Delimiter, ErrorContext, ParseParamEnt, Base

The style specifies whether you want to build a parser used to extract stuff from the XML or a filter used to modify the XML. If you specify style => 'filter' then all tags that neither have a subroutine rule nor occur inside a tag that has one are copied to the output filehandle passed to the ->filter() or ->filterfile() methods.

The Rules

The rules option may be either an arrayref or a hashref; the module doesn't care. If you want to use regexps to specify groups of tags to be handled by the same rule, you should use the arrayref, because only an array preserves the order of the rules. The rules array/hash is made of pairs in the form

tagspecification => action

where the tagspecification may be either the name of a tag, a string containing a comma or pipe ( "|" ) delimited list of tag names, a string containing a regexp enclosed in // with optional parameters, or a qr// compiled regular expression. The tag names and tag name lists take precedence over the regexps; the regexps are (in case of arrayref only!!!) tested in the order in which they are specified.

These rules are evaluated/executed whenever a tag is fully parsed, including all the content and child tags. They may access the content and attributes of the specified tag plus the stuff produced by the rules evaluated for the child tags.
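
The tag specification forms described above can be mixed in a single arrayref. The following is only a sketch: the sample XML, the tag names and the @people collector are made up for the illustration, and every tag gets an explicit rule so that the example does not depend on any fallback behaviour.

```perl
use strict;
use warnings;
use XML::Rules;

my @people;    # hypothetical collector filled by the <person> rule

my $parser = XML::Rules->new(
    rules => [
        'fname,lname' => 'content',    # comma-delimited list of tag names
        'email|nick'  => 'content',    # pipe-delimited list of tag names
        qr/_phone$/   => sub {         # compiled regexp: <home_phone>, <office_phone>, ...
            '@phones' => "$_[0]: $_[1]->{_content}";
        },
        person => sub {                # plain tag name
            my ($tag, $attrs) = @_;
            push @people, sprintf '%s, %s <%s> [%s]',
                $attrs->{lname}, $attrs->{fname}, $attrs->{email},
                join(', ', @{ $attrs->{phones} });
            return;                    # nothing to pass up to <doc>
        },
        doc => sub { return },
    ],
);

$parser->parse(<<'XML');
<doc>
 <person>
  <fname>Jane</fname><lname>Doe</lname>
  <email>jane@example.com</email>
  <home_phone>123-456-7890</home_phone>
  <office_phone>663-486-7890</office_phone>
 </person>
</doc>
XML

print "$_\n" for @people;
```

The regexp rule is placed in an arrayref on purpose, since that is the only form in which the order of the regexps is defined.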

The action may be either

an undef or empty string = ignore the tag and all its children
a subroutine reference = the subroutine will be called to handle the tag data&contents
'content' = only the content of the tag is preserved and added to
	the parent tag's hash as an attribute named after the tag
	sub { $_[0] => $_[1]->{_content}}
'content trim' = only the content of the tag is preserved, trimmed and added to
	the parent tag's hash as an attribute named after the tag
	sub { s/^\s+//,s/\s+$// for ($_[1]->{_content}); $_[0] => $_[1]->{_content}}
'as is' = the tag's hash is added to the parent tag's hash
	as an attribute named after the tag
	sub { $_[0] => $_[1]}
'as is trim' = the tag's hash is added to the parent tag's hash
	as an attribute named after the tag, the content is trimmed
	sub { s/^\s+//,s/\s+$// for ($_[1]->{_content}); $_[0] => $_[1]}
'as array' = the tag's hash is pushed to the attribute named after the tag
	in the parent tag's hash
	sub { '@'.$_[0] => $_[1]}
'as array trim' = the tag's hash is pushed to the attribute named after the tag
	in the parent tag's hash, the content is trimmed
	sub { s/^\s+//,s/\s+$// for ($_[1]->{_content}); '@'.$_[0] => $_[1]}
'no content' = the _content is removed from the tag's hash and the hash
	is added to the parent's hash into the attribute named after the tag
	sub { delete $_[1]->{_content}; $_[0] => $_[1]}
'no content array' = similar to 'no content' except the hash is pushed
	into the array referenced by the attribute
'as array no content' = same as 'no content array'
'pass' = the tag's hash is dissolved into the parent's hash,
	that is all tag's attributes become the parent's attributes.
	The _content is appended to the parent's _content.
	sub { %{$_[1]}}
'pass no content' = the _content is removed and the hash is dissolved
	into the parent's hash.
	sub { delete $_[1]->{_content}; %{$_[1]}}
'pass without content' = same as 'pass no content'
'raw' = the [tagname => attrs] is pushed to the parent tag's _content.
	You would use this style if you wanted to be able to print
	the parent tag as XML preserving the whitespace or other textual content
	sub { [$_[0] => $_[1]]}
'raw extended' = the [tagname => attrs] is pushed to the parent tag's _content
	and the attrs are added to the parent's attribute hash with ":$tagname" as the key
	sub { (':'.$_[0] => $_[1], [$_[0] => $_[1]])}
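
To see how a few of the string actions listed above combine, here is a small sketch. The <order> document and the $order variable are invented for the example; only the 'content trim' and 'as array' actions and a plain subroutine rule are used.

```perl
use strict;
use warnings;
use XML::Rules;

my $order;    # hypothetical variable that receives the hash built for <order>

my $parser = XML::Rules->new(
    rules => [
        customer => 'content trim',    # trimmed text ends up in $order->{customer}
        item     => 'as array',        # each <item> hash is pushed to $order->{item}
        order    => sub { $order = $_[1]; return },    # keep the finished hash
    ],
);

$parser->parse(<<'XML');
<order id="17">
 <customer>  Jane Doe  </customer>
 <item sku="A1">2</item>
 <item sku="B2">1</item>
</order>
XML

print "Order $order->{id} for $order->{customer}\n";
print "  $_->{_content} x $_->{sku}\n" for @{ $order->{item} };
```

The attributes of <order> itself (here id) sit in the same hash as the values produced by the child rules, which is why the rule for <order> can read everything from $_[1].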

The subroutines in the rules specification receive five parameters:

$rule->( $tag_name, \%attrs, \@context, \@parent_data, $parser)

It's OK to destroy the first two parameters, but you should treat the other three as read only!

$tag_name = string containing the tag name
\%attrs = hash containing the attributes of the tag plus the _content key
	containing the text content of the tag. If it's not a leaf tag it may
	also contain the data returned by the rules invoked for the child tags.
\@context = an array containing the names of the tags enclosing the current
	one. The parent tag name is the last element of the array.
\@parent_data = an array containing the hashes with the attributes
	and content read&produced for the enclosing tags so far.
	You may need to access this for example to find out the version
	of the format specified as an attribute of the root tag. You may
	safely add, change or delete attributes in the hashes, but all bets
	are off if you change the number or type of elements of this array!
$parser = the parser object.
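
A rule that uses the context and the parent data might look like the sketch below. The <feed> document and the @log array are hypothetical, and the code assumes that the first element of \@parent_data is the hash of the root tag, which is how the description above reads.

```perl
use strict;
use warnings;
use XML::Rules;

my @log;    # hypothetical collector

my $parser = XML::Rules->new(
    rules => [
        title => sub {
            my ($tag_name, $attrs, $context, $parent_data, $parser) = @_;
            # names of the enclosing tags plus the current tag
            my $path = join '/', @$context, $tag_name;
            # assumption: element 0 holds the root tag's attributes
            my $version = $parent_data->[0]{version};
            push @log, "$path (format v$version): $attrs->{_content}";
            return;
        },
        'entry,feed' => sub { return },
    ],
);

$parser->parse('<feed version="2"><entry><title>Hello</title></entry></feed>');

print "$_\n" for @log;
```

The rule only reads from $context and $parent_data, in line with the advice to treat them as read only.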

The subroutine may decide to handle the data and return nothing or tweak the data as necessary and return just the relevant bits. It may also load more information from elsewhere based on the ids found in the XML and provide it to the rules of the ancestor tags as if it was part of the XML.

The possible return values of the subroutines are:

1) nothing or undef or "" - nothing gets added to the parent tag's hash

2) a single string - if the parent's _content is a string then the one produced by this rule is appended to the parent's _content. If the parent's _content is an array, then the string is push()ed to the array.

3) a single reference - if the parent's _content is a string then it's changed to an array containing the original string and this reference. If the parent's _content is an array, then the reference is push()ed to the array.

4) an even numbered list - it's a list of key & value pairs to be added to the parent's hash.

The handling of the attributes may be changed by adding '@', '+', '*' or '.' before the attribute name.

Without any "sigil" the key & value is added to the hash overwriting any previous values. The values for the keys starting with '@' are push()ed to the arrays referenced by the key name without the @. If there already is an attribute of the same name then the value will be preserved and will become the first element in the array. The values for the keys starting with '+' are added to the current value, the ones starting with '.' are appended to the current value and the ones starting with '*' are multiplied by the current value.

5) an odd numbered list - the last element is appended or push()ed to the parent's _content, the rest is handled as in the previous case.
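
The "sigils" from case 4 can be tried out with a short sketch like this one. The <cart> document, the attribute names and the $cart variable are invented for the example; the rule for <line> returns an even numbered list using the '@', '+' and '.' prefixes.

```perl
use strict;
use warnings;
use XML::Rules;

my $cart;    # hypothetical variable that receives the hash built for <cart>

my $parser = XML::Rules->new(
    rules => [
        line => sub {
            my ($tag, $attrs) = @_;
            return (
                '@names' => $attrs->{name},      # pushed to the array in {names}
                '+qty'   => $attrs->{qty},       # added to the number in {qty}
                '.skus'  => "$attrs->{sku};",    # appended to the string in {skus}
            );
        },
        cart => sub { $cart = $_[1]; return },
    ],
);

$parser->parse(
    '<cart><line name="pen" qty="2" sku="P1"/><line name="ink" qty="3" sku="I9"/></cart>'
);

print "names: @{ $cart->{names} }\n";
print "qty:   $cart->{qty}\n";
print "skus:  $cart->{skus}\n";
```

Note that the keys seen by the rule for <cart> are the names without the sigil.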

The Start Rules

Apart from the normal rules that get invoked once the tag is fully parsed, including the contents and child tags, you may want to attach some code to the start tag to (optionally) skip whole branches of XML or set up attributes and variables. You may set up the start rules either in a separate parameter to the constructor or in the rules=> by prefixing the tag name(s) with ^.

These rules are in the form

tagspecification => undef / '' / 'skip'	--> skip the element, including child tags
tagspecification => 1 / 'handle'	--> handle the element, may be needed
	if you specify the _default rule.
tagspecification => \&subroutine

The subroutines receive the same parameters as for the (end tag) rules, but their return value is treated differently. If the subroutine returns a false value then the whole branch enclosed by the current tag is skipped, no data are stored and no rules are executed. You may modify the hash referenced by the second parameter (\%attrs).

Both types of rules are free to store any data they want in $parser->{pad}. This property is NOT emptied after the parsing!
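
A start rule that skips branches based on an attribute could be sketched as follows. The <log> document and the @messages array are made up for the example, and the start rule is passed through the separate start_rules parameter.

```perl
use strict;
use warnings;
use XML::Rules;

my @messages;    # hypothetical collector

my $parser = XML::Rules->new(
    start_rules => [
        entry => sub {
            my ($tag, $attrs) = @_;
            # a false value skips the whole <entry> branch
            return $attrs->{level} ne 'debug';
        },
    ],
    rules => [
        msg   => 'content',
        entry => sub { push @messages, "$_[1]->{level}: $_[1]->{msg}"; return },
        log   => sub { return },
    ],
);

$parser->parse(
    '<log>'
    . '<entry level="debug"><msg>noise</msg></entry>'
    . '<entry level="error"><msg>disk full</msg></entry>'
    . '</log>'
);

print "$_\n" for @messages;
```

Because the debug entry is skipped at its start tag, neither the rule for <msg> nor the end rule for <entry> runs for it.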

METHODS

parse

$parser->parse( $string [, $parameters]);
$parser->parse( $IOhandle [, $parameters]);

Parses the XML in the string or reads and parses the XML from the opened IO handle, executes the rules as it encounters the closing tags and returns the resulting structure.

The scalar or reference passed as the second parameter to the parse() method is assigned to $parser->{parameters} for the parsing of the file or string. Once the XML is parsed the key is deleted. This means that the $parser does not retain a reference to the $parameters after the parsing.
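
As an illustration of the second parameter, the sketch below passes a hashref to parse() and reads it back inside a rule through $parser->{parameters}. The <list> document, the prefix key and the @ids array are hypothetical.

```perl
use strict;
use warnings;
use XML::Rules;

my @ids;    # hypothetical collector

my $parser = XML::Rules->new(
    rules => [
        item => sub {
            my ($tag, $attrs, $context, $parent_data, $parser) = @_;
            # the hashref given to parse() is visible for this run only
            push @ids, $parser->{parameters}{prefix} . $attrs->{id};
            return;
        },
        list => sub { return },
    ],
);

$parser->parse( '<list><item id="1"/><item id="2"/></list>', { prefix => 'ID-' } );

print "@ids\n";
print "parameters still set: ", ( defined $parser->{parameters} ? "yes" : "no" ), "\n";
```

This keeps per-run settings out of global variables while letting one parser object be reused with different parameters.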

parsestring

$parser->parsestring( $string [, $parameters]);

Just an alias to ->parse().

parsefile

$parser->parsefile( $filename [, $parameters]);

Opens the specified file and parses the XML and executes the rules as it encounters the closing tags and returns the resulting structure.

filter

$parser->filter( $string, $OutputIOhandle [, $parameters]);
$parser->filter( $InputIOhandle, $OutputIOhandle [, $parameters]);

Parses the XML in the string or reads and parses the XML from the opened IO handle, copies the tags that do not have a subroutine rule specified and do not occur under such a tag, executes the specified rules and prints the results.

The scalar or reference passed as the third parameter to the filter() method is assigned to $parser->{parameters} for the parsing of the file or string. Once the XML is parsed the key is deleted. This means that the $parser does not retain a reference to the $parameters after the parsing.

filterstring

$parser->filterstring( $string, $OutputIOhandle [, $parameters]);

Just an alias to ->filter().

filterfile

$parser->filterfile( $filename, $OutputIOhandle [, $parameters]);

Opens the specified file, parses the XML, copies the tags that do not have a subroutine rule specified and do not occur under such a tag, executes the specified rules and prints the results to $OutputIOhandle.

escape_value

$parser->escape_value( $data [, $numericescape])

This method escapes the $data for inclusion in XML, the $numericescape may be 0, 1 or 2 and controls whether to convert 'high' (non ASCII) characters to XML entities.

0 - default: no numeric escaping (OK if you're writing out UTF8)

1 - only characters above 0xFF are escaped (ie: characters in the 0x80-FF range are not escaped), possibly useful with ISO8859-1 output

2 - all characters above 0x7F are escaped (good for plain ASCII output)

You can also specify the default value in the constructor

my $parser = XML::Rules->new(
	...
	NumericEscape => 2,
);
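
A quick way to see the effect of $numericescape is a sketch like the one below. The sample strings are arbitrary and the trivial rule set is there only to construct the object. The exact form of the numeric entities is not specified above, so the second result is printed without an expected value.

```perl
use strict;
use warnings;
use XML::Rules;

# the rules are irrelevant here, the object is needed only to call the method
my $parser = XML::Rules->new( rules => [ _default => 'content' ] );

# default level 0: only the XML special characters are escaped
my $plain = $parser->escape_value('a < b & c');
print "$plain\n";

# level 2: every character above 0x7F is turned into a numeric entity
my $ascii = $parser->escape_value( "caf\x{e9}", 2 );
print "$ascii\n";
```

Passing the level per call overrides whatever NumericEscape was given to the constructor.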

toXML

$xml = $parser->toXML( $tagname, \%attrs[, $do_not_close])

You may use this method to convert the data structures created by parsing the XML back into the XML format. Not all data structures may be printed! I'll add more docs later; for now, please experiment.

parentsToXML

$xml = $parser->parentsToXML( [$level])

Prints all or only the topmost $level ancestor tags, including the attributes and content (parsed so far), but without the closing tags. You may use this to print the header of the file you are parsing, followed by calling toXML() on a structure you build and then by closeParentsToXML() to close the tags left opened by parentsToXML(). You most likely want to use the style => 'filter' option for the constructor instead.

closeParentsToXML

$xml = $parser->closeParentsToXML( [$level])

Prints the closing tags for all or the topmost $level ancestor tags of the one currently processed.

HOW TO USE

You may view the module either as an XML::Simple on steroids and use it to build a data structure similar to the one produced by XML::Simple, with the added benefit of being able to specify what tags or attributes to ignore, when to take just the content, what to store as an array etc.

Or you could view it as yet another event based XML parser that differs from all the others only in one thing. It stores the data for you so that you do not have to use globals or closures and wonder where to attach the snippet of data you just received onto the structure you are building.

You can use it in a way similar to XML::Twig with simplify(): specify the rules to transform the lower level tags into an XML::Simple-like (simplify()ed) structure and then handle the structure in the rule for the tag(s) you'd specify in XML::Twig's twig_roots.

AUTHOR

Jan Krynicky, <Jenda at CPAN.org>

BUGS

Please report any bugs or feature requests to bug-xml-rules at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=XML-Rules. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc XML::Rules

You can also look for information at:

AnnoCPAN: Annotated CPAN documentation
	http://annocpan.org/dist/XML-Rules

CPAN Ratings
	http://cpanratings.perl.org/d/XML-Rules

RT: CPAN's request tracker
	http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-Rules

Search CPAN
	http://search.cpan.org/dist/XML-Rules

PerlMonks
	Please see http://www.perlmonks.org/?node_id=581313 or http://www.perlmonks.org/?node=XML::Rules for discussion.

ACKNOWLEDGEMENTS

The escape_value() method is taken with minor changes from XML::Simple.

COPYRIGHT & LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
