NAME

XML::Easy::SimpleSchemaUtil - help with simple kinds of XML schema

SYNOPSIS

        use XML::Easy::SimpleSchemaUtil qw(
                xml_s_canonise_chars xml_c_canonise_chars
                xml_c_subelements xml_c_chardata
        );

        $chardata = xml_s_canonise_chars($chardata, \%options);
        $content = xml_c_canonise_chars($content, \%options);
        $subelements = xml_c_subelements($content, $allow_wsp);
        $chars = xml_c_chardata($content);

DESCRIPTION

The rules by which some class of thing is encoded in XML constitute a schema. (A schema does not need to be codified in a formal language such as Schematron: a natural-language specification can also be a schema. Even if there is no explicit specification at all, the behaviour of the interoperating processors of related XML documents constitutes a de facto schema.) Certain kinds of rule are commonly used in all manner of schemata. This module supplies functions that help to implement such common kinds of rule, regardless of how a schema is specified.

This module processes XML data in the form used by XML::Easy, consisting of XML::Easy::Element and XML::Easy::Content objects and twine arrays. In this form, character data are stored fully decoded, so they can be manipulated with no knowledge of XML syntax.

FUNCTIONS

Each function has two names. There is a longer descriptive name, and a shorter name to spare screen space and the programmer's fingers.

xml_s_canonise_chars(STRING, OPTIONS)
xs_charcanon(STRING, OPTIONS)

This function is intended to help in parsing XML data, in situations where the schema states that some aspects of characters are not entirely significant. STRING must be a plain Perl string consisting of character data that is valid for XML. The function examines the characters, processes them as specified in the OPTIONS, and returns a modified version of the string. OPTIONS must be a reference to a hash, in which the permitted keys are:

leading_wsp
intermediate_wsp
trailing_wsp

These keys control the handling of sequences of whitespace characters. They apply, respectively, to whitespace at the beginning of the string, whitespace that is at neither the beginning nor the end, and whitespace at the end of the string. If the entire content of the string is whitespace, it is treated as both leading and trailing.

The whitespace characters, for this purpose, are tab, linefeed/newline, carriage return, and space. This is the same set of characters that are whitespace for the purposes of the XML syntax.

The value for each key may be:

DELETE

Completely remove the whitespace. For situations where the whitespace is of no significance at all. (Common for leading and trailing whitespace, but rare for intermediate whitespace.)

COMPRESS

Replace the whitespace sequence with a single space character. For situations where the presence of whitespace is significant but the length and type are not. (Common for intermediate whitespace.)

PRESERVE (default)

Leave the whitespace unchanged. For situations where the exact type of whitespace is significant.
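The combination described above as common (delete at the ends, compress in the middle) can be sketched as follows. This is only an illustration: it assumes XML::Easy is installed and that the option values are passed as plain strings, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use XML::Easy::SimpleSchemaUtil qw(xml_s_canonise_chars);

# Illustrative input: untidy whitespace around and between two words.
my $raw = "\n\t  hello \r\n  world  \n";

# Strip whitespace at both ends and collapse each inner run to one space.
my $tidy = xml_s_canonise_chars($raw, {
	leading_wsp      => "DELETE",
	intermediate_wsp => "COMPRESS",
	trailing_wsp     => "DELETE",
});

# Following the rules above, this should print "[hello world]".
print "[$tidy]\n";
```

Any key left out of the hash falls back to PRESERVE, so an empty hash reference should return the string unchanged.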

xml_c_canonise_chars(CONTENT, OPTIONS)
xc_charcanon(CONTENT, OPTIONS)

This function is intended to help in parsing XML data, in situations where the schema states that some aspects of characters are not entirely significant. CONTENT must be a reference to either an XML::Easy::Content object or a twine array. The function processes the top-level character content of CONTENT in the same way as "xml_s_canonise_chars", with OPTIONS taking the same form as for that function, and returns the modified content in the same form (content object or twine array) in which the input was supplied.

Any element inside the content chunk acts like a special character that will not be modified. It interrupts any character sequence of interest. Elements are not processed recursively: they are treated as atomic.
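A hedged sketch of the same canonicalisation applied to mixed content follows. It assumes XML::Easy is installed and builds the element with xml_element from XML::Easy::NodeBasics; the twine contents and variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Easy::NodeBasics qw(xml_element);
use XML::Easy::SimpleSchemaUtil qw(xml_c_canonise_chars);

# Illustrative twine array: text, an empty element, then more text.
my $br = xml_element("br");
my $twine = ["  first  line  ", $br, "  second  line  "];

my $canon = xml_c_canonise_chars($twine, {
	leading_wsp      => "DELETE",
	intermediate_wsp => "COMPRESS",
	trailing_wsp     => "DELETE",
});

# A twine array went in, so an array reference comes back, still with
# one element between two strings.
print ref($canon), " of ", scalar(@$canon), " items\n";

# Show the strings on either side of the element.
print "[$canon->[0]] [$canon->[2]]\n";
```

Because the element behaves like a non-whitespace character, the whitespace adjacent to it on either side counts as intermediate rather than leading or trailing.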

xml_c_subelements(CONTENT, ALLOW_WSP)
xc_subelems(CONTENT, ALLOW_WSP)

This function is intended to help in parsing XML data, in situations where the schema calls for an element to contain only subelements, possibly with optional whitespace around and between them.

CONTENT must be a reference to either an XML::Easy::Content object or a twine array. The function checks whether the content includes any unpermitted characters at the top level, and dies if it does. If the content is of permitted form, the function returns a reference to an array listing all the subelements.

ALLOW_WSP is a truth value controlling whether whitespace is permitted around and between the subelements. The characters recognised as whitespace are the same as those for XML syntax. Allowing whitespace in this way is easier (and slightly more efficient) than first filtering it out via "xml_c_canonise_chars". Non-whitespace characters are never permitted.
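As an illustration of both settings of ALLOW_WSP, the sketch below extracts subelements from indented content and then shows the same content being rejected. It assumes XML::Easy is installed; the element names and data are invented, and xml_element and xml_e_type_name come from XML::Easy::NodeBasics.

```perl
use strict;
use warnings;
use XML::Easy::NodeBasics qw(xml_element xml_e_type_name);
use XML::Easy::SimpleSchemaUtil qw(xml_c_subelements);

# Illustrative twine array: two elements laid out with indentation.
my $content = [
	"\n  ", xml_element("name", "Ada"),
	"\n  ", xml_element("year", "1815"),
	"\n",
];

# With ALLOW_WSP true, the whitespace around the elements is tolerated.
my $subs = xml_c_subelements($content, 1);
my $names = join(",", map { xml_e_type_name($_) } @$subs);
print "$names\n";

# With ALLOW_WSP false the same content is not permitted, so the call
# dies; eval traps the exception.
my $strict = eval { xml_c_subelements($content, 0) };
print "rejected without ALLOW_WSP\n" unless defined $strict;
```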

xml_c_chardata(CONTENT)
xc_chars(CONTENT)

This function is intended to help in parsing XML data, in situations where the schema calls for an element to contain only character data. CONTENT must be a reference to either an XML::Easy::Content object or a twine array. The function dies if the content contains any subelements. If the content is of permitted form, the function returns a string containing all the character content.
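A brief sketch of the accepting and rejecting cases, assuming XML::Easy is installed (the sample twine arrays are illustrative only):

```perl
use strict;
use warnings;
use XML::Easy::NodeBasics qw(xml_element);
use XML::Easy::SimpleSchemaUtil qw(xml_c_chardata);

# Content made up purely of character data is returned as a string.
my $text = xml_c_chardata(["42"]);
print "$text\n";

# Content that includes a subelement is not permitted, so the call
# dies; eval traps the exception.
my $mixed = ["4", xml_element("b", "2"), ""];
my $bad = eval { xml_c_chardata($mixed) };
print "rejected mixed content\n" unless defined $bad;
```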

SEE ALSO

XML::Easy::NodeBasics

AUTHOR

Andrew Main (Zefram) <zefram@fysh.org>

COPYRIGHT

Copyright (C) 2010 PhotoBox Ltd

Copyright (C) 2011 Andrew Main (Zefram) <zefram@fysh.org>

LICENSE

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
