NAME

XML::Compile::Schema - Compile a schema into CODE

INHERITANCE

 XML::Compile::Schema
   is a XML::Compile

 XML::Compile::Schema is extended by
   XML::Compile::Cache

SYNOPSIS

 # compile tree yourself
 my $parser = XML::LibXML->new;
 my $tree   = $parser->parse...(...);
 my $schema = XML::Compile::Schema->new($tree);

 # get schema from string
 my $schema = XML::Compile::Schema->new($xml_string);

 # get schema from file
 my $schema = XML::Compile::Schema->new($filename);

 # adding schemas
 $schema->addSchemas($tree);

 # three times the same: well-known url, filename in schemadir, url
 $schema->importDefinitions('http://www.w3.org/2001/XMLSchema');
 $schema->importDefinitions('2001-XMLSchema.xsd');
 $schema->importDefinitions(SCHEMA2001);  # from ::Util

 # alternatively
 my @specs  = ('one.xsd', 'two.xsd', $schema_as_string);
 my $schema = XML::Compile::Schema->new(\@specs); # ARRAY!

 # see what types are defined
 $schema->printIndex;

 # create and use a reader
 use XML::Compile::Util qw/pack_type/;
 my $elem   = pack_type 'my-namespace', 'my-local-name';
                # $elem eq "{my-namespace}my-local-name"
 my $read   = $schema->compile(READER => $elem);
 my $data   = $read->($xmlnode);
 my $data   = $read->("filename.xml");
 
 # when you do not know the element type beforehand
 use XML::Compile::Util qw/type_of_node/;
 my $elem   = type_of_node $xml->documentElement;
 my $reader = $reader_cache{$elem}               # either exists
          ||= $schema->compile(READER => $elem); #   or create
 my $data   = $reader->($xmlmsg);
 
 # create and use a writer
 my $doc    = XML::LibXML::Document->new('1.0', 'UTF-8');
 my $write  = $schema->compile(WRITER => '{myns}mytype');
 my $xml    = $write->($doc, $hash);
 my $result = $doc->setDocumentElement($xml);

 # show result
 print $xml->toString;

 # to create the type nicely
 use XML::Compile::Util qw/pack_type/;
 my $type   = pack_type 'myns', 'mytype';
 print $type;  # shows  {myns}mytype

 # using a compiled routines cache
 use XML::Compile::Cache;   # separate distribution
 my $schema = XML::Compile::Cache->new(...);

 # Error handling tricks with Log::Report
 use Log::Report mode => 'DEBUG';  # enable debugging
 dispatcher SYSLOG => 'syslog';    # errors to syslog as well
 try { $reader->($data) };         # catch errors in $@

DESCRIPTION

This module collects knowledge about one or more schemas. The most important method provided is compile(), which can create XML file readers and writers based on the schema information and some selected element or attribute type.

Various implementations use the translator, and more can be added later:

$schema->compile('READER', ...) translates XML to HASH

The XML reader produces a HASH from a XML::LibXML::Node tree or an XML string. Those represent the input data. The values are checked. An error is produced when a value or the data-structure does not conform to the specs.

The CODE reference which is returned can be called with anything accepted by dataToXML().

example: create an XML reader

 my $msgin  = $rules->compile(READER => '{myns}mytype');
 # or  ...  = $rules->compile(READER => pack_type('myns', 'mytype'));
 my $xml    = $parser->parse_file("some-xml.xml");
 my $hash   = $msgin->($xml);

or

 my $hash   = $msgin->('some-xml.xml');
 my $hash   = $msgin->($xml_string);
 my $hash   = $msgin->($xml_node);

$schema->compile('WRITER', ...) translates HASH to XML

The writer produces schema compliant XML, based on a Perl HASH. To get the data encoding correct, you are required to pass a document object in which the XML nodes may get a place later.

example: create an XML writer

 my $doc    = XML::LibXML::Document->new('1.0', 'UTF-8');
 my $write  = $schema->compile(WRITER => '{myns}mytype');
 my $xml    = $write->($doc, $hash);
 print $xml->toString;
 

alternative

 my $write  = $schema->compile(WRITER => 'myns#myid');

$schema->template('XML', ...) creates an XML example

Based on the schema, this produces an XML message as example. Schemas are usually so complex that people lose the overview. This example may put you back on track, and can be used as a starting point for creating the XML version of the message.

$schema->template('PERL', ...) creates a Perl example

Based on the schema, this produces a Perl HASH structure (a bit like the output of Data::Dumper), which can be used as template for creating messages. The output contains documentation, and is usually much clearer than the schema itself.

Be warned that the schema is not validated; you can develop schemas which do work well with this module, but are not valid according to W3C. In many cases, however, the translator will refuse to accept mistakes, mainly because it cannot produce valid code.

METHODS

Constructors

XML::Compile::Schema->new([XMLDATA], OPTIONS)

    Details about many name-spaces can be organized with only a single schema object (actually, the data is administered in an internal XML::Compile::Schema::NameSpaces object).

    The initial information is extracted from the XMLDATA source. The XMLDATA can be anything which is acceptable to importDefinitions(), which is everything accepted by dataToXML() or an ARRAY of those things.

    You can specify the hooks before you define the schemas the hooks work on: all schema information and all hooks are only used when the readers and writers get compiled.

     Option            --Defined in     --Default
     hook                                 undef
     hooks                                []
     ignore_unused_tags                   <false>
     key_rewrite                          []
     schema_dirs         XML::Compile     undef
     typemap                              {}

    . hook => ARRAY-WITH-HOOKDATA | HOOK

    . hooks => ARRAY-OF-HOOK

    . ignore_unused_tags => BOOLEAN|REGEXP

      (WRITER) Usually, a mistake warning is produced when a user provides a data structure which contains more data than is needed for the XML message which is created; this will show structural problems. However, in some cases, you may want to play tricks with the data-structure and therefore disable this precaution.

      With a REGEXP, you can have more control. Only keys which do match the expression will be ignored silently. Other keys (usually typos and other mistakes) will get reported. See "Typemaps".

    . key_rewrite => HASH|CODE|ARRAY-of-HASH-and-CODE

      Translate XML keys into different Perl keys. See "Key rewrite".

    . schema_dirs => DIRECTORY|ARRAY-OF-DIRECTORIES

    . typemap => HASH

      HASH of Schema type to Perl object or Perl class. See "Typemaps", the serialization of objects.

Accessors

$obj->addHook(HOOKDATA|HOOK|undef)

    HOOKDATA is a LIST of options as key-value pairs, HOOK is a HASH with the same data. undef is ignored. See addHooks() and "Schema hooks" below.

$obj->addHooks(HOOK, [HOOK, ...])

    Add multiple hooks at once. These must all be HASHes. See "Schema hooks" and addHook(). undef values are ignored.

$obj->addKeyRewrite(PREDEF|CODE|HASH, ...)

    Add new rewrite rules to the existing list (initially provided with new(key_rewrite)). The whole list of rewrite rules is returned.

    PREFIXED rules will be applied first. Special care is taken that the prefix will not be called twice. The last added set of rewrite rules will be applied first. See "Key rewrite".

$obj->addSchemaDirs(DIRECTORIES|FILENAME)

XML::Compile::Schema->addSchemaDirs(DIRECTORIES|FILENAME)

$obj->addSchemas(XML, OPTIONS)

    Collect all the schemas defined in the XML data. The XML parameter must be a XML::LibXML node, therefore it is advised to use importDefinitions(), which has a much more flexible way to specify the data.

     Option                --Default
     attribute_form_default  <undef>
     element_form_default    <undef>
     filename                undef
     source                  undef

    . attribute_form_default => 'qualified'|'unqualified'

    . element_form_default => 'qualified'|'unqualified'

      Overrule the default as found in the schema. Many old schemas (like WSDL11 and SOAP11) do not specify the correct default element form in the schema but only in the text.

    . filename => FILENAME

      Explicitly state from which file the data is coming.

    . source => STRING

      An indication where this schema data was found. If you use dataToXML() in LIST context, you get such an indication.

$obj->addTypemap(PAIR)

$obj->addTypemaps(PAIRS)

    Add new XML-Perl type relations. See "Typemaps".

$obj->hooks

    Returns the LIST of defined hooks (as HASHes).

$obj->useSchema(SCHEMA, [SCHEMA])

    Pass a XML::Compile::Schema object, or extensions like XML::Compile::Cache, to be used as definitions as well. First, elements are looked up in the current schema definition object. If not found, the other provided SCHEMA objects are checked in the order in which they were added.

    Searches for definitions do not recurse into schemas which are used by the used schema.

    example: use other Schema

      my $wsdl = XML::Compile::WSDL->new($wsdl);
      my $geo  = Geo::GML->new(version => '3.2.1');
      # both $wsdl and $geo extend XML::Compile::Schema
    
      $wsdl->useSchema($geo);

Compilers

$obj->compile(('READER'|'WRITER'), TYPE, OPTIONS)

    Translate the specified ELEMENT (found in one of the read schemas) into a CODE reference which is able to translate between XML-text and a HASH. When the TYPE is undef, an empty LIST is returned.

    The indicated TYPE is the starting-point for processing in the data-structure, a toplevel element or attribute name. The name must be specified in {url}name format, where the url is the name-space. An alternative is the url#id which refers to an element or type with the specific id attribute value.

    When a READER is created, a CODE reference is returned which needs to be called with XML, as accepted by XML::Compile::dataToXML(). Returned is a nested HASH structure which contains the data contained in the XML. The transformation rules are explained below.

    When a WRITER is created, a CODE reference is returned which needs to be called with an XML::LibXML::Document object and a HASH, and returns a XML::LibXML::Node.

    Most options below are explained in more detail in the manual-page XML::Compile::Translate, which implements the compilation.

     Option                        --Default
     any_attribute                   undef
     any_element                     undef
     attributes_qualified            <undef>
     check_occurs                    <true>
     check_values                    <true>
     default_values                  <depends on backend>
     elements_qualified              <undef>
     hook                            undef
     hooks                           undef
     ignore_facets                   <false>
     ignore_unused_tags              <false>
     include_namespaces              <true>
     interpret_nillable_as_optional  <false>
     key_rewrite                     []
     mixed_elements                  'ATTRIBUTES'
     namespace_reset                 <false>
     output_namespaces               undef
     path                            <expanded name of type>
     permit_href                     <false>
     prefixes                        {}
     sloppy_floats                   <false>
     sloppy_integers                 <false>
     typemap                         {}
     use_default_namespace           <false>
     validation                      <true>

    . any_attribute => CODE|'TAKE_ALL'|'SKIP_ALL'

      [0.89] In general, anyAttribute schema components cannot be handled automatically. If you need to create or process anyAttribute information, then read about wildcards in the DETAILS chapter of the manual-page for the specific back-end. Before release 0.89 this option was named anyAttribute, which will still work.

    . any_element => CODE|'TAKE_ALL'|'SKIP_ALL'

      [0.89] In general, any schema components cannot be handled automatically. If you need to create or process any information, then read about wildcards in the DETAILS chapter of the manual-page for the specific back-end. Before release 0.89 this option was named anyElement, which will still work.

    . attributes_qualified => BOOLEAN

      When defined, this will overrule the attributeFormDefault flags in all schemas. When not qualified, the xml will not produce nor process prefixes on attributes.

    . check_occurs => BOOLEAN

      Whether code will be produced to do bounds checking on elements and blocks which may appear more than once. When checking is turned off and the schema says that maxOccurs is 1, then that element becomes optional. When the schema says that maxOccurs is larger than 1, then the output is still always an ARRAY, but now of unrestricted length.

    . check_values => BOOLEAN

      Whether code will be produced to check that the XML fields contain the expected data format.

      Turning this off will improve the processing speed significantly, but is (of course) much less safe. Do not set it off when you expect data from external sources: validation is a crucial requirement for XML.

    . default_values => 'MINIMAL'|'IGNORE'|'EXTEND'

      How to treat default values as provided by the schema. With IGNORE (the writer default), you will see exactly what is specified in the XML or HASH. EXTEND (the reader default) will show the default and fixed values in the result. MINIMAL removes all fields which are the same as the default setting, which simplifies the result. See "Default Values".

    . elements_qualified => TOP|ALL|NONE|BOOLEAN

      When defined, this will overrule the namespace use on elements in all schemas. When TOP is specified, at least the top-element will be name-space qualified. When ALL or a true value is given, then all elements will be used qualified. When NONE or a false value is given, the XML will not produce or process prefixes on the elements.

      The form attributes will be respected, except on the top element when TOP is specified. Use hooks when you need to fix name-space use in more subtle ways.

      With the element_form_default option of XML::Compile::Schema method importDefinitions, you can correct the name-space behavior of whole schemas.

    . hook => HOOK|ARRAY-OF-HOOKS

      Define one or more processing hooks. See "Schema hooks" below. These hooks are only active for this compiled entity, where addHook() and addHooks() can be used to define hooks which are used for all results of compile(). The hooks specified with the hook or hooks option are run before the global definitions.

    . hooks => HOOK|ARRAY-OF-HOOKS

      Alternative for option hook.

    . ignore_facets => BOOLEAN

      Facets influence the formatting and range of values. This does not come cheap, so can be turned off. It affects the restrictions set for a simpleType. The processing speed will improve, but validation is a crucial requirement for XML: please do not turn this off when the data comes from external sources.

    . ignore_unused_tags => BOOLEAN|REGEXP

    . include_namespaces => BOOLEAN

      Indicates whether the WRITER should include the prefix to namespace translation on the top-level element of the returned tree. If not, you may continue with the same name-space table to combine various XML components into one, and add the namespaces later. No namespace definition can be added when the production rule produces an attribute.

    . interpret_nillable_as_optional => BOOLEAN

      Found in the schema wild-life: people who think that nillable means optional. Not too hard to fix. For the WRITER, you still have to state NIL explicitly, but the elements are not constructed. The READER will output NIL when the nillable elements are missing.

    . key_rewrite => HASH|CODE|ARRAY-of-HASH-and-CODE

    . mixed_elements => CODE|PREDEFINED

      What to do when mixed schema elements are to be processed. Read more in the "DETAILS" section below.

    . namespace_reset => BOOLEAN

      Use the same prefixes in prefixes as with some other compiled piece, but reset the counts to zero first.

    . output_namespaces => HASH|ARRAY-of-PAIRS

      Pre release 0.87 name for the prefixes option. Deprecated.

    . path => STRING

      Prepended to each error report, to indicate the location of the error in the XML-Schema tree.

    . permit_href => BOOLEAN

      When parsing SOAP-RPC encoded messages, the elements may have a href attribute, pointing to an object with id. The READER will return the unparsed, unresolved node when the attribute is detected, and the SOAP-RPC decoder will have to discover and resolve it.

    . prefixes => HASH|ARRAY-of-PAIRS

      Can be used to pre-define prefixes for namespaces (for 'WRITER' or key rewrite) for instance to reserve common abbreviations like soap for external use. Each entry in the hash has as key the namespace uri. The value is a hash which contains uri, prefix, and used fields. Pass a reference to a private hash to catch this index. An ARRAY with prefix, uri PAIRS is simpler.

       prefixes => [ mine => $myns, two => $twons ]
       prefixes => { $myns => 'mine', $twons => 'two' }
      
       # the previous is short for:
       prefixes => { $myns => { uri => $myns, prefix => 'mine', used => 0 }
                   , $twons => { uri => $twons, prefix => 'two', ...} };

    . sloppy_floats => BOOLEAN

      The float types of XML are all quite big, and support NaN, INF, and -INF. Perl's normal floats do not, and therefore Math::BigFloat is used. This, however, is slow. When true, you will crash on any value which is not understood by Perl's default float... but run much faster. See also sloppy_integers.

    . sloppy_integers => BOOLEAN

      The XML integer data-types must support at least 18 digits, which is larger than Perl's 32 bit internal integers. Therefore, the implementation will use Math::BigInt objects to handle them. However, often a simple int type would have sufficed, but the XML designer was lazy. A long is much faster to handle. Set this flag to use int as fast (but imprecise) replacements.

      Be aware that Math::BigInt and Math::BigFloat objects are nearly but not fully transparently mimicking the behavior of Perl's ints and floats. See their respective manual-pages. Especially when you wish for some performance, you should optimize access to these objects to avoid expensive copying, which is exactly the spot where the differences are.

      You can also improve the speed of Math::BigInt by installing Math::BigInt::GMP. Add use Math::BigInt try => 'GMP'; to the top of your main script to get more performance.

    . typemap => HASH

    . use_default_namespace => BOOLEAN

      [0.91] When mixing qualified and unqualified namespaces, then the use of a default namespace can be quite confusing: a name-space without prefix. Therefore, by default, all qualified elements will have an explicit prefix.

    . validation => BOOLEAN

      XML messages must be validated, to lower the chance of abuse. However, of course, it costs performance which is only partially compensated by fewer checks in your code. This flag overrules the check_values, check_occurs, and ignore_facets.

XML::Compile::Schema->dataToXML(NODE|REF-XML-STRING|XML-STRING|FILENAME|FILEHANDLE|KNOWN)

$obj->template('XML'|'PERL', TYPE, OPTIONS)

    Schemas can be horribly complex and unreadable. Therefore, this template method can be called to create an example which demonstrates how data of the specified TYPE as XML or Perl is organized in practice.

    Some OPTIONS are explained in XML::Compile::Translate. There are some extra OPTIONS defined for the final output process.

    The templates produced are not always correct. Please contribute improvements.

     Option              --Default
     attributes_qualified  <undef>
     elements_qualified    <undef>
     include_namespaces    <true>
     indent                " "
     show_comments         ALL

    . attributes_qualified => BOOLEAN

    . elements_qualified => ALL|TOP|NONE|BOOLEAN

    . include_namespaces => BOOLEAN

    . indent => STRING

      The leading indentation string per nesting. Must start with at least one blank.

    . show_comments => STRING|'ALL'|'NONE'

      A comma separated list of tokens, which explain what kind of comments need to be included in the output. The available tokens are: struct, type, occur, facets. A value of ALL will select all available comments. The NONE or empty string will exclude all comments.

Administration

$obj->elements

    List all elements, defined by all schemas sorted alphabetically.

$obj->findSchemaFile(FILENAME)

XML::Compile::Schema->findSchemaFile(FILENAME)

$obj->importDefinitions(XMLDATA, OPTIONS)

    Import (include) the schema information included in the XMLDATA. The XMLDATA must be acceptable for dataToXML(). The resulting node and all the OPTIONS are passed to addSchemas(). The schema node does not need to be the top element: any schema node found in the data will be decoded.

    Returned is a list of XML::Compile::Schema::Instance objects, for each processed schema component.

    If your program imports the same string or file definitions multiple times, it will re-use the schema information from the first import. This removal of duplicates will not work for open files or pre-parsed XML structures.

    As an extension to the handling dataToXML() provides, you can specify an ARRAY of things which are acceptable to dataToXML. This way, you can specify multiple resources at once, each of which will be processed with the same OPTIONS.

     Option --Default
     details  <from XMLDATA>

    . details => HASH

      Overrule the details information about the source of the data.

    example: of use of importDefinitions

      my $schema = XML::Compile::Schema->new;
      $schema->importDefinitions('my-spec.xsd');
    
      my $other = "<schema>...</schema>";  # use 'HERE' documents!
      my @specs = ('my-spec.xsd', 'types.xsd', $other);
      $schema->importDefinitions(\@specs, @options);

$obj->knownNamespace(NAMESPACE|PAIRS)

XML::Compile::Schema->knownNamespace(NAMESPACE|PAIRS)

$obj->namespaces

$obj->printIndex([FILEHANDLE], OPTIONS)

$obj->types

    List all types, defined by all schemas sorted alphabetically.

$obj->walkTree(NODE, CODE)

DETAILS

Comparison

Collecting definitions

When starting an application, you will need to read the schema definitions. This is done by instantiating an object via XML::Compile::Schema::new() or XML::Compile::WSDL11::new(). The WSDL11 object has a schema object internally.

Schemas may contain import and include statements, which specify other resources for definitions. In the idea of the XML design team, those files should be retrieved automatically via an internet connection from the schemaLocation. However, this is a bad concept; in XML::Compile modules you will have to explicitly provide filenames on local disk using importDefinitions() or XML::Compile::WSDL11::addWSDL().

There are various reasons why I, the author of this module, think the dynamic automatic internet imports are a bad idea. First: you do not always have a working internet connection (travelling with a laptop in a train). Your implementation should work the same way under all environmental circumstances! Besides, I do not trust remote files on my system, without inspecting them. Most important: I want to run my regression tests before using a new version of the definitions, so I do not want to have a remote server change the agreements without my knowledge.

So: before you start, you will need to scan (recursively) the initial schema or wsdl file for import and include statements, and collect all these files from their schemaLocation into files on local disk. In your program, call importDefinitions() on all of them -in any order- before you call compile().

Organizing your definitions

One nice feature to help you organize (especially useful when you package your code in a distribution), is to add these lines to the beginning of your code:

  package My::Package;
  XML::Compile->addSchemaDirs(__FILE__);
  XML::Compile->knownNamespace('http://myns' => 'myns.xsd', ...);

Now, if the package file is located at SomeThing/My/Package.pm, the definition of the namespace should be kept in SomeThing/My/Package/xsd/myns.xsd.

Somewhere in your program, you have to load these definitions:

  # absolute or relative path is always possible
  $schema->importDefinitions('SomeThing/My/Package/xsd/myns.xsd');

  # relative search path extended by addSchemaDirs
  $schema->importDefinitions('myns.xsd');

  # knownNamespace improves abstraction
  $schema->importDefinitions('http://myns');

Very probably, the namespace is already in some variable:

  use XML::Compile::Schema;
  use XML::Compile::Util  'pack_type';

  my $myns   = 'http://some-very-long-uri';
  my $schema = XML::Compile::Schema->new($myns);
  my $mytype = pack_type $myns, $myelement;
  my $reader = $schema->compile(READER => $mytype);

Addressing components

Normally, external users can only address elements within a schema, and types are hidden to be used by other schemas only. For this reason, it is permitted to create an element and a type with the same name.

The compiler requires a starting-point. This can either be an element name or an element's id. The format of the element name is {namespace-uri}localname, for instance

 {http://library}book

You may also start with

 http://www.w3.org/2001/XMLSchema#float

as long as this ID refers to a top-level element, not a type.

When you use a schema without targetNamespace (which is bad practice, but sometimes people really do not understand the beneficial aspects of the use of namespaces) then the elements can be addressed as {}name or simply name.
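
The {namespace-uri}localname notation can be pictured in plain Perl. The sketch below uses two helpers, make_type() and split_type(), which are hypothetical and written only for this illustration; in real code, pack_type() and unpack_type() from XML::Compile::Util serve this purpose.

```perl
use strict;
use warnings;

# Hypothetical helper: build "{namespace}local" from its two parts.
# Without a namespace, the plain local name is returned.
sub make_type {
    my ($ns, $local) = @_;
    return defined $ns && length $ns ? "{$ns}$local" : $local;
}

# Hypothetical helper: split a type name back into (namespace, local).
# Both "{}name" and "name" yield an empty namespace.
sub split_type {
    my ($type) = @_;
    return $type =~ m/^\{(.*?)\}(.*)$/s ? ($1, $2) : ('', $type);
}

my $book = make_type('http://library', 'book');
my ($ns, $local) = split_type($book);
print "$book\n";
print "namespace=$ns local=$local\n";

# A schema without targetNamespace: both spellings address the same name.
my ($empty_ns, $name)  = split_type('{}name');
my ($empty_ns2, $name2) = split_type('name');
```

Keeping the namespace and the local name in one string makes the type usable as a HASH key, which is convenient for caches of compiled readers.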

Representing data-structures

The code will do its best to produce a correct translation. For instance, an accidental 1.9999 will be converted into 2 when the schema says that the field is an int. It will also strip superfluous blanks when the data-type permits. Especially watch-out for the Integer types, which produce Math::BigInt objects unless compile(sloppy_integers) is used.
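
Because integer values arrive as Math::BigInt objects by default, code which consumes reader output may need to be aware of them. A small sketch follows; the $data HASH is built by hand to stand in for reader output (an assumption: no schema is compiled here).

```perl
use strict;
use warnings;
use Math::BigInt;

# Hand-built stand-in for what a reader could return for an integer field.
my $data = { count => Math::BigInt->new('12345678901234567890123') };

# The object overloads arithmetic and stringification ...
my $next = $data->{count} + 1;
print "next: $next\n";

# ... but it is still an object, not a plain scalar.
print "is object: ", (ref($data->{count}) ? 'yes' : 'no'), "\n";

# For small values, numify() gives an ordinary Perl number.
my $small = Math::BigInt->new('42');
my $plain = $small->numify;
print "plain: $plain\n";
```

Converting with numify() is only safe when the value is known to fit in a native Perl number.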

Elements can be complex, and themselves contain elements which are complex. In the Perl representation of the data, this will be shown as nested hashes with the same structure as the XML.

You should not take care of character encodings yourself, because XML::LibXML does that for us: you shall not escape characters like "<" yourself.

The schemas define kinds of data types. There are various ways to define them (with restrictions and extensions), but that knowledge is not important for the resulting data structure.

simpleType

A single value. A lot of single value data-types are built-in (see XML::Compile::Schema::BuiltInTypes).

Simple types may have range limiting restrictions (facets), which will be checked by default. Types may also have some white-space behavior, for instance blanks are stripped from integers: before, after, but also inside the number representing string.

Note that some of the reader hooks will alter the single value of these elements into a HASH like used for the complexType/simpleContent (next paragraph), to be able to return some extra collected information.

example: typical simpleType

In XML, it looks like this:

 <test1>42</test1>

In the HASH structure, the data will be represented as

 test1 => 42

With the reader hook after => 'XML_NODE' applied, it will become

 test1 => { _ => 42
          , _XML_NODE => $obj
          }
 
complexType/simpleContent

In this case, the single value container may have attributes. The number of attributes can be endless, and the value is only one. This value has no name, and therefore gets a predefined name _.

example: typical simpleContent example

In XML, this looks like this:

 <test2 question="everything">42</test2>

As a HASH, this looks like

 test2 => { _ => 42
          , question => 'everything'
          }
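
Since a simple value may arrive either as a plain scalar or as a HASH with the value under the key _, consuming code sometimes wants to treat both alike. A minimal sketch follows; text_value() is a hypothetical helper, not part of XML::Compile, and the $data HASH is written by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: return the simple value, whether it is stored
# directly or under the predefined key "_" of a HASH.
sub text_value {
    my ($v) = @_;
    return ref($v) eq 'HASH' ? $v->{_} : $v;
}

# Hand-written stand-in for reader output of the two examples above.
my $data = {
    test1 => 42,
    test2 => { _ => 42, question => 'everything' },
};

print text_value($data->{test1}), "\n";
print text_value($data->{test2}), "\n";
print $data->{test2}{question}, "\n";
```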

complexType and complexType/complexContent

These containers not only have attributes, but also multiple values as content. The complexContent is used to create inheritance structures in the data-type definition. This does not affect the XML data package itself.

example: typical complexType element

The XML could look like:

 <test3 question="everything" by="mouse">
   <answer>42</answer>
   <when>5 billion BC</when>
 </test3>

Represented as HASH, this looks like

 test3 => { question => 'everything'
          , by       => 'mouse'
          , answer   => 42
          , when     => '5 billion BC'
          }

anything by XML NODE

For a WRITER, you may also specify a XML::LibXML::Node anywhere.

 test1 => $doc->createTextNode('42');
 test3 => $doc->createElement('ariba');

This data-structure is used without validation, so you are fully on your own with this one.

Processing

A second factor which determines the data-structure is the element occurrence. Usually, elements have to appear once and exactly once on a certain location in the XML data structure. This order is automatically produced by this module. But elements may appear multiple times.

usual case

The default behavior for an element (in a sequence container) is to appear exactly once. When missing, this is an error.

maxOccurs larger than 1

In this case, the element or particle block can appear multiple times. Multiple values are kept in an ARRAY within the HASH. Non-schema based XML modules do not return a single value as an ARRAY, which makes that code more complicated. But in our case, we know the expected amount beforehand.

When a maxOccurs larger than 1 is specified for an element, an ARRAY of those elements is produced. When it is specified for a block (sequence, choice, all, group), then an ARRAY of HASHes is returned. See the special section about this subject.

An error is produced when the number of elements found is less than minOccurs (defaults to 1) or more than maxOccurs (defaults to 1), unless compile(check_occurs) is false.

example: elements with maxOccurs larger than 1

In the schema:

 <element name="a" type="int" maxOccurs="unbounded" />
 <element name="b" type="int" />

In the XML message:

 <a>12</a><a>13</a><b>14</b>

In the Perl representation:

 a => [12, 13], b => 14

value is NIL

When an element is nillable, that is explicitly represented as a NIL constant string.

use="optional" or minOccurs="0"

The element may be skipped. When found it is a single value.

use="forbidden"

When the element is found, an error is produced.

default="value"

When the XML does not contain the element, the default value is used... but only if this element's container exists. This has no effect on the writer.

fixed="value"

Produce an error when the value is not present or different (after the white-space rules were applied).
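
The effect of the three modes of the compile(default_values) option on a single container can be pictured with a small sketch. This only illustrates the documented behavior; it is not the module's implementation, and apply_defaults() and the %defaults table are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical illustration: %$defaults maps element names to the
# default values which the schema would specify.
sub apply_defaults {
    my ($mode, $found, $defaults) = @_;
    my %result = %$found;

    if ($mode eq 'EXTEND') {
        # add the default for each element which is missing
        for my $key (keys %$defaults) {
            $result{$key} = $defaults->{$key} unless exists $result{$key};
        }
    }
    elsif ($mode eq 'MINIMAL') {
        # drop each value which equals its default
        for my $key (keys %$defaults) {
            delete $result{$key}
                if exists $result{$key} && $result{$key} eq $defaults->{$key};
        }
    }
    # IGNORE: leave the data exactly as found

    return \%result;
}

my %defaults = (color => 'red', size => 10);
my %found    = (color => 'red', name => 'box');

my $extend  = apply_defaults(EXTEND  => \%found, \%defaults);
my $minimal = apply_defaults(MINIMAL => \%found, \%defaults);
my $ignore  = apply_defaults(IGNORE  => \%found, \%defaults);

for my $result ($extend, $minimal, $ignore) {
    print join(', ', map { "$_=$result->{$_}" } sort keys %$result), "\n";
}
```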

Repetitive blocks

Particle blocks come in four shapes: sequence, choice, all, and group (an indirect block). This also affects substitutionGroups.

repetitive sequence, choice, all

In situations like this:

  <element name="example">
    <complexType>
      <sequence>
        <element name="a" type="int" />
        <sequence>
          <element name="b" type="int" />
        </sequence>
        <element name="c" type="int" />
      </sequence>
    </complexType>
  </element>

(yes, schemas are verbose) the data structure is

  <example> <a>1</a> <b>2</b> <c>3</c> </example>

the Perl representation is flattened, into

  example => { a => 1, b => 2, c => 3 }

Ok, this is very simple. However, schemas can use repetition:

  <element name="example">
    <complexType>
      <sequence>
        <element name="a" type="int" />
        <sequence minOccurs="0" maxOccurs="unbounded">
          <element name="b" type="int" />
        </sequence>
        <element name="c" type="int" />
      </sequence>
    </complexType>
  </element>

The XML message may be:

  <example> <a>1</a> <b>2</b> <b>3</b> <b>4</b> <c>5</c> </example>

Now, the Perl representation needs to produce an array of the data in the repeated block. This array needs to have a name, because more of these blocks may appear together in a construct. The name of the block is derived from the type of block and the name of the first element in the block, regardless of whether that element is present in the data or not.

So, our example data is translated into (and vice versa)

  example =>
    { a     => 1
    , seq_b => [ {b => 2}, {b => 3}, {b => 4} ]
    , c     => 5
    }

The following label is used, based on the name of the first element (say xyz) as defined in the schema (not in the actual message):

  seq_xyz    sequence with maxOccurs > 1
  cho_xyz    choice with maxOccurs > 1
  all_xyz    all with maxOccurs > 1

When you have compile(key_rewrite) option PREFIXED, and you have explicitly assigned the prefix xs to the schema namespace (See compile(prefixes)), then those names will respectively be seq_xs_xyz, cho_xs_xyz, all_xs_xyz.

example: always an array with maxOccurs larger than 1

Even when there is only one element found, it will be returned as ARRAY (of one element). Therefore, you can write

 my $data = $reader->($xml);
 foreach my $a ( @{$data->{a}} ) {...}

example: blocks with maxOccurs larger than 1

In the schema:

 <sequence maxOccurs="5">
   <element name="a" type="int" />
   <element name="b" type="int" />
 </sequence>

In the XML message:

 <a>15</a><b>16</b><a>17</a><b>18</b>

In Perl representation:

 seq_a => [ {a => 15, b => 16}, {a => 17, b => 18} ]
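
A minimal sketch of how such a structure can be walked in plain Perl. The $data HASH is typed in by hand to mirror the representation above (an assumption for illustration); in real use it would be returned by a compiled reader.

```perl
use strict;
use warnings;

# Hand-written stand-in for the reader's result for the message above.
my $data = { seq_a => [ { a => 15, b => 16 }, { a => 17, b => 18 } ] };

# Each repetition of the block is one HASH in the ARRAY.
my @sums = map { $_->{a} + $_->{b} } @{ $data->{seq_a} };

print "repetitions: ", scalar(@sums), "\n";
print "sums: @sums\n";
```

Because the block is always delivered as an ARRAY, the same loop works when the message contains only one repetition.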

repetitive groups

[behavioral change in 0.93] In contrast to the normal particle blocks described above, groups do have names. In this case, we do not need to take the name of the first element, but can use the group name. It will still have gr_ prepended, because groups can have the same name as an element or a type(!)

Blocks within the group definition cannot be repeated.

example: groups with maxOccurs larger than 1

 <element name="top">
   <complexType>
     <sequence>
       <group ref="ns:xyz" maxOccurs="unbounded" />
     </sequence>
   </complexType>
 </element>

 <group name="xyz">
   <sequence>
     <element name="a" type="int" />
     <element name="b" type="int" />
   </sequence>
 </group>

translates into

  gr_xyz => [ {a => 42, b => 43}, {a => 44, b => 45} ]

repetitive substitutionGroups

For substitutionGroups which are repeating, the name of the base element is used (the element which has attribute <abstract="true">). We do need this array, because the order of the elements within the group may be important; we cannot group the elements based on the extended element's name.

In an example substitutionGroup, the Perl representation will be something like this:

  base-element-name =>
    [ { extension-name  => $data1 }
    , { other-extension => $data2 }
    ]

Each HASH has only one key.
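
As a sketch, assuming data shaped like the structure above, the ARRAY can be processed in document order by taking the single pair out of each HASH. The element names and values below are made up for the illustration.

```perl
use strict;
use warnings;

# Hand-written example data; names and values are hypothetical.
my $data =
  { 'base-element-name' =>
      [ { 'extension-name'  => 'first'  }
      , { 'other-extension' => 'second' }
      , { 'extension-name'  => 'third'  }
      ]
  };

# Walk the ARRAY in order; each HASH holds exactly one pair.
my @seen;
foreach my $item ( @{ $data->{'base-element-name'} } ) {
    my ($name, $value) = %$item;
    push @seen, "$name=$value";
}

print join(', ', @seen), "\n";
```

Grouping by extension name instead would lose the fact that 'third' came after 'second', which is why the ARRAY of single-key HASHes is used.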

List type

List simpleType objects are also represented as ARRAY, like elements with a minOccurs or maxOccurs not equal to 1.

example: with a list of ints

  <test5>3 8 12</test5>

as Perl structure:

  test5 => [3, 8, 12]
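
The mapping can be pictured in plain Perl as splitting the text content on white-space. This is only a model of the resulting structure, not a claim about how the module parses lists internally.

```perl
use strict;
use warnings;

# The text content of <test5>, as found in the message.
my $text = '3 8 12';

# split ' ' splits on runs of white-space and ignores leading blanks.
my $test5 = [ split ' ', $text ];

print scalar(@$test5), " items: @$test5\n";
```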

substitutionGroup

A substitution group is a kind of choice between alternative (complex) types. However, in this case roles have reversed: instead of a choice which lists the alternatives, here the alternative elements register themselves as valid for an abstract (head) element. All alternatives should be extensions of the head element's type, but there is no way to check that.

example: substitutionGroup

 <xs:element name="price"  type="xs:int" abstract="true" />
 <xs:element name="euro"   type="xs:int" substitutionGroup="price" />
 <xs:element name="dollar" type="xs:int" substitutionGroup="price" />

 <xs:element name="product">
   <xs:complexType>
     <xs:sequence>
       <xs:element name="name" type="xs:string" />
       <xs:element ref="price" />
     </xs:sequence>
   </xs:complexType>
 </xs:element>
 

Now, valid XML data is

 <product>
   <name>Ball</name>
   <euro>12</euro>
 </product>

and

 <product>
   <name>Ball</name>
   <dollar>6</dollar>
 </product>

The HASH representation is respectively

 product => {name => 'Ball', euro  => 12}
 product => {name => 'Ball', dollar => 6}

Wildcards

The any and anyAttribute elements are referred to as wildcards: they specify groups of elements and attributes which can be used, instead of being explicit.

The author of this module advises against the use of wildcards in schemas, because the purpose of schemas is to be explicit about the structure of the message, and that basic idea is simply thrown away by these wildcards. Let people cleanly extend the schema with inheritance! If you use a standard schema which facilitates these wildcards, then please do not use them!

Because wildcards are not explicit about the types to expect, the XML::Compile module cannot prepare for them automatically. However, as a user of the schema you probably know better about the possible contents of these fields. Therefore, you can translate that knowledge into code explicitly. Read about the processing of wildcards in the manual page for each of the back-ends, because it is different in each case.

ComplexType with "mixed" attribute

[largely improved in 0.86, reader only] ComplexType and ComplexContent components can be declared with the <mixed="true"> attribute. This implies that text is not limited to the content of containers, but may also be used in between elements. Usually, you will only find ignorable white-space between elements.

In this example, the a container is marked to be mixed: <a> before <b>2</b> after </a>

Each back-end has its own way of handling mixed elements. The compile(mixed_elements) currently only modifies the reader's behavior; the writer's capabilities are limited. See XML::Compile::Translate::Reader.

Schema hooks

You can use hooks, for instance, to block processing parts of the message, to create work-arounds for schema bugs, or to extract more information during the process than done by default.

defining hooks

Multiple hooks can be active during the compilation process of a type, when compile() is called. During Schema translation, each of the hooks is checked for all types which are processed. When multiple hooks select the object to get a modified behavior, then all are evaluated in order of definition.

Defining a global hook (where HOOKDATA is the LIST of PAIRS with hook parameters, and HOOK a HASH with such HOOKDATA):

 my $schema = XML::Compile::Schema->new
  ( ...
  , hook  => HOOK
  , hooks => [ HOOK, HOOK ]
  );

 $schema->addHook(HOOKDATA | HOOK);
 $schema->addHooks(HOOK, HOOK, ...);

 my $wsdl   = XML::Compile::WSDL->new(...);
 $wsdl->schemas->addHook(HOOKDATA | HOOK);

Local hooks are only used for one reader or writer. They are evaluated before the global hooks.

 my $reader = $schema->compile(READER => $type
  , hook => HOOK, hooks => [ HOOK, HOOK, ...]);

example: of HOOKs:

 my $hook = { type    => '{my_ns}my_type'
            , before  => sub { ... }
            };

 my $hook = { path    => qr/\(volume\)/
            , replace => 'SKIP'
            };

 # path contains "volume" or id is 'aap' or id is 'noot'
 my $hook = { path    => qr/\bvolume\b/
            , id      => [ 'aap', 'noot' ]
            , before  => [ sub {...}, sub { ... } ]
            , after   => sub { ... }
            };

general syntax

Each hook has two kinds of parameters: selectors and processors. Selectors define the schema component of which the processing is modified. When one of the selectors matches, the processing information for the hook is used. When no selector is specified, then the hook will be used on all elements.

Available selectors (see below for details on each of them):

. type
. id
. path

As argument, you can specify one element as STRING, a regular expression to select multiple elements, or an ARRAY of STRINGs and REGEXes.

Next to where the hook is placed, we need to know what to do in that case: the hook contains processing information. When more than one hook matches, then all of these processors are called in order of hook definition. However, first the compile hooks are taken, and then the global hooks.

How the processing works exactly depends on the compiler back-end. There are major differences. Each of those manual-pages lists the specifics. The label tells us when the processing is initiated. Available labels are before, replace, and after.

hooks on matching types

The type selector specifies a complexType or simpleType by name. Best is to base the selection on the full name, like {ns}type, which will avoid all kinds of name-space conflicts in the future. However, you may also specify only the type (in any name-space). Any REGEX will be matched to the full type name. Be careful with the pattern anchors.

If you use XML::Compile::Cache [release 0.90], then you can use prefix:type as type specification as well. You have to explicitly define prefix to namespace beforehand.

example: use of the type selector

 type => 'int'
 type => '{http://www.w3.org/2000/10/XMLSchema}int'
 type => qr/\}xml_/   # types which start with xml_
 type => [ qw/int float/ ];

 use XML::Compile::Util qw/pack_type SCHEMA2000/;
 type => pack_type(SCHEMA2000, 'int')
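
The matching rules described above can be modelled in a few lines of plain Perl. This is a sketch of the rules as this section states them, not the module's own code; the helper name type_selected is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: does $selector select the type $full,
# where $full is written as '{name-space}local'?
sub type_selected {
    my ($selector, $full) = @_;

    # ARRAY: selected when any of the members selects the type
    if (ref $selector eq 'ARRAY') {
        my $hits = grep { type_selected($_, $full) } @$selector;
        return $hits ? 1 : 0;
    }

    # REGEX: matched against the full type name
    return $full =~ $selector ? 1 : 0
        if ref $selector eq 'Regexp';

    # STRING: either the full name, or the local name in any name-space
    (my $local = $full) =~ s/^\{[^}]*\}//;
    return ($selector eq $full || $selector eq $local) ? 1 : 0;
}

my $int = '{http://www.w3.org/2001/XMLSchema}int';
print "by local name: ", type_selected('int', $int), "\n";
print "by regex:      ", type_selected(qr/\}xml_/, '{ns}xml_thing'), "\n";
```

Note that the regex is tested against the full name, which is why the example pattern starts with the closing curly brace of the name-space part.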

example: type hook with XML::Compile::Cache

 use XML::Compile::Util qw/SCHEMA2001/;
 my $schemas = XML::Compile::Cache->new(...);
 $schemas->prefixes(xsd => SCHEMA2001, mine => 'http://somens');
 $schemas->addHook(type => 'xsd:int', ...);
 $schemas->addHook(type => 'mine:sometype', ...);

hooks on matching ids

Matching based on IDs can reach more schema elements: some types are anonymous but still have an ID. Best is to base selection on the full ID name, like ns#id, to avoid all kinds of name-space conflicts in the future.

example: use of the ID selector

 # default schema types have id's with same name
 id => 'ABC'
 id => 'http://www.w3.org/2001/XMLSchema#int'
 id => qr/\#xml_/   # ids which start with xml_
 id => [ qw/ABC fgh/ ];

 use XML::Compile::Util qw/pack_id SCHEMA2001/;
 id => pack_id(SCHEMA2001, 'ABC')

hooks on matching paths

When you see error messages, you always see some representation of the path where the problem was discovered. You can use this path as selector, when you know what it is... BE WARNED, that the current structure of the path is not really consistent, hence will be improved in one of the future releases, breaking backwards compatibility.

Typemaps

Often, XML will be used in object oriented programs, where the facts which are transported in the XML message are attributes of Perl objects. Of course, you can always collect the data from each of the Objects into the required (huge) HASH manually, before triggering the reader or writer. As alternative, you can connect types in the XML schema with Perl objects and classes, which results in cleaner code.

You can also specify typemaps with new(typemap), addTypemaps(), and compile(typemap). Each type will only refer to the last map for that type. When an undef is given for a type, then the older definition will be cancelled. Examples of the three ways to specify typemaps:

  my %map = ($x1 => $p1, $x2 => $p2);
  my $schema = XML::Compile::Schema->new(...., typemap => \%map);

  $schema->addTypemaps($x3 => $p3, $x4 => $p4, $x1 => undef);

  my $call = $schema->compile(READER => $type, typemap => \%map);

The latter only has effect for the type being compiled. The definitions are cumulative. In the second example, the $x1 gets disabled.

Objects can come in two shapes: either they do support the connection with XML::Compile (implementing two methods with predefined names), or they don't, in which case you will need to write a little wrapper.

  use XML::Compile::Util qw/pack_type/;
  my $t1 = pack_type $myns, $mylocal;
  $schema->typemap($t1 => 'My::Perl::Class');
  $schema->typemap($t1 => $some_object);
  $schema->typemap($t1 => sub { ... });

The implementation of the READER and WRITER differs. In the READER case, the typemap is implemented as an 'after' hook which calls a fromXML method. The WRITER is a 'before' hook which calls a toXML method. See respectively the XML::Compile::Translate::Reader and XML::Compile::Translate::Writer.
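
A minimal sketch of a class which supports both directions. The class My::Point, the type name and the exact argument lists are assumptions made for this illustration; check the Reader and Writer manual pages for the parameters which are really passed. The methods are called by hand here, so the sketch runs without a schema.

```perl
use strict;
use warnings;

package My::Point;     # hypothetical class

# READER side: gets the decoded HASH, returns an object.
# Assumed parameters: class name, data, type name.
sub fromXML {
    my ($class, $data, $type) = @_;
    bless { x => $data->{x}, y => $data->{y} }, $class;
}

# WRITER side: returns a plain HASH which the writer can encode.
# Assumed parameters: object, type name, document.
sub toXML {
    my ($self, $type, $doc) = @_;
    +{ x => $self->{x}, y => $self->{y} };
}

package main;

my $type  = '{http://example.com/ns}point';   # hypothetical type name
my $point = My::Point->fromXML({ x => 3, y => 4 }, $type);
my $hash  = $point->toXML($type, undef);

print ref($point), " x=$hash->{x} y=$hash->{y}\n";
```

With a real schema, the class would be connected with something like addTypemaps($type => 'My::Point'), after which readers return My::Point objects for that type.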

Private variables in objects

When you design a new object, it is possible to store the information exactly like the corresponding XML type definition. The only thing the fromXML has to do, is bless the data-structure into its class:

  $schema->typemap($xmltype => 'My::Perl::Class');
  package My::Perl::Class;
  sub fromXML { bless $_[1], $_[0] } # for READER
  sub toXML   { $_[0] }              # for WRITER

However... the object may also need some private variables. If you store them in the same HASH for your object, you will get "unused tags" warnings from the writer. To avoid that, choose one of the following alternatives:

  # never complain about unused tags
  ::Schema->new(..., ignore_unused_tags => 1);

  # only complain about unused tags not matching regexp
  my $not_for_xml = qr/^[A-Z]/;  # my XML only has lower-case
  ::Schema->new(..., ignore_unused_tags => $not_for_xml);

  # only for one compiled WRITER (not used with READER)
  ::Schema->compile(..., ignore_unused_tags => 1);
  ::Schema->compile(..., ignore_unused_tags => $not_for_xml);

Typemap limitations

There are some things you need to know:

.

Many schemas define very complex types. These may often not translate cleanly into objects. You may need to create a typemap relation for some parent type. The CODE reference may be very useful in this case.

.

The same kind of problem appears when you have a list in your object, which often is not named in the schema.

Key rewrite

[release 0.87, improved 1.01] The standard practice is to use the localName of the XML elements as key in the Perl HASH; the key rewrite mechanism is used to change that, sometimes to separate elements which have the same localName within different name-spaces, in other cases just for fun or convenience.

Rewrite rules are interpreted at "compile-time", which means that they do not slow down the XML construction or deconstruction. The rules work the same for readers and writers, because they are applied to names found in the schema.

Key rewrite rules can be set during schema object initiation with new(key_rewrite) and to an existing schema object with addKeyRewrite(). These rules will be used in all calls to compile().

Next, you can use compile(key_rewrite) to add rules which are only used for a single compilation. These are applied before the global rules. All rules will always be attempted, and each rule will be applied to the result of the previous change.

The last defined rewrite rules will be applied first, with one major exception: the PREFIXED rules will be executed before any other rule.

rewrite via table

When a HASH is provided as rule, then the XML element name is looked-up. If found, the value is used as translated key.

First the full name of the element is tried, and then the localName of the element. The full name can be created with XML::Compile::Util::pack_type() or by hand:

  use XML::Compile::Util qw/pack_type/;

  my %table =
    ( pack_type($myns, 'el1') => 'nice_name1'
    , "{$myns}el2" => 'alsoNice'
    , el3          => 'in any namespace'
    );
  $schema->addKeyRewrite( \%table );
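
The lookup order can be modelled in plain Perl as follows. The helper rewrite_key and the name-space are hypothetical; the sketch only mirrors the order described above (full name first, then localName) and is not the module's implementation.

```perl
use strict;
use warnings;

my $myns  = 'http://example.com/ns';   # hypothetical name-space
my %table =
  ( "{$myns}el1" => 'nice_name1'
  , el3          => 'in_any_namespace'
  );

# Hypothetical helper: try the full name, then the localName,
# and keep the localName when the table has no entry.
sub rewrite_key {
    my ($table, $ns, $local) = @_;
    my $full = "{$ns}$local";

    return $table->{$full}  if exists $table->{$full};
    return $table->{$local} if exists $table->{$local};
    $local;
}

print rewrite_key(\%table, $myns, 'el1'), "\n";
print rewrite_key(\%table, 'urn:other', 'el3'), "\n";
print rewrite_key(\%table, 'urn:other', 'el1'), "\n";
```

The last call shows why the full name is the safer key: el1 from another name-space is left untouched.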

rewrite via function

When a CODE reference is provided, it will get called for each key which is found in the schema. Passed are the name-space of the element and its local-name. Returned is the key, which may be the local-name or something else.

For instance, some people use capitals in element names and personally I do not like them:

  sub dont_like_capitals($$)
  {   my ($ns, $local) = @_;
      lc $local;
  }
  $schema->addKeyRewrite( \&dont_like_capitals );

for short:

  my $schema = XML::Compile::Schema->new( ..., 
      key_rewrite => sub { lc $_[1] } );

rewrite when localName collides

Let's start with an apology: we cannot auto-detect when these rewrite rules are needed, because the colliding keys are within the same HASH, but the processing is fragmented over various (sequence) blocks: the parser does not have the overview on which keys of the HASH are used for which elements.

The problem occurs when one complex type or substitutionGroup contains multiple elements with the same localName, but from different name-spaces. In the perl representation of the data, the name-spaces get ignored (to make the programmer's life simple) but that may cause these nasty conflicts.

rewrite for convenience

In XML, we often see names like my-elem-name, which in Perl would be accessed as

  $h->{'my-elem-name'}

In this case, you cannot leave out the quotes in your Perl code, which is quite inconvenient, because only 'barewords' can be used as keys unquoted. When you use option key_rewrite for compile() or new(), you could decide to map dashes onto underscores.

  key_rewrite
     => sub { my ($ns, $local) = @_; $local =~ s/\-/_/g; $local }

  key_rewrite => sub { $_[1] =~ s/\-/_/g; $_[1] }

then my-elem-name in XML will get mapped onto my_elem_name in Perl, both in the READER and the WRITER. Be warned that the substitute command returns the success, not the modified value!
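
The warning about the substitute command can be demonstrated with plain Perl. The three subs below are illustrations only; the last one uses the /r modifier, which is available since perl 5.14.

```perl
use strict;
use warnings;

# Returns the number of replacements, not the new key.
my $wrong = sub { my ($ns, $local) = @_; $local =~ s/-/_/g };

# Returns the modified copy explicitly.
my $right = sub { my ($ns, $local) = @_; $local =~ s/-/_/g; $local };

# With /r the substitution returns the modified copy (perl 5.14+).
my $short = sub { $_[1] =~ s/-/_/gr };

my $n = $wrong->('', 'my-elem-name');
my $k = $right->('', 'my-elem-name');
my $s = $short->('', 'my-elem-name');

print "wrong: $n\nright: $k\nshort: $s\n";
```

The first sub would rewrite every key which contains dashes into a number, which is rarely what you want.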

pre-defined rewrite rules

UNDERSCORES

Replace dashes (-) with underscores (_).

SIMPLIFIED

Rewrite rule with the constant name (STRING) SIMPLIFIED will replace all dashes with underscores, translate capitals into lowercase, and remove all other characters which are non-bareword (if possible, I am too lazy to check)
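
As a rough model of the three steps named above, a hand-written rule could look like the sketch below. The function simplified is hypothetical and only approximates the description; the built-in rule may treat edge cases (such as names which start with a digit) differently.

```perl
use strict;
use warnings;

# Hypothetical rewrite function, called with name-space and localName.
sub simplified {
    my ($ns, $local) = @_;
    (my $key = lc $local) =~ tr/-/_/;   # lowercase, dashes to underscores
    $key =~ s/\W//g;                    # drop all non-word characters
    $key;
}

print simplified('', 'My-Elem.Name'), "\n";
```

Such a function can be passed as a CODE reference to key_rewrite, as shown in the section about rewriting via a function.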

PREFIXED

This requires a table for prefix to name-space translations, via compile(prefixes), which defines at least one non-empty (default) prefix. The keys which represent elements in any name-space which has a prefix defined will have that prefix and an underscore prepended.

Be warned that the name-spaces which you provide are used, not the ones used in the schema. Example:

  my $r = $schema->compile
    ( READER => $type
    , prefixes    => [ mine => $myns ]
    , key_rewrite => 'PREFIXED'
    );

  my $xml = $r->( <<__XML );
<data xmlns="$myns"><x>42</x></data>
__XML

  print join ' => ', %$xml;    #   mine_x => 42

PREFIXED(...)

Like the previous, but now only use a selected sub-set of the available prefixes. This is particularly useful in writers, when explicit prefixes are also used to beautify the output.

The prefixes are not checked against the prefix list, and may have surrounding blanks.

  key_rewrite => 'PREFIXED(opt,sar)'

Above is equivalent to:

  key_rewrite => [ 'PREFIXED(opt)', 'PREFIXED(sar)' ]

Special care is taken that the prefix will not be added twice. For instance, if the same prefix appears twice, or a PREFIXED rule is provided as well, then still only one prefix is added.

Default Values

[added in v0.91] With compile(default_values) you can control how much information about default values defined by the schema will be passed into your program.

The choices, available for both READER and WRITER, are:

IGNORE (the WRITER's standard behavior)

Only include element and attribute values in the result if they are in the XML message. Behaviorally, this treats elements with default values as if they are just optional. The WRITER does not try to be smarter than you.

EXTEND (the READER's standard behavior)

If some element or attribute is not in the source but has a default in the schema, that value will be produced. This is very convenient for the READER, because your application does not have to hard-code the same constant values as defaults as well.

MINIMAL

Only produce the values which differ from the defaults. This choice is useful when producing XML, to reduce the size of the output.

example: use of default_values EXTEND

Let us process a schema using the schema schema. A schema file can contain lines like this:

 <element minOccurs="0" ref="myelem"/>

In mode EXTEND (the READER default), this gets translated into:

 element => { ref => 'myelem', maxOccurs => 1
            , minOccurs => 0, nillable => 0 };

With EXTEND in the READER, all schema information is used to provide a complete overview of available information. Your code does not need to check whether the attributes were available or not: attributes with defaults or fixed values are automatically added.

Again mode EXTEND, now for the writer:

 element => { ref => 'myelem', minOccurs => 0 };
 <element minOccurs="0" maxOccurs="1" ref="myelem" nillable="0"/>

example: use of default_values IGNORE

With option default_values set to IGNORE (the WRITER default), you would get

 element => { ref => 'myelem', maxOccurs => 1, minOccurs => 0 }
 <element minOccurs="0" maxOccurs="1" ref="myelem"/>

The same in both translation directions. The nillable attribute is not used, so will not be shown by the READER. The writer does not try to be smart, so does not add the nillable default.

example: use of default_values MINIMAL

With option default_values set to MINIMAL, the READER would do this:

 <element minOccurs="0" maxOccurs="1" ref="myelem"/>
 element => { ref => 'myelem', minOccurs => 0 }

The maxOccurs default is "1", so will not be included, minimizing the size of the HASH.

For the WRITER:

 element => { ref => 'myelem', minOccurs => 0, nillable => 0 }
 <element minOccurs="0" ref="myelem"/>

Because the default value for nillable is '0', it will not show as attribute value.
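
The effect of EXTEND and MINIMAL on a HASH can be modelled in plain Perl. The helpers extend and minimal and the %defaults table are hypothetical; they only reproduce the reader examples above and say nothing about how the module implements these modes.

```perl
use strict;
use warnings;

# Defaults as the schema would declare them for this element (assumed).
my %defaults = ( maxOccurs => 1, minOccurs => 1, nillable => 0 );

# EXTEND: add every default which is missing from the data.
sub extend {
    my ($data, $defaults) = @_;
    +{ %$defaults, %$data };
}

# MINIMAL: drop every value which equals its default.
sub minimal {
    my ($data, $defaults) = @_;
    my %out = %$data;
    foreach my $key (keys %out) {
        delete $out{$key}
            if exists $defaults->{$key} && $out{$key} eq $defaults->{$key};
    }
    \%out;
}

my $e = extend ({ ref => 'myelem', minOccurs => 0 }, \%defaults);
my $m = minimal({ ref => 'myelem', minOccurs => 0, maxOccurs => 1 }, \%defaults);

print join(', ', map "$_ => $e->{$_}", sort keys %$e), "\n";
print join(', ', map "$_ => $m->{$_}", sort keys %$m), "\n";
```

IGNORE corresponds to using the data as is, without calling either helper.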

DIAGNOSTICS

Error: cannot find pre-installed name-space files

    Use $ENV{SCHEMA_LOCATION} or new(schema_dirs) to express location of installed name-space files, which came with the XML::Compile distribution package.

Error: don't known how to interpret XML data

SEE ALSO

This module is part of XML-Compile distribution version 1.02, built on February 12, 2009. Website: http://perl.overmeer.net/xml-compile/

All modules in this suite: XML::Compile, XML::Compile::SOAP, XML::Compile::SOAP12, XML::Compile::SOAP::Daemon, XML::Compile::Tester, XML::Compile::Cache, XML::Compile::Dumper, XML::Rewrite, and XML::LibXML::Simple.

Please post questions or ideas to the mailinglist at http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/xml-compile . For live contact with other developers, visit the #xml-compile channel on irc.perl.org.

LICENSE

Copyrights 2006-2009 by Mark Overmeer. For other contributors see ChangeLog.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://www.perl.com/perl/misc/Artistic.html