NAME

Treex::PML::Instance - Perl extension for loading/saving PML data

SYNOPSIS

   use Treex::PML::Instance;

   Treex::PML::AddResourcePath( "$ENV{HOME}/my_pml_schemas" );

   my $pml = Treex::PML::Instance->load({ filename => 'foo.xml' });

   my $schema = $pml->get_schema;
   my $data   = $pml->get_root;

   $pml->save();

DESCRIPTION

This class provides a simple implementation of a PML instance.

EXPORT

None by default.

The following export tags are available:

:constants

Imports the following constants:

LM

name of the "<LM>" (list-member) tag

AM

name of the "<AM>" (alt-member) tag

PML_NS

XML namespace URI for PML instances

PML_SCHEMA_NS

XML namespace URI for PML schemas

SUPPORTED_PML_VERSIONS

space-separated list of supported PML-schema version numbers

:diagnostics

Imports internal _die, _warn, and _debug diagnostics commands.

CONFIGURATION

The option 'config' of the methods load() and save() can provide a parsed configuration file. The configuration file is a PML instance whose PML schema is defined in the file pmlbackend_conf_schema.xml distributed with Treex::PML in Treex/PML/Backend/pmlbackend_conf_schema.xml.

This file can set defaults for some options of load() and save() and it can also define rules for pre-processing the input documents before parsing them as PML and for post-processing the output documents after serializing them as PML. Currently only XSLT 1.0, Perl and external-command pre-processing and XSLT 1.0 post-processing are implemented.

The PMLTransform backend, when initialized (e.g. by calling AddBackend('PMLTransform')), automatically loads the first configuration file named pmlbackend_conf.xml that it finds in the Treex::PML resource paths. It also searches the resource paths for all configuration files named pmlbackend_conf.inc and merges their transformation rules into the in-memory image of the main configuration file. PMLTransform then uses the resulting configuration for all load/save operations.

IMPORTANT NOTE: it is recommended to add the PMLTransform backend as the last I/O backend since its test() method automatically accepts any XML file (with the prospect of attempting to transform it during the read() phase)! So it must be added into the I/O backends list after all other backends working with XML-based formats.

Here is an example of a configuration file (see the schema for more details).

    <?xml version="1.0" encoding="utf-8"?>
    <pmlbackend xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
      <head>
        <schema href="pmlbackend_conf_schema.xml"/>
      </head>
      <options>
        <load>
          <validate_cdata>1</validate_cdata>
          <use_resources>1</use_resources>
        </load>
        <save>
          <indent>4</indent>
          <validate_cdata>1</validate_cdata>
          <write_single_LM>1</write_single_LM>
        </save>
      </options>
      <transform_map>
        <transform id="alpino" test="alpino_ds[@version='1.1' or @version='1.2']">
          <in type="xslt" href="alpino2pml.xsl"/>
          <out type="xslt" href="pml2alpino.xsl"/>
        </transform>
        <transform id="sdata" root="sdata" ns="http://ufal.mff.cuni.cz/pdt/pml/">
          <in type="perl" command="require SDataMerge; return SDataMerge::transform(@_);"/>
        </transform>
        <transform id="tei" test="*[namespace-uri()='http://www.tei-c.org/ns/1.0']">
          <in type="pipe" command="tei2pml.sh">
            <param name="--stdin" />
            <param name="--stdout" />
          </in>
        </transform>
      </transform_map>
    </pmlbackend>

METHODS

Treex::PML::Instance->new ()

NOTE: Don't call this constructor directly, use Treex::PML::Factory->createPMLInstance() instead!

Create a new empty PML instance object.

Treex::PML::Instance->load (\%opts)
$pml->load (\%opts)

NOTE: Don't call this method as a constructor directly, use Treex::PML::Factory->createPMLInstance() instead!

Read a PML instance from a file, filehandle, string, or DOM. This method may be used either on an existing object (in which case it operates on and returns this object) or as a constructor (in which case it creates a new Treex::PML::Instance object and returns it). Possible options are:

  {
    filename => $filename,   # and/or
    fh => \*FH,              # or
    string => $xml_string,   # or
    dom => $document,        # (XML::LibXML::Document)

    config => $cfg_pml,      # (Treex::PML::Instance)

    parser_options => \%opt, # (XML::LibXML parser options)
    no_trees => $bool,
    no_references => $bool,
    no_knit => $bool,
    selected_references => { name => $bool, ... },
    selected_knits => { name => $bool, ... }
  }

where filename may be used either by itself or in combination with any of fh, string, or dom, which are otherwise mutually exclusive. The config option may be used to pass a Treex::PML::Instance with the parsed PML backend configuration file (see "CONFIGURATION"). The parser_options option may be used to pass a HASH reference containing options for the XML::LibXML parser (depending on implementation, these will be used to configure either an XML::LibXML::Reader or an XML::LibXML::Parser). If no_trees is true, then the roles #TREES, #NODE and #CHILDNODES are ignored.

The option selected_references determines which reffiles (with a non-empty readas attribute) to read: if the value for a given name is true, the reffile is read; if it is false, the reffile is never read; if no value is given for some reffile, it is read unless the no_references flag is on.

The options selected_knits and no_knit determine from which reffiles data can be copied into this document following the rules for the role #KNIT. Their meaning is just like that of selected_references and no_references. Moreover, no_references implies no_knit, unless no_knit is explicitly specified.
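The interplay of these flags can be summarized in a short sketch. The two helper functions below are hypothetical (they are not part of Treex::PML) and only model the rules as stated above, assuming that "a value is not given" means the name is absent from the hash; 'adata' and 'other' are made-up reffile names.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Treex::PML): should the reffile
# called $name be read, given the options passed to load()?
sub should_read_reference {
    my ($opts, $name) = @_;
    my $selected = $opts->{selected_references} || {};
    # An explicit per-name value always wins.
    return $selected->{$name} ? 1 : 0 if exists $selected->{$name};
    # Otherwise the reffile is read unless no_references is on.
    return $opts->{no_references} ? 0 : 1;
}

# Hypothetical helper: may data from the reffile $name be knitted in?
sub should_knit {
    my ($opts, $name) = @_;
    my $selected = $opts->{selected_knits} || {};
    return $selected->{$name} ? 1 : 0 if exists $selected->{$name};
    # no_references implies no_knit unless no_knit is given explicitly.
    my $no_knit = exists $opts->{no_knit} ? $opts->{no_knit}
                                          : $opts->{no_references};
    return $no_knit ? 0 : 1;
}

my %opts = (no_references => 1, selected_references => { adata => 1 });
printf "read adata: %d, read other: %d, knit adata: %d\n",
    should_read_reference(\%opts, 'adata'),
    should_read_reference(\%opts, 'other'),
    should_knit(\%opts, 'adata');
```

With these options only the explicitly selected reffile is read, and nothing is knitted, because no_references is on and no_knit was not given.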

$pml->get_status ()

Returns 1 if the last load() was successful.

$pml->save (\%opts)

Save PML instance to a file or file-handle. Possible options are: filename, fh, config, refs_save, write_single_LM. If both filename and fh are specified, fh is used, but the filename associated with the Treex::PML::Instance object is changed to filename. If neither is given, the filename currently associated with the Treex::PML::Instance object is used. The config option may be used to pass a Treex::PML::Instance representing the parsed PML backend configuration file (see "CONFIGURATION"). The refs_save option may be used to specify which reference files should be saved along with the Treex::PML::Instance and where to. The value of refs_save, if given, should be a HASH reference mapping reference IDs to the target URLs (filenames). If refs_save is given, only those references listed in the HASH are saved along with the Treex::PML::Instance. If refs_save is undefined or not given, all references are saved (to their original locations). In both cases, only files declared as readas='dom' or readas='pml' can be saved.
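The refs_save rules can be illustrated with a small model. This is only a sketch of the behaviour described above, not the library's code: plan_ref_saves is a hypothetical helper, and the sample records are made-up data in the shape documented for get_reffiles() later in this document (keys id, name, readas, href).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Treex::PML): given a list of reference
# records and an optional refs_save hash, work out which references would
# be written and where.
sub plan_ref_saves {
    my ($reffiles, $refs_save) = @_;
    my %plan;
    for my $ref (@$reffiles) {
        # Only references read as DOM or PML can be saved at all.
        next unless ($ref->{readas} || '') =~ /\A(?:dom|pml)\z/;
        if (defined $refs_save) {
            # Only the listed IDs are saved, to the given target URLs.
            next unless exists $refs_save->{ $ref->{id} };
            $plan{ $ref->{id} } = $refs_save->{ $ref->{id} };
        }
        else {
            # No refs_save: everything goes back to its original location.
            $plan{ $ref->{id} } = $ref->{href};
        }
    }
    return \%plan;
}

# Made-up sample data.
my @reffiles = (
    { id => 'a', name => 'adata',  readas => 'pml',   href => 'doc.a.xml'  },
    { id => 'w', name => 'wdata',  readas => 'dom',   href => 'doc.w.xml'  },
    { id => 'v', name => 'vallex', readas => 'trees', href => 'vallex.xml' },
);

my $plan = plan_ref_saves(\@reffiles, { a => 'copy/doc.a.xml' });
print "$_ -> $plan->{$_}\n" for sort keys %$plan;
```

Calling the helper with undef as the second argument models the default case, in which every dom or pml reference is written back to its original href.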

$pml->convert_to_fsfile (fsfile)

Translates the current Treex::PML::Instance object to a Treex::PML::Document object (using Treex::PML::Document MetaData and AppData fields for storage of non-tree data). If the fsfile argument is not provided, creates a new Treex::PML::Document object; otherwise operates on the given fsfile. Returns the resulting Treex::PML::Document object.

$pml->convert_from_fsfile (fsfile)
Treex::PML::Instance->convert_from_fsfile (fsfile)

Translates a Treex::PML::Document object to a Treex::PML::Instance object. Non-tree data are fetched from Treex::PML::Document MetaData and AppData fields. If called on an instance, modifies and returns the instance, otherwise creates and returns a new instance.

Treex::PML::Instance::get_data ($obj,$path)

Retrieve a possibly nested value from the attribute data structure of $obj. The path argument uses an XPath-like expression of the form

   step1/step2/...

where each step (depending on the value retrieved by the preceding part of the expression) can be one of:

name of a member of a structure

to retrieve that member

name of an attribute of a container

to retrieve that attribute

name of an element of a sequence

to retrieve the first element of that name

index of the form [n]

to retrieve the n-th element (counting from 1) from a list, sequence, or an alternative

combination of name and index of the form name[n]

to retrieve the n-th element named 'name' from a sequence

combination of index and name of the form [n]name

to retrieve the n-th element of a sequence provided the n-th element's name is 'name'

In the preceding cases, [n] can be negative, in which case the retrieved value is the n-th element from the end of the list or sequence.

If a step of the form [n] is not given for a list or alternative value then [1] is assumed and the next step is processed.

If the value retrieved by some step is undefined or the step does not match the data type of the value retrieved by the preceding steps, the evaluation is stopped and undef is returned.
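The step forms above can be made concrete with a small classifier. This is a minimal sketch for illustration only: parse_step, parse_path and perl_index are hypothetical names that are not part of Treex::PML, and the library's own path handling may differ. The index n = 0 is not covered by the description above and is not treated specially here.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one path step into one of the forms
# described above.
sub parse_step {
    my ($step) = @_;
    if ($step =~ /\A\[(-?\d+)\]\z/) {          # [n]
        return { kind => 'index', n => $1 };
    }
    if ($step =~ /\A\[(-?\d+)\](.+)\z/) {      # [n]name
        return { kind => 'index_then_name', n => $1, name => $2 };
    }
    if ($step =~ /\A(.+)\[(-?\d+)\]\z/) {      # name[n]
        return { kind => 'name_then_index', name => $1, n => $2 };
    }
    return { kind => 'name', name => $step };  # member, attribute or element name
}

# Hypothetical helper: split a path into classified steps.
sub parse_path {
    my ($path) = @_;
    return [ map { parse_step($_) } split m{/}, $path ];
}

# Path indices count from 1; negative ones count from the end, which is
# what a negative Perl array index does as well.
sub perl_index {
    my ($n) = @_;
    return $n > 0 ? $n - 1 : $n;
}

for my $s (@{ parse_path('foo/bar[2]/[-4]/baz/[5]bam') }) {
    my @fields = map { "$_=$s->{$_}" } grep { exists $s->{$_} } qw(name n);
    print "$s->{kind}: @fields\n";
}
```

The sketch only classifies steps; deciding whether a step matches the data type reached so far, and returning undef when it does not, is left to the real get_data.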

For example,

  my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/[-4]/baz/[5]bam');

is roughly equivalent to

  my $el = $obj->{foo}->values('bar')->[1]->[-4]->{baz}->[4];
  my $value = $el->name eq 'bam' ? $el->value : undef;

but without the side effect of creating array or hash structures where there are none. To be more specific, if, say, $obj->{x} is not defined, then the Perl expression

   if ($obj->{x}[3]{y}) {...}

automatically causes the side effect of creating an ARRAY reference in $obj->{x} and a HASH reference in the fourth element of this ARRAY. An analogous construct

   Treex::PML::Instance::get_data($obj,'x/[4]/y');

simply returns undef without either of these side-effects.
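The side effect described here is Perl's autovivification, and it can be observed in plain Perl without Treex::PML. The following sketch first triggers it with a read-only test and then shows a hand-written guard that avoids it; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $obj = {};                       # $obj->{x} is not defined

if ($obj->{x}[3]{y}) { }            # only a test, yet it creates structures

print "x is an ARRAY ref:    ", (ref($obj->{x})    eq 'ARRAY' ? "yes" : "no"), "\n";
print "x->[3] is a HASH ref: ", (ref($obj->{x}[3]) eq 'HASH'  ? "yes" : "no"), "\n";
print "y was created:        ", (exists $obj->{x}[3]{y}       ? "yes" : "no"), "\n";

# A guarded lookup checks each level before descending, so nothing is
# created along the way.
my $safe  = {};
my $value = (ref($safe->{x}) eq 'ARRAY' && ref($safe->{x}[3]) eq 'HASH')
          ? $safe->{x}[3]{y}
          : undef;

print "guarded hash is still empty: ", (%$safe ? "no" : "yes"), "\n";
```

Only the intermediate levels are created by the unguarded test; the final key y is not. The guarded form has to repeat a check at every level, which is the bookkeeping that a path-based lookup removes.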

The following behave the same (provided that the path /foo/bar[2] retrieves a list, sequence or an alternative and /foo/bar[2]/[1]/baz retrieves a sequence):

  my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/[1]/baz/[1]bam');
  my $value = Treex::PML::Instance::get_data($obj,'foo/bar[2]/baz/bam');

Treex::PML::Instance::get_all($obj, $path)

This function returns all matches of a given attribute path on the object. It works just like Treex::PML::Instance::get_data, except that on attribute-path steps that do not give an exact index it recurses into all values of a list, alt or sequence instead of just the first one. Furthermore, unlike Treex::PML::Instance::get_data, this function expands trailing Lists and Alts: if the path leads to a List or Alt value, the values of its members are returned instead; this replacement is applied recursively.

The expansion of trailing Lists and Alts can be prevented by appending a slash followed by a dot to the attribute path ("$path/.").

Treex::PML::Instance::set_data ($obj,$path,$value,$strict?)

Store a given value to a possibly nested attribute of $obj specified by path. The path argument uses the XPath-like syntax described above for the method Treex::PML::Instance::get_data. If $strict==0 and a non-index step is to be processed on an alternative or list, then step [1] is assumed and the 1st element of the list or alternative is used for further processing of the path expression (except when this occurs in the last step, in which case the entire list or alternative is overwritten by the given value). If $strict==1 and a non-index step is to be processed on an alternative or list, a warning is issued and undef is returned. If $strict==2, the same approach as with $strict==1 is taken, but croak is used instead of warn.

$pml->for_each_match( { path1 => callback1, path2 => callback2,...})
Treex::PML::Instance::for_each_match( $obj, { path1 => callback1, path2 => callback2,...}, \%opts )

This function traverses a given PML data structure and dispatches callbacks at all occurrences of given attribute paths.

If called on an object other than a Treex::PML::Instance (i.e. Treex::PML::Struct, Treex::PML::List, etc.), the corresponding data type (Treex::PML::Schema::* object) can be provided in the \%opts argument as

   { type => $type_decl }

The callback gets one argument: a hash reference of the form

  { value => $matched_obj, path => $matched_obj_path, type => $obj_type_decl }

where $matched_obj_path is the full canonical path to the matching object. The type key is present in the hash only if for_each_match was called on a Treex::PML::Instance or if the Treex::PML::Schema type of the initial object was given in \%opts.

The path syntax is as described in Treex::PML::Instance::get_data, with the following differences:

1. Path steps of the form [n] or name[n], where n is a number, are not supported (but steps of the form [n]name work).

2. Additionally, steps can be separated with //. As in XPath, this indicates the descendant axis, which allows arbitrary structures between the steps; i.e. a//z matches any data matched by a/z, a/b/z, a/b/c/z, etc. One can also use // at the very beginning of an expression (//a/b) to match an arbitrarily nested occurrence of a/b (e.g. one matching x/y/z/a/b).
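The effect of // can be modelled by translating a pattern into a regular expression over slash-separated paths. This is purely an illustrative model, not how for_each_match is implemented: pattern_to_regex is a hypothetical helper, it assumes paths consist of plain names only (no [n]name steps), and it assumes that a pattern without a leading // is anchored at the starting object.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a pattern with optional // separators into a
# regex that matches slash-separated paths of plain names.
sub pattern_to_regex {
    my ($pattern) = @_;
    # A leading // means the match may start at any depth.
    my $anywhere = ($pattern =~ s{\A//}{}) ? 1 : 0;
    # Each remaining // stands for zero or more intermediate steps.
    my @parts  = map { quotemeta($_) } split m{//}, $pattern;
    my $body   = join '(?:/[^/]+)*/', @parts;
    my $prefix = $anywhere ? '(?:[^/]+/)*' : '';
    return qr/\A$prefix$body\z/;
}

my $re = pattern_to_regex('a//z');
for my $path (qw(a/z a/b/z a/b/c/z x/a/z)) {
    printf "%-8s %s\n", $path, ($path =~ $re ? 'matches' : 'no match');
}
```

Under these assumptions the first three paths match a//z and the last one does not, while the pattern //a/z would accept it.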

Treex::PML::Instance::get_all_matches($obj,$path,\%opts)
Treex::PML::Instance::get_all_matches($obj,\@path_list,\%opts)

This function returns all data matching a given path or, if the second argument is an array reference, any of the given paths. The path(s), as well as the $obj and \%opts arguments, are as in Treex::PML::Instance::for_each_match. The function returns an array in array context and an array reference in scalar context.

Treex::PML::Instance::count_matches($obj,$path,\%opts)
Treex::PML::Instance::count_matches($obj,\@path_list,\%opts)

Like Treex::PML::Instance::get_all_matches, but returns only the number of matching objects (without creating any intermediate list).

Treex::PML::Instance::traverse_data($object, $type_decl, $callback, \%options)

Traverses the nested PML content of the given Treex::PML data object (Treex::PML::Instance, Treex::PML::Node, Treex::PML::Struct, etc.). The second argument must be the type of $object, i.e. a Treex::PML::Schema::Decl (or derived). The $callback is a CODE reference (anonymous function) which will get called for each nested value with the following arguments: the value, the type declaration for the value (a Treex::PML::Schema::Decl), and the value of $options{data} passed in by the caller to this method.

Options:

no_childnodes: do not descend into child nodes (role #CHILDNODES)

no_trees: do not descend into lists or sequences with the role #TREES

data: user data passed to the callback

$class_or_instance->validate_object($object, $decl, \%options)

Convenience function which currently just calls:

  $decl->validate_object($object,\%options)

in order to determine whether the object conforms to the data type declaration.

$pml->hash_id (id,object,warn)

Hash a given object under a given ID. If warn is true, then a warning is issued if the ID was already hashed with a different object.

$pml->lookup_id (id)

Lookup an object by ID.

$pml->get_filename ()

Return the filename (string) or URL (URI object) of the PML instance.

$pml->get_url ()

Return URL of the PML instance as URI object.

$pml->set_filename (filename)

Change filename of the PML instance.

$pml->get_transform_id ()

Return ID of the XSL-based transformation specification which was used to convert between an original non-PML format and PML (and back).

$pml->set_transform_id (transform)

Set ID of an XSL-transformation specification which is to be used for conversion from PML to an external non-PML format (and back).

$pml->get_schema ()

Return Treex::PML::Schema object associated with the PML instance.

$pml->set_schema (schema)

Associate a Treex::PML::Schema with the PML instance (this method should not be used for an instance containing data).

$pml->get_schema_url ()

Return URL of the PML schema file associated with the PML instance.

$pml->set_schema_url (url)

Change URL of the PML schema file associated with the PML instance.

$pml->get_root ()

Return the root data structure.

$pml->set_root (object)

Set the root data structure.

$pml->get_trees ()

Return a Treex::PML::List object containing data structures with role '#NODE' belonging in the first block (list or sequence) with role '#TREES' occurring in the PML instance.

$pml->get_trees_prolog ()

If the PML instance consists of a sequence with role '#TREES', return a Treex::PML::Seq object containing the maximal (but possibly empty) initial segment of this sequence consisting of elements with role other than '#NODE'.

$pml->get_trees_epilog ()

If the PML instance consists of a sequence with role '#TREES', return a Treex::PML::Seq object containing all elements of the sequence following the first maximal contiguous subsequence of elements with role '#NODE'.

$pml->get_trees_type ()

Return the type declaration associated with the list of trees.

$pml->get_references_hash ()

Returns a HASHref mapping file reference IDs to URLs.

$pml->set_references_hash (\%map)

Set a given HASHref as a map between reference IDs and URLs.

$pml->get_ref_ids_by_name ($name)

Returns a list of reference IDs associated with a given name.

$pml->get_refs_by_name ($name)

Returns a list of references associated with a given name.

$pml->get_reffiles ()

Returns a list of hash references. Each element represents a document referenced from the current instance. The list contains only references that were associated with a name (pre-declared in the PML schema). However, a 'name' can be associated with several document references. The elements in the list returned by this method have the following keys:

readas

the value of the 'readas' attribute of the corresponding PML schema declaration

name

the symbolic name of the (type of the) reference as declared in the PML schema

href

a URI of the target document

id

an ID used in the current PML instance to refer to the target document

$pml->get_refname_hash ()

Returns a HASHref mapping file reference names to reference IDs. Each value of the hash is either an ID string (if there is just one reference with a given name) or a Treex::PML::Alt containing all IDs associated with a given name.

$pml->set_refname_hash (\%map)

Set a given HASHref as a map between file reference names and reference IDs.

$pml->get_ref (id)

Return a DOM or Treex::PML::Instance object representing the referenced resource with a given ID (applies only to resources declared as readas='dom' or readas='pml').

$pml->set_ref (id,object)

Use a given DOM or Treex::PML::Instance object as a resource of the current Treex::PML::Instance with a given ID (note that this may break knitting).

SEE ALSO

Prague Markup Language (PML) format: http://ufal.mff.cuni.cz/jazz/PML/

Tree editor TrEd: http://ufal.mff.cuni.cz/tred

Related packages: Treex::PML, Treex::PML::Schema, Treex::PML::Document

COPYRIGHT AND LICENSE

Copyright (C) 2006-2010 by Petr Pajas

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.2 or, at your option, any later version of Perl 5 you may have available.