
NAME

SGML::Parser::OpenSP - Parse SGML documents using OpenSP

SYNOPSIS

  use SGML::Parser::OpenSP;

  my $p = SGML::Parser::OpenSP->new;
  my $h = ExampleHandler->new;

  $p->catalogs(qw(xhtml.soc));
  $p->warnings(qw(xml valid));
  $p->handler($h);

  $p->parse("example.xhtml");

DESCRIPTION

This module provides an interface to the OpenSP SGML parser. Both OpenSP and this module are event based: as the parser recognizes parts of the document (say, the start or end of an element), any handlers registered for that type of event are called with suitable parameters.

CONFIGURATION

BOOLEAN OPTIONS

$p->handler([$handler])

Report events to the blessed reference $handler.

$p->show_open_entities([$bool])

Describe open entities in error messages. Error messages always include the position of the most recently opened external entity. The default is false.

$p->show_open_elements([$bool])

Show the generic identifiers of open elements in error messages. The default is false.

$p->show_error_numbers([$bool])

Show message numbers in error messages.

$p->output_comment_decls([$bool])

Generate comment_decl events. The default is false.

$p->output_marked_sections([$bool])

Generate marked section events (marked_section_start, marked_section_end, ignored_chars). The default is false.

$p->output_general_entities([$bool])

Generate general_entity events. The default is false.

$p->map_catalog_document([$bool])

???

$p->restrict_file_reading([$bool])

Restrict file reading to the specified directories. The default is false.

@@state where this is specified.
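As a rough sketch of how the boolean options combine, the following turns on the optional event types and asks for more detailed error messages. It assumes the module is installed and that passing a true value switches an option on, as the [$bool] signatures above suggest.

```perl
use strict;
use warnings;
use SGML::Parser::OpenSP;

my $p = SGML::Parser::OpenSP->new;

# Request the optional events, which are off by default.
$p->output_comment_decls(1);
$p->output_marked_sections(1);
$p->output_general_entities(1);

# Add more context to error messages.
$p->show_open_elements(1);
$p->show_error_numbers(1);
```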

OTHER OPTIONS

$p->catalogs([@catalogs])

Map public identifiers and entity names to system identifiers using the specified catalog entry files. Multiple catalogs are allowed. If there is a catalog entry file called catalog in the same place as the document entity, it will be searched for immediately after those specified.

$p->search_dirs([@search_dirs])

Search the specified directories for files specified in system identifiers. Multiple values are allowed. See the description of the osfile storage manager in the OpenSP documentation for more information about file searching.

$p->include_params([@include_params])

For each name in @include_params pretend that

  <!ENTITY % name "INCLUDE">

occurs at the start of the document type declaration subset in the SGML document entity. Since repeated definitions of an entity are ignored, this definition will take precedence over any other definitions of this entity in the document type declaration. Multiple names are allowed. If the SGML declaration replaces the reserved name INCLUDE then the new reserved name will be the replacement text of the entity. Typically the document type declaration will contain

  <!ENTITY % name "IGNORE">

and will use %name; in the status keyword specification of a marked section declaration. In this case the effect of the option will be to cause the marked section not to be ignored.
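A minimal sketch of that typical case follows. The entity name draft and the DTD fragment in the comment are hypothetical and only illustrate the pattern.

```perl
use strict;
use warnings;
use SGML::Parser::OpenSP;

# Hypothetical DTD fragment in the document being parsed:
#
#   <!ENTITY % draft "IGNORE">
#   <![ %draft; [
#     <!ELEMENT note - - (#PCDATA)>
#   ]]>

my $p = SGML::Parser::OpenSP->new;

# Behaves as if <!ENTITY % draft "INCLUDE"> came first, so the
# marked section above is processed rather than ignored.
$p->include_params(qw(draft));
```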

$p->active_links([@active_links])

???

ENABLING WARNINGS

Additional warnings can be enabled using

  $p->warnings([@warnings])

The following values can be used to enable warnings:

xml

Warn about constructs that are not allowed by XML.

mixed

Warn about mixed content models that do not allow #pcdata anywhere.

sgmldecl

Warn about various dubious constructions in the SGML declaration.

should

Warn about various recommendations made in ISO 8879 that the document does not comply with. (Recommendations are expressed with "should", as distinct from requirements, which are usually expressed with "shall".)

default

Warn about defaulted references.

duplicate

Warn about duplicate entity declarations.

undefined

Warn about undefined elements: elements used in the DTD but not defined.

unclosed

Warn about unclosed start and end-tags.

empty

Warn about empty start and end-tags.

net

Warn about net-enabling start-tags and null end-tags.

min-tag

Warn about minimized start and end-tags. Equivalent to the combination of the unclosed, empty and net warnings.

unused-map

Warn about unused short reference maps: maps that are declared with a short reference mapping declaration but never used in a short reference use declaration in the DTD.

unused-param

Warn about parameter entities that are defined but not used in a DTD. Unused internal parameter entities whose text is INCLUDE or IGNORE won't get the warning.

notation-sysid

Warn about notations for which no system identifier could be generated.

all

Warn about conditions that should usually be avoided (in the opinion of the author). Equivalent to: mixed, should, default, undefined, sgmldecl, unused-map, unused-param, empty and unclosed.

DISABLING WARNINGS

A warning can be disabled by using its name prefixed with no-. Thus calling warnings(qw(all no-duplicate)) will enable all warnings except those about duplicate entity declarations.

The following values for warnings() disable errors:

no-idref

Do not give an error for an ID reference value which no element has as its ID. The effect will be as if each attribute declared as an ID reference value had been declared as a name.

no-significant

Do not give an error when a character that is not a significant character in the reference concrete syntax occurs in a literal in the SGML declaration. This may be useful in conjunction with certain buggy test suites.

no-valid

Do not require the document to be type-valid. This has the effect of changing the SGML declaration to specify VALIDITY NOASSERT and IMPLYDEF ATTLIST YES ELEMENT YES. An option of valid has the effect of changing the SGML declaration to specify VALIDITY TYPE and IMPLYDEF ATTLIST NO ELEMENT NO. If neither valid nor no-valid are specified, then the VALIDITY and IMPLYDEF specified in the SGML declaration will be used.
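The enabling and disabling names can be mixed in one call. The sketch below assumes, by analogy with the warnings(qw(all no-duplicate)) example above, that warnings() accepts any combination of the names listed in these two sections.

```perl
use strict;
use warnings;
use SGML::Parser::OpenSP;

my $p = SGML::Parser::OpenSP->new;

# Enable the "all" group but skip duplicate-entity warnings, force
# type-validity checking, and tolerate dangling ID references.
$p->warnings(qw(all no-duplicate valid no-idref));
```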

PROCESSING FILES

To start processing a document and receive events, call the parse method. It takes one argument specifying the path to a file (not a file handle). You must set an event handler using the handler method before calling parse. The return value of parse is currently undefined.

EVENT HANDLERS

In order to receive data from the parser you need to write an event handler. For example,

  package ExampleHandler;

  sub new { bless {}, shift }

  sub start_element
  {
      my ($self, $elem) = @_;
      printf "  * %s\n", $elem->{Name};
  }

This handler prints all the element names as they are found in the document. For a typical XHTML document this might result in something like

  * html
  * head
  * title
  * body
  * p
  * ...

The events closely match those in the generic interface to OpenSP; see http://openjade.sf.net/doc/generic.htm for more information.

Event names are lowercase, with underscores separating words, and property names are capitalized. Arrays are represented as Perl array references. Position information is not passed to the handler; it is available through the get_location method, which can be called from event handlers. Some redundant information has been stripped, and the generic identifier of an element is stored in the Name hash entry.

For example, for an EndElementEvent the end_element handler gets called with a hash reference

  {
    Name => 'gi'
  }
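Pairing start_element with end_element lets a handler track nesting. The following sketch builds an indented outline of element names. The package name OutlineHandler and its outline method are hypothetical, and the only key it relies on is Name.

```perl
package OutlineHandler;   # hypothetical name
use strict;
use warnings;

sub new {
    my $class = shift;
    return bless { depth => 0, lines => [] }, $class;
}

# Record the element name, indented by the current nesting depth.
sub start_element {
    my ($self, $elem) = @_;
    push @{ $self->{lines} }, ('  ' x $self->{depth}) . $elem->{Name};
    $self->{depth}++;
}

# Leaving an element moves one level back up.
sub end_element {
    my ($self, $elem) = @_;
    $self->{depth}-- if $self->{depth} > 0;
}

# Return the collected outline as a single string.
sub outline { join "\n", @{ $_[0]{lines} } }

package main;
```

An instance would be registered with $p->handler(OutlineHandler->new) before calling parse, and outline read afterwards.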

The following events are defined:

  * appinfo
  * pi
  * start_element
  * end_element
  * data
  * sdata
  * external_data_entity_ref
  * subdoc_entity_ref
  * start_dtd
  * end_dtd
  * end_prolog
  * general_entity       # set $p->output_general_entities(1)
  * comment_decl         # set $p->output_comment_decls(1)
  * marked_section_start # set $p->output_marked_sections(1)
  * marked_section_end   # set $p->output_marked_sections(1)
  * ignored_chars        # set $p->output_marked_sections(1)
  * error
  * open_entity_change

If the documentation of the generic interface to OpenSP states that certain data is not valid, it will not be available through this interface (i.e., the respective key does not exist in the hash ref).
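Because a key may be absent, a handler can guard with exists before using it. The sketch below collects character data. It assumes the data event carries its text under a Data key, following the capitalization rule described above (the generic interface's DataEvent has a data member). The package name TextCollector is hypothetical.

```perl
package TextCollector;   # hypothetical name
use strict;
use warnings;

sub new {
    my $class = shift;
    return bless { text => '' }, $class;
}

# Append character data, skipping events where the key is missing.
# "Data" is an assumed key name; see the note above.
sub data {
    my ($self, $event) = @_;
    return unless exists $event->{Data} && defined $event->{Data};
    $self->{text} .= $event->{Data};
}

# Return everything collected so far.
sub text { $_[0]{text} }

package main;
```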

POSITIONING INFORMATION

Event handlers can call the get_location method on the parser object to retrieve positioning information. The method returns a hash reference with the following properties:

  LineNumber   => ..., # line number
  ColumnNumber => ..., # column number
  ByteOffset   => ..., # number of preceding bytes
  EntityOffset => ..., # number of preceding bit combinations
  EntityName   => ..., # name of the external entity
  FileName     => ..., # name of the file

These can be undef or an empty string.
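Since get_location is a method of the parser, a handler needs access to the parser object. One way, sketched below, is to pass the parser to the handler's constructor. The package name ErrorLogger is hypothetical, and the Message key is assumed from the capitalization rule (the generic interface's ErrorEvent has a message member).

```perl
package ErrorLogger;   # hypothetical name
use strict;
use warnings;

# Keep a reference to the parser so callbacks can ask for the location.
sub new {
    my ($class, $parser) = @_;
    return bless { parser => $parser, messages => [] }, $class;
}

# Location fields can be undef or empty, so substitute "?" for those.
sub _or_unknown {
    my $value = shift;
    return (defined $value && length $value) ? $value : '?';
}

# Format each error as file:line:column: message.
# "Message" is an assumed key name; see the note above.
sub error {
    my ($self, $event) = @_;
    my $loc = $self->{parser}->get_location;
    my ($file, $line, $col) =
        map { _or_unknown($loc->{$_}) } qw(FileName LineNumber ColumnNumber);
    my $text = defined $event->{Message} ? $event->{Message} : '';
    push @{ $self->{messages} }, "$file:$line:$col: $text";
}

# Return the list of formatted messages.
sub messages { @{ $_[0]{messages} } }

package main;
```

It would be wired up as $p->handler(ErrorLogger->new($p)) before calling parse.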

UNICODE SUPPORT

All strings passed to event handlers or returned from helper routines are UTF-8 encoded with the UTF-8 flag turned on. Helper functions like split_message expect (but don't check) that string arguments are UTF-8 encoded and have the UTF-8 flag turned on. The behavior of helper functions is undefined for unexpected input, so avoid passing it.

parse has limited support for binary input, but the binary input must be compatible with OpenSP's generic interface requirements and you must specify the encoding through means available to OpenSP to enable it to properly decode the binary input. Any encoding metadata about such binary input specific to Perl (such as encoding disciplines for file handles when you pass a file descriptor) will be ignored. For more specific information refer to the OpenSP manual.

ENVIRONMENT VARIABLES

OpenSP supports a number of environment variables to control specific processing aspects such as SGML_SEARCH_PATH or SP_CHARSET_FIXED. Portable applications need to ensure that these are set prior to loading the OpenSP library into memory which happens when the XS code is loaded. This means you need to wrap the code into a BEGIN block:

  BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
  use SGML::Parser::OpenSP;
  # ...

Otherwise changes to the environment might not propagate to OpenSP. This applies specifically to Win32 systems.

SGML_SEARCH_PATH

See http://openjade.sourceforge.net/doc/sysid.htm.

SP_HTTP_USER_AGENT

The User-Agent header for HTTP requests.

SP_HTTP_ACCEPT

The Accept header for HTTP requests.

SP_MESSAGE_FORMAT

Enables run-time selection of the message format. The value is one of XML, NONE, or TRADITIONAL. Whether this has an effect depends on a compile-time setting that might not be enabled in your OpenSP build. This module assumes that no such support was compiled in.

SGML_CATALOG_FILES
SP_USE_DOCUMENT_CATALOG

See http://openjade.sourceforge.net/doc/catalog.htm.

SP_SYSTEM_CHARSET
SP_CHARSET_FIXED
SP_BCTF
SP_ENCODING

See http://openjade.sourceforge.net/doc/charset.htm.

Note that you can use the search_dirs method instead of SGML_SEARCH_PATH, the catalogs method instead of SGML_CATALOG_FILES, and attributes on storage object specifications instead of SP_BCTF and SP_ENCODING. For example, if SP_CHARSET_FIXED is set to 1 you can use

  $p->parse("<OSFILE encoding='UTF-8'>example.xhtml");

to process example.xhtml using the UTF-8 character encoding.
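The method-based alternatives can be sketched together as follows. The directory and catalog paths are hypothetical placeholders, and SP_CHARSET_FIXED is still set in a BEGIN block because it has no method equivalent in this documentation.

```perl
# Set before the module (and with it the OpenSP library) is loaded.
BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }

use strict;
use warnings;
use SGML::Parser::OpenSP;

my $p = SGML::Parser::OpenSP->new;

# Hypothetical paths; adjust to where your DTDs and catalogs live.
$p->search_dirs('/usr/share/sgml');       # instead of SGML_SEARCH_PATH
$p->catalogs('/usr/share/sgml/catalog');  # instead of SGML_CATALOG_FILES
```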

KNOWN ISSUES

OpenSP must be compiled with SP_MULTI_BYTE defined and with SP_WIDE_SYSTEM undefined; otherwise this module will fail to compile or break at runtime.

Individual warnings for -wxml are not listed in this POD.

The typemap is crap.

BUG REPORTS

Please report bugs in this module via http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP

Please report bugs in OpenSP via http://sf.net/tracker/?group_id=2115&atid=102115

Please send comments and questions to the spo-devel mailing list, see http://lists.sf.net/lists/listinfo/spo-devel for details.

SEE ALSO

AUTHOR AND COPYRIGHT

  Terje Bless <link@cpan.org> wrote version 0.01.
  Bjoern Hoehrmann <bjoern@hoehrmann.de> wrote version 0.02.

  Copyright (c) 2004 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
  This module is licensed under the same terms as Perl itself.