Brendan Quinn

NAME

Syndication::NewsML -- Parser for NewsML documents

VERSION

Version $Revision: 0.10 $, released $Date: 2002/02/13 14:01:18 $

SYNOPSIS

 use Syndication::NewsML;

 my $newsml = new Syndication::NewsML("myNewsMLfile.xml");
 my $env = $newsml->getNewsEnvelope;

 my $dateAndTime = $env->getDateAndTime->getText;

 foreach my $newsitem ($newsml->getNewsItemList) {
   # do something with the news item
 }
 ...

DESCRIPTION

Syndication::NewsML parses XML files that comply with the NewsML specification, which was created by the International Press Telecommunications Council (http://www.iptc.org).

NewsML is a standard format for the markup of multimedia news content. According to the newsml.org website, NewsML is "An XML-based standard to represent and manage news throughout its lifecycle, including production, interchange, and consumer use."

NewsML differs from simpler news markup and syndication standards such as RSS (see the XML::RSS module on your local CPAN) in that RSS files contain links to stories, whereas NewsML can be used to send links or the story itself, plus any associated information such as images, video or audio files, PDF documents, or any other type of data.

NewsML also offers much more metadata information than RSS, including links between associated content; the ability to revoke, update or modify previously sent stories; support for sending the same story in multiple languages and/or formats; and a method for user-defined metadata known as Topic Sets.

Theoretically you could use RSS to link to articles created in NewsML, although in reality news providers and syndicators are more likely to use a more robust and traceable syndication transport protocol such as ICE (see http://www.icestandard.org).

Syndication::NewsML is an object-oriented Perl interface to NewsML documents. It aims to let users manage and create NewsML documents without any specialised NewsML or XML knowledge.

Initialization

At the moment the constructor can only take a filename as an argument, as follows:

  my $newsml = new Syndication::NewsML("file-to-parse.xml");

This attaches a parser to the file (using XML::DOM) and returns a reference to the first NewsML tag. (I may decide that this is a bad idea and change it soon.)

Reading objects

There are six main types of calls:

  • Return a reference to an array of elements:

      my $topicsets = $newsml->getTopicSetList;

    The array can be referenced as @$topicsets, or an individual element can be referenced as $topicsets->[N].

  • Return an actual array of elements:

      my @topicsets = $newsml->getTopicSetList;

    The array can be referenced as @topicsets, or an individual element can be referenced as $topicsets[N]. In addition, you can iterate through an array by saying something like:

      foreach my $topicset ($newsml->getTopicSetList) {
        ...
      }

  • Return the size of a list of elements:

      my $topicsetcount = $newsml->getTopicSetCount;

  • Get an individual element:

      my $catalog = $topicsets->[0]->getCatalog;

  • Get an attribute of an element (as text):

      my $href = $catalog->getHref;

  • Get the contents of an element (i.e. the text between the opening and closing tags):

      my $urlnode = $catalog->getResourceList->[0]->getUrlList->[0];
      my $urltext = $urlnode->getText;
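
The first two call types differ only in calling context. As a rough illustration of how a single accessor can serve both, here is a minimal sketch using Perl's built-in wantarray. Demo::Node, getChildList and getChildCount are hypothetical stand-ins invented for this example; this is not necessarily how Syndication::NewsML implements its own getXXXList methods.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, not part of Syndication::NewsML.
package Demo::Node;

sub new {
    my ($class, @children) = @_;
    return bless { children => [@children] }, $class;
}

# List context: return the elements themselves.
# Scalar context: return a reference to an array of the elements.
sub getChildList {
    my ($self) = @_;
    my @list = @{ $self->{children} };
    return wantarray ? @list : \@list;
}

# Number of elements, independent of calling context.
sub getChildCount {
    my ($self) = @_;
    return scalar @{ $self->{children} };
}

package main;

my $node = Demo::Node->new(qw(alpha beta gamma));

my $ref   = $node->getChildList;   # scalar context: array reference
my @items = $node->getChildList;   # list context: plain list
my $count = $node->getChildCount;

print "first via reference: $ref->[0]\n";
print "first via array:     $items[0]\n";
print "count:               $count\n";
```

A foreach loop supplies list context, which is why iterating directly over the method call works without dereferencing.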

Not all of these calls work for all elements. For example, if the NewsML DTD defines an element as having zero or one instances in its parent element and you call getXXXList, Syndication::NewsML will "croak" with an error. Similarly, if you call getXXX where the DTD allows an element to occur more than once in that context, Syndication::NewsML will flag an error to the effect that you should be calling getXXXList instead. (The error handling will be improved in the future so that it won't croak fatally -- unless you want that to happen.)
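
Until then, a fatal croak can be trapped with an eval block. The sketch below assumes only that the error is raised through Carp's croak, as described above. Demo::Envelope is a hypothetical stand-in written for this example, and the date value and error message are made-up samples, not the module's actual output.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical stand-in for an element whose child may occur only once.
package Demo::Envelope;

sub new {
    my ($class) = @_;
    return bless { DateAndTime => '20020213T140118Z' }, $class;   # sample value
}

sub getDateAndTime {
    my ($self) = @_;
    return $self->{DateAndTime};
}

# Mimics the documented behaviour: the list form croaks for a singleton.
sub getDateAndTimeList {
    Carp::croak("DateAndTime is a singleton: use getDateAndTime instead");
}

package main;

my $env = Demo::Envelope->new;

# eval BLOCK traps the fatal error; the message ends up in $@.
my $list  = eval { $env->getDateAndTimeList };
my $error = $@;

if ($error) {
    warn "Recovered from: $error";
    $list = [ $env->getDateAndTime ];   # fall back to the singular accessor
}

print "DateAndTime: $list->[0]\n";
```

Copying $@ into a lexical straight after the eval avoids losing the message if later code resets it.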

The NewsML standard contains some "business rules" also written into the DTD: for example, a NewsItem may contain nothing, a NewsComponent, one or more Update elements, or a TopicSet. For some of these rules, the module is smart enough to detect errors and provide a warning. Again, these warnings will be improved and extended in future versions of this module.

Documentation for all the classes

Each NewsML element is represented as a class. This means that you can traverse documents as Perl objects, as seen above.

Full documentation of which classes can be used in which documents is beyond me right now (with over 120 classes to document), so for now you'll have to work with the examples in the examples/ and t/ directories to see what's going on. You should be able to get a handle on it fairly quickly.

The real problem is that it's hard to know when to use getXXX() and when to use getXXXList() -- that is, when an element can have more than one entry and when it is a singleton. Quite often it isn't obvious from looking at a NewsML document. For now, two ways to work this out are to try it and see if you get an error, or to have a copy of the DTD in front of you. Obviously neither of these is optimal, but documenting all 127 classes just so people can tell this difference is pretty scary as well, and so much documentation would put lots of people off using the module. So I'll probably come up with a reference document listing all the classes and methods, rather than docs for each class, in a future release. If anyone has any better ideas, please let me know.
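
In the meantime, the DTD approach can be partly automated. The following is a rough sketch, not part of the module: it reads one element declaration and guesses the accessor name for each child from its occurrence suffix (? or none means getXXX, * or + means getXXXList). The function accessors_for is hypothetical, and the declaration string is written out for illustration and should be checked against your own copy of the DTD. The sketch also ignores a suffix applied to a whole parenthesised group, so treat its output as a hint only.

```perl
use strict;
use warnings;

# Illustrative declaration in DTD syntax; verify it against the real DTD.
my $decl = '<!ELEMENT NewsML (Catalog?, TopicSet*, (NewsEnvelope, NewsItem+))>';

# Hypothetical helper: map each child element to a guessed accessor name.
sub accessors_for {
    my ($declaration) = @_;
    my ($parent, $model) = $declaration =~ /<!ELEMENT\s+(\S+)\s+(.*)>/s
        or die "not an element declaration: $declaration";

    my %accessor;
    # Each child name may be followed by an occurrence suffix: ?, * or +.
    while ($model =~ /([A-Za-z][\w.-]*)([?*+]?)/g) {
        my ($child, $suffix) = ($1, $2);
        next if $child eq 'PCDATA' or $child eq 'EMPTY' or $child eq 'ANY';
        $accessor{$child} = ($suffix eq '*' || $suffix eq '+')
            ? "get${child}List"    # may occur more than once
            : "get$child";         # zero or one, or exactly one
    }
    return ($parent, \%accessor);
}

my ($parent, $accessor) = accessors_for($decl);

print "Children of $parent:\n";
for my $child (sort keys %$accessor) {
    printf "  %-14s %s\n", $child, $accessor->{$child};
}
```

Running this over every ELEMENT declaration in the DTD would give a first draft of the kind of reference list described above.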

BUGS

None that I know of, but there are probably many. The test suite isn't complete, so not every method is tested, but the major ones (seem to) work fine. Of course, if you find bugs, I'd be very keen to hear about them at brendan@clueful.com.au.

SEE ALSO

XML::DOM, XML::RSS, XML::XPath, Syndication::NITF

AUTHOR

Brendan Quinn, Clueful Consulting Pty Ltd (brendan@clueful.com.au)

COPYRIGHT

Copyright (c) 2001, 2002, Brendan Quinn. All Rights Reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.