NAME

Alvis::Canonical - Perl extension for converting documents in various formats into the Alvis canonical document format

SYNOPSIS

 use Alvis::Canonical;

 # Create a new instance, specify the conversion of both numeric and 
 # symbolic character entities to Unicode characters
 my $C=Alvis::Canonical->new(convertCharEnts=>1,
                             convertNumEnts=>1);
 if (!defined($C))
 {
     die("Unable to instantiate Alvis::Canonical.");
 }

 # Convert an HTML document text in UTF-8 to the canonical format.
 # Specify that you want the title and baseURL as well, if any can be
 # determined.
 my ($txt,$header)=$C->HTML($html,
                            {title=>1,
                             baseURL=>1});
 if (!defined($txt))
 {
     die $C->errmsg();
 }

DESCRIPTION

Assumes the input is in UTF-8 and does NOT contain '\0' characters (or rather, that any it does contain carry no meaning and can be removed).
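
If the input might not meet these assumptions, a caller can normalize it before conversion. The following is a minimal sketch and not part of Alvis::Canonical: the helper name prepare_input and the use of iso-8859-1 as the example source encoding are assumptions made for illustration. It uses only the core Encode module and returns UTF-8 encoded bytes, which is one reading of "in UTF-8"; the module's exact expectation is not stated here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Alvis::Canonical): re-encode raw bytes
# from a known source encoding to UTF-8 and drop NUL characters.
sub prepare_input
{
    my ($bytes,$encoding)=@_;

    # Decode the raw bytes into a Perl character string.
    my $text=decode($encoding,$bytes);

    # Remove '\0' characters, which are assumed to carry no meaning.
    $text=~tr/\0//d;

    # Return the result as UTF-8 encoded bytes.
    return encode('UTF-8',$text);
}

# "caf\xE9" is "cafe" with an acute accent in ISO-8859-1;
# the embedded \0 is removed.
my $utf8=prepare_input("caf\xE9\0 au lait",'iso-8859-1');
```

The resulting $utf8 could then be passed as the $html argument of HTML().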

METHODS

new()

Available options:

    warnings         Issue warnings about badly faulty original HTML where
                     we have to resort to a heuristic solution.
                     Prints a warning to STDERR documenting the error and
                     the solution. Default: no.
    convertCharEnts  Convert HTML symbolic character entities to UTF-8 
                     characters? Default: yes.
    convertNumEnts   Convert HTML numerical character entities to UTF-8 
                     characters? Default: yes.
    sourceEncoding   The encoding of the source documents. Default: undef,
                     which means it is guessed.
     
  my $C=Alvis::Canonical->new(convertCharEnts=>1,
                              convertNumEnts=>1);
  if (!defined($C))
  {
    die("Unable to instantiate Alvis::Canonical.");
  }
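
To make concrete what converting numerical character entities means, here is a small standalone sketch. It is an illustration only and NOT the module's implementation: the function name convert_num_ents and the regular expressions are assumptions, and the real conversion may handle more cases (for example malformed references).

```perl
use strict;
use warnings;

# Illustrative only: NOT the implementation used by Alvis::Canonical.
# Replace hexadecimal (&#xE9;) and decimal (&#233;) numeric character
# references with the characters they denote.
sub convert_num_ents
{
    my ($text)=@_;

    # Hexadecimal references first, e.g. &#x263A;
    $text=~s/&#[xX]([0-9a-fA-F]+);/chr(hex($1))/ge;

    # Then decimal references, e.g. &#233;
    $text=~s/&#([0-9]+);/chr($1)/ge;

    return $text;
}

my $converted=convert_num_ents('caf&#233; &#x263A;');
# $converted now holds the character string "caf\x{E9} \x{263A}".
```

With convertNumEnts set to a false value, such references would presumably be left in the text as they are.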

HTML($html,$options)

Converts dirty HTML to a valid Alvis canonicalDocument. $options is a hash reference that serves as the mechanism for returning the title and base URL of the document: to request their extraction, set the fields 'title' and 'baseURL' to a defined value. If you know the encoding of the source document, set the option 'sourceEncoding', e.g.

  my ($txt,$header)=$C->HTML($html,
                             {title=>1,
                              baseURL=>1,
                              sourceEncoding=>'iso-8859-2'});

errmsg()

Returns the stack of error messages, if any; otherwise returns an empty string.

SEE ALSO

Alvis::Convert

AUTHOR

Kimmo Valtonen, <kimmo.valtonen@hiit.fi>

COPYRIGHT AND LICENSE

Copyright (C) 2006 by Kimmo Valtonen

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.