The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Alvis::Encoding - Perl extension for guessing and checking the encoding of documents.

SYNOPSIS

 use Alvis::Encoding;

 # Create a new instance
 my $e=Alvis::Encoding->new();
 if (!defined($e))
 {
    die "Instantiating Alvis::Encoding failed.";
 }

 # Check that a (decimal) character code is legal UTF-8
 my $code=55;
 if (!$e->code_is_utf8($code))
 {
    # The message will contain the position and the offending character's code 
    die $e->errmsg();
 }

 # Check that a text is legal UTF-8
 my $text;
 if (!$e->is_utf8($text))
 {
    # The message will contain the position and the offending character's code 
    die $e->errmsg();
 }

 # If you need to obtain the position (1..) and the offending character,
 # pass a placeholder in a hash ref argument:
 my %err=();
 if (!$e->is_utf8($text,\%err))
 {
    my $position=$err{pos};
    my $code=$err{code};
    . . . 
 }

 # 
 # Guess the encoding of a document given a guess for its type 
 #
 my $type_guesser=Alvis::Document::Type->new();
 my ($doc_type,$doc_sub_type)=$type_guesser->guess($text);
 my $doc_encoding=$e->guess($text,$doc_type,$doc_sub_type);
 if (!defined($doc_encoding))
 {
     die('Cannot guess. ' . $e->errmsg());
 }

 # 
 # Try converting a document to UTF-8 with only its type known
 #
 my $type_guesser=Alvis::Document::Type->new();
 my ($doc_type,$doc_sub_type)=$type_guesser->guess($text);
 my $doc_in_utf8=$e->try_to_convert_to_utf8($text,$doc_type,$doc_sub_type);
 if (!defined($doc_in_utf8))
 {
     die('Cannot guess. ' . $e->errmsg());
 }
 
 # Try to guess what was meant 
 my @possibilities=$e->guess_typo_fixes('uft-8');

DESCRIPTION

A collection of methods for guessing, confirming and fixing the encoding of a document.

METHODS

new()

Options:

    defaultDocType       default type for a document. Default: text.
    defaultDocSubType    default sub type for a document. Default: html.
    defaultEncoding      default encoding for a document. Default: iso-8859-1.

code_is_utf8(decimal_code)

Returns 1 if the (decimal) character code is legal UTF-8.

is_utf8(text,err_hash_ref)

Returns 1 if all of the characters of $text are legal UTF-8 Else, returns 0 and sets an error message specifying the location (1..) of the first illegal character code If you wish to obtain the position and offending code, pass a hash ref ($err_hash_ref). The info is in $err_hash_ref->{pos} and $err_hash_ref->{code}.

guess(text,doc_type,doc_sub_type)

Guess the encoding of a document given a guess for its type (and subtype).

guess_and_convert(text,doc_type,doc_sub_type,target_encoding)

Tries to first guess the encoding of the document given a guess at its type and subtype, and then tries to convert it to $target_encoding.

convert(text,source_encoding,target_encoding)

Tries to convert $text from $source_encoding to $target_encoding.

guess_typo_fixes($typo)

Returns a set of guesses for the meant encoding in a case of an encoding name containing typos.

errmsg()

Returns a stack of error messages, if any. Empty string otherwise.

SEE ALSO

Alvis::Document::Type

AUTHOR

Kimmo Valtonen, <kimmo.valtonen@hiit.fi>

COPYRIGHT AND LICENSE

Copyright (C) 2006 by Kimmo Valtonen

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.