NAME

Encode::Detective - guess the encoding of text

SYNOPSIS

use Encode::Detective 'detect';
my $encoding = detect ($data);
# Now $encoding contains a guess of the encoding of $data.

DESCRIPTION

This module guesses the character set of input data. It is similar to Encode::Guess, but does not require a list of expected encodings.
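For comparison, the following sketch shows the Encode::Guess style of call, in which the candidate encodings must be supplied by the caller. The helper name guess_from_candidates and the particular list of candidates are assumptions made for illustration; they are not part of either module.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical helper: return the name of the guessed encoding,
# or undef if Encode::Guess cannot settle on one candidate.
sub guess_from_candidates {
    my ($bytes, @candidates) = @_;
    my $decoder = guess_encoding ($bytes, @candidates);
    # On failure guess_encoding returns an error string, not an object.
    return undef unless ref $decoder;
    return $decoder->name;
}

my $name = guess_from_candidates ("hello", qw/euc-jp shiftjis 7bit-jis/);
```

With Encode::Detective, the corresponding call is detect ($bytes), with no list of candidates.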

FUNCTIONS

detect

my $encoding = detect ($text);

Given some bytes of text, $text, this guesses what encoding they are in. If the encoding cannot be guessed, or if the text appears to be plain ASCII, the undefined value is returned. $encoding can then be passed to the decode function of Encode:

use Encode 'decode';
use Encode::Detective 'detect';
my $encoding = detect ($text);
# $encoding is undefined for ASCII or undetectable input.
if ($encoding) {
    $text = decode ($encoding, $text);
}
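Because detect returns the undefined value both for ASCII and for input it cannot identify, callers usually need a fallback. The following is a minimal sketch of one way to handle that. The helper decode_or_fallback is hypothetical and not part of Encode::Detective; it takes the detector as a code reference, and the choice of fallback encoding is left to the caller.

```perl
use strict;
use warnings;
use Encode 'decode';

# Hypothetical helper, not part of Encode::Detective.
# $detector is a code reference which returns an encoding name or undef;
# $fallback is the encoding to assume when nothing is detected.
sub decode_or_fallback {
    my ($bytes, $detector, $fallback) = @_;
    my $encoding = $detector->($bytes);
    # An undefined result means ASCII or undetectable input.
    $encoding = $fallback unless defined $encoding;
    return decode ($encoding, $bytes);
}
```

Assuming detect has been imported with use Encode::Detective 'detect', this could be called as decode_or_fallback ($bytes, \&detect, 'ascii').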

DETECTED AND UNDETECTED ENCODINGS

Detected encodings

The following encodings are detected:

UTF-8

Unicode.

EUC-JP

Japanese Extended Unix Code.

Big5

Traditional Chinese encoding.

Shift_JIS

Japanese Microsoft encoding.

EUC-KR

Korean Extended Unix Code.

EUC-TW

Extended Unix Code for traditional Chinese, as used in Taiwan.

windows-1251

Cyrillic encoding.

windows-1255

Hebrew encoding.

windows-1252

Western European encoding, covering French and similar languages.

Undetected encodings

The following character sets are not detected:

mac roman

A Macintosh encoding incorporating some European letters.

CP932

An extension of Shift_JIS which is more common in practice than strict Shift_JIS. It has code points for characters such as ① (circled one) which do not exist in Shift_JIS.

TIS-620

A Thai encoding.
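Since CP932 is largely an extension of Shift_JIS, one possible workaround for Japanese text reported as Shift_JIS is to decode it as CP932 instead. The following is a minimal sketch under that assumption. The helper decode_japanese is hypothetical and not part of Encode::Detective, and this documentation does not confirm what detect returns for CP932 input.

```perl
use strict;
use warnings;
use Encode 'decode';

# Hypothetical helper: decode $bytes using $encoding, the value
# returned by detect, treating a Shift_JIS guess as CP932.
sub decode_japanese {
    my ($bytes, $encoding) = @_;
    # Nothing was detected, so there is nothing to decode with.
    return undef unless defined $encoding;
    # Assumption: CP932 covers the Shift_JIS characters in the input.
    $encoding = 'cp932' if $encoding eq 'Shift_JIS';
    return decode ($encoding, $bytes);
}
```

A few characters are mapped differently by CP932 and strict Shift_JIS, so this substitution is not appropriate for every application.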

BUGS

Lacks regression tests

The module needs many more regression tests before any work can be done on altering the underlying algorithms. The existing library already has checks for some of the encodings which are not detected. These checks are currently switched off, possibly because it was not possible to detect those encodings unambiguously. Without solid regression tests, no patches can be applied to the detection code, since it would not be clear whether the patches damage the detection of the existing encodings.

Please send example files in various encodings. Either fork the repository and add files in the directory t/samples, or send them to the module maintainer at <bkb@cpan.org>.

TIS-620

TIS-620 does not seem to be detected.

Documentation of detection

The documentation of detected encodings above is not complete.

HISTORY

Encode::Detective is based on the C++ library for character set detection in the Firefox web browser. This library used to be available as a standalone library. Unfortunately, as of 2012, the Firefox code has been integrated into the browser code and can no longer be used as a standalone library, so this module has become a fork of the original Mozilla code.

Encode::Detective is a fork of Encode::Detect. It removes almost all of the interface of Encode::Detect except the single function "detect". This fork is intended to improve the compilation of the module on various systems. It was released to CPAN in order to make use of CPAN testers.

SEE ALSO

edetect

The edetect standalone script installed with Encode::Detective can guess the encodings of files.

Online demonstration

LeMoDa.net offers an online detection service, which also checks the HTTP response header and the meta tag (the tag containing charset=) of the page.

Encode::Guess

Encode::Guess is a Perl module which does something similar to Encode::Detective. It is slightly different in that it requires a list of candidate encodings.

Encode::Detect

Encode::Detect is the original version of this module. Whereas this module offers a single function, "detect", Encode::Detect has facilities to create an object, make multiple reads, and detect end of file.

IO::HTML

IO::HTML uses byte order mark inspection and inspection of HTML to detect encodings of web pages (see http://en.wikipedia.org/wiki/Byte_order_mark).

C++ and Perl XS

There is a short description of the method used in this module to combine C++ and Perl XS at http://www.lemoda.net/perl/xs-and-cplusplus/.

AUTHORS

Encode::Detective is based on Encode::Detect by John Gardiner Myers <jgmyers@proofpoint.com>. It was forked by Ben Bullock <bkb@cpan.org>.

LICENCE

This Perl module may be used, copied, modified and redistributed under the terms of the Mozilla Public License version 1.1, the GNU General Public License, or the LGPL.