ANNOTATION

Lingua::DetectCyrillic. The package detects 7 Cyrillic codings as well as the language - Russian or Ukrainian. Uses embedded frequency dictionaries; usually one word is enough for correct detection.

SYNOPSIS

  use Lingua::DetectCyrillic;
  # or, if you need the translation functions:
  use Lingua::DetectCyrillic qw ( &TranslateCyr &toLowerCyr &toUpperCyr );

  # Create a new Lingua::DetectCyrillic object. By default, no more than 100
  # Cyrillic tokens (words) are analyzed and Ukrainian is not detected.
  $CyrDetector = Lingua::DetectCyrillic ->new();

  # The same, but analyze at most 200 tokens and detect both Russian and
  # Ukrainian.
  $CyrDetector = Lingua::DetectCyrillic ->new( MaxTokens => 200, DetectAllLang => 1 );

  # Detect coding and language
  my ($Coding,$Language,$CharsProcessed,$Algorithm)= $CyrDetector -> Detect( @Data );

  # Write report
  $CyrDetector -> LogWrite(); #write to STDOUT
  $CyrDetector -> LogWrite('report.log'); #write to file

  # Translating to Lower case assuming the source coding is windows-1251
  $s=toLowerCyr($String, 'win');
  # Translating to Upper case assuming the source coding is windows-1251
  $s=toUpperCyr($String, 'win');
  # Converting from one coding to another
  # Acceptable coding definitions are win, koi, koi8u, mac, iso, dos, utf
  $s=TranslateCyr('win', 'koi',$String);

See USAGE DETAILS for additional information on the usage of this package.

DESCRIPTION

This package automatically detects all Cyrillic codings in live use - windows-1251, koi8-r, koi8-u, iso-8859-5, utf-8, cp866, x-mac-cyrillic - as well as the language, Russian or Ukrainian. It applies 3 detection algorithms: formal analysis of alphabet hits, frequency analysis of words and frequency analysis of 2-letter combinations.

It also provides routines for conversion between different codings of Cyrillic texts which can be imported if necessary.

The package can detect the coding from only one or two words. Certainly, with a single word the reliability will be low, especially if the test words are written completely in lower- or uppercase, as capitalization is a very important attribute for coding detection. Nevertheless the package correctly recognizes the coding of a message containing one single word, even all lowercase - 'privet' ('hello' in Russian), 'ivan', 'vodka', 'sputnik'. ;-)))

Ukrainian is reported only if the text contains letters specific to Ukrainian.
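As a rough illustration of such a check, here is a minimal sketch that works on an already decoded string. The helper name guess_language is hypothetical, and the set of letters (the four Ukrainian letters absent from the Russian alphabet) is an assumption about what "specific" means; the package's own test may differ.

```perl
use strict;
use warnings;
use utf8;

# Letters of the Ukrainian alphabet that do not occur in Russian.
# Whether the package checks exactly this set is an assumption.
my $ukr_only = qr/[ҐґЄєІіЇї]/;

# Hypothetical helper: guess the language of an already decoded string,
# using the labels Rus, Ukr and NoLang that Detect returns.
sub guess_language {
    my ($text) = @_;
    return 'NoLang' unless $text =~ /\p{Cyrillic}/;
    return $text =~ $ukr_only ? 'Ukr' : 'Rus';
}

my $rus  = guess_language('Привет, мир');     # Russian greeting
my $ukr  = guess_language('Привіт, світе');   # Ukrainian greeting
my $none = guess_language('hello');           # no Cyrillic at all

print "$rus $ukr $none\n";
```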

Performance is good because the analysis runs in two stages. The first stage is only a fast formal analysis of proper capitalization and alphabet hits. Only if these data are not sufficient is the input analyzed a second time, against the frequency dictionaries.

DEPENDENCIES

So far the package requires Unicode::String and Unicode::Map8, which can be downloaded from http://www.cpan.org. Additional information on the packages to be installed is given below.

I plan to implement my own support for character decoding, so these packages will not be required in future releases.

  • Unicode::Map8

    Basic package for conversion between different one-byte codings. Available at http://www.cpan.org.

    Warning! This module must first be compiled with a C++ compiler. Under Unix this goes smoothly and needs no comment, but under Win32 with ActiveState Perl you must

      • use MS Visual C++ and

      • make some manual changes to the source after running Makefile.PL.

    Open map8x.c and change line 97 from

        ch = PerlIO_getc(f);

    to

        ch = getc(f);

    In short, you need to replace the Perl wrapper for the C function getc with the function itself. The compiler produces warnings, but you end up with a 100% working DLL.

  • Unicode::String

    Provides support for Unicode::Map8. Available at http://www.cpan.org.

USAGE DETAILS

  • Create a Lingua::DetectCyrillic object

      $CyrDetector = Lingua::DetectCyrillic ->new();
      $CyrDetector = Lingua::DetectCyrillic ->new( MaxTokens => 100, DetectAllLang => 1 );

    MaxTokens - the package stops analyzing the input once the given number of Cyrillic tokens is reached. There is no need to analyze all 100 or 200 thousand bytes of the input if the coding and the language can easily be determined from the first 100 tokens. If not specified, this argument defaults to 100.

    DetectAllLang - by default the package assumes Russian only. Setting this parameter to any non-zero value turns on analysis for two languages, Russian and Ukrainian. This slows down performance by nearly 10% and may in rare cases result in worse coding detection.

  • Pass an array of strings to the object method Detect:

      my ($Coding,$Language,$CharsProcessed,$Algorithm)= $CyrDetector -> Detect( @Data );

    $Coding

    - windows-1251, koi8-r, iso-8859-5, utf-8, cp866 or x-mac-cyrillic. If the input does not contain a single Cyrillic character, iso-8859-1 is returned. If DetectAllLang > 0, koi8-u may be returned as well.

    $Language

    - Rus or, if DetectAllLang > 0, possibly Ukr. If the input does not contain a single Cyrillic character, NoLang is returned (I can't state for sure whether such a language is English, German, French or any other ;-).

    $CharsProcessed

    - the number of characters processed in the most probable coding. Useful for estimating the reliability of the result: if the program found only 3-4 Cyrillic characters in the input, the results can hardly be trusted.

    $Algorithm

    - a numeric code showing at which stage the program was satisfied with the results and stopped further analysis. Useful for debugging. If you report errors to me, please refer to this code. For a more detailed explanation see the table Algorithm codes explanation.

  • Write a report, if you want

      $CyrDetector -> LogWrite(); #write to STDOUT
      $CyrDetector -> LogWrite('report.log'); #write to file

    If the single argument is omitted or equal to stdout (in upper- or lowercase), the program writes the report to STDOUT; otherwise it writes to the named file.

HOW IT WORKS

Stage 1. Formal analysis of alphabet hits and capitalization

When I started programming, I proceeded from an obvious fact: a 'human' reader can determine the coding and language at a glance, or can at least tell that the text is displayed in the wrong coding. The point is that the alphabets, i.e. the letter ranges, of most Cyrillic codings do not coincide, so if we display text in the wrong coding we inevitably see messy characters inside words that cannot be typed in a standard way with a Russian or Ukrainian keyboard layout: currency signs, punctuation marks, Serbian letters, sometimes binary characters and so on.

In fact there is only one hard case: the two most popular Cyrillic codings, windows-1251 and koi8-r, have their alphabets in the same range from 192 to 255, but the uppercase letters of windows-1251 sit on the codes of the lowercase letters of koi8-r and vice versa. So 'Ivan Petrov' in one of these codings will look like 'iVAN pETROV' in the other, i.e. it will have absolutely wrong capitalization, which can also be easily detected by formal analysis of the characters. And as you may guess, any more or less consistent Cyrillic text must have at least one word starting with a capital letter (I don't take into consideration some weird Internet inhabitants WRITING ALL WITH CAPITAL LETTERS ;-).
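The case swap is easy to observe with the core Encode module (not with this package's own routines). The sketch below decodes the same windows-1251 bytes in both codings and prints the case pattern of each result; case_pattern is a hypothetical helper. Note that the letters themselves also change under the wrong coding; only the inverted capitalization is shown here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The Russian name "Ivan" as windows-1251 bytes:
# one capital letter followed by three lowercase letters.
my $bytes = "\xC8\xE2\xE0\xED";

my $right = decode('cp1251', $bytes);   # decoded in the correct coding
my $wrong = decode('koi8-r', $bytes);   # decoded in the wrong coding

# Hypothetical helper: describe each character as U (upper) or l (lower).
sub case_pattern {
    my ($text) = @_;
    return join '', map { $_ =~ /\p{Lu}/ ? 'U' : 'l' } split //, $text;
}

my $right_pattern = case_pattern($right);
my $wrong_pattern = case_pattern($wrong);

print "cp1251: $right_pattern\n";   # cp1251: Ulll
print "koi8-r: $wrong_pattern\n";   # koi8-r: lUUU
```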

This formal analysis is very fast and works for 99.9% of real texts. Wrong codings are easily filtered out and we get only one 'absolute winner'. The method is also reliable: I can hardly imagine a normal person writing in reverse capitalization. But what if we have only a few words and all of them are in upper- or lowercase?
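A minimal sketch of the idea behind Stage 1, not the package's actual code: decode the input in every candidate coding with the core Encode module and count the tokens that look like normally written Russian words. The scoring rule, the helper formal_score and the list of candidates are assumptions made for illustration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Candidate one-byte codings, under the names the core Encode module knows.
my @candidates = qw(cp1251 koi8-r iso-8859-5 cp866 MacCyrillic);

# Hypothetical score: the number of tokens that consist only of Russian
# letters with normal capitalization (optional capital, then lowercase).
sub formal_score {
    my ($text) = @_;
    my @tokens = split ' ', $text;
    return scalar grep { /^[А-ЯЁ]?[а-яё]+$/ } @tokens;
}

# Test input: "Ivan Petrov" in Russian, stored as koi8-r bytes.
my $bytes = encode('koi8-r', 'Иван Петров');

my %score;
$score{$_} = formal_score(decode($_, $bytes)) for @candidates;

# The coding with the most well-formed words wins.
my ($best) = sort { $score{$b} <=> $score{$a} } @candidates;

print "$_: $score{$_}\n" for @candidates;
print "best guess: $best\n";
```

With this input only the koi8-r decoding yields two well-formed words; every other candidate produces reversed capitalization or characters outside the Russian alphabet.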

Stage 2. Frequency analysis of words and 2-letter combinations.

In this case we apply frequency analysis of words and of 2-letter combinations, also called hashes (not in the Perl sense, certainly ;-).

The package has dictionaries of the 300 most frequent Russian and Ukrainian words and of nearly 600 most frequent Russian and Ukrainian 2-letter combinations, which I built myself. The input texts were perhaps not very typical of Internet authors, but any linguist can assure you that this does not matter much: the first few hundred most popular words of any language are very stable, to say nothing of letter combinations.

So the text is analyzed a second time (this shouldn't take too long, as we get into this situation only with a very short text); all the Cyrillic letters are analyzed, whatever their capitalization. If at least one dictionary word is found, the coding is determined from it; otherwise it is determined by comparing the letter hashes.

In some very rare cases (usually in a very artificial situation when we have only one short word written all in lower- or uppercase) the statistics on several codings are equal. In this case we prefer windows-1251 to mac, koi8-r to koi8-u and - if nothing helps - windows-1251 to koi8-r.
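The following sketch illustrates frequency scoring on 2-letter combinations together with the tie-breaking rule. It is not the package's code: the ten-entry dictionary is a tiny illustrative stand-in for the real one, bigram_score is a hypothetical helper, and the candidate order simply puts windows-1251 first so that it wins ties.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Candidates in order of preference; on equal scores the earlier one wins.
my @candidates = qw(cp1251 koi8-r iso-8859-5 cp866 MacCyrillic);

# Tiny stand-in for a dictionary of frequent 2-letter combinations.
# These ten entries are illustrative, not the package's real data.
my %frequent = map { ($_ => 1) } qw(ст но то на ен пр ри ве ет ов);

# Count how many adjacent character pairs of the text are in the dictionary.
sub bigram_score {
    my ($text) = @_;
    $text = lc $text;
    my $hits = 0;
    for my $i (0 .. length($text) - 2) {
        $hits++ if $frequent{ substr($text, $i, 2) };
    }
    return $hits;
}

# A single all-lowercase word ("privet") as windows-1251 bytes.
my $bytes = encode('cp1251', 'привет');

my ($best, %score);
for my $coding (@candidates) {
    $score{$coding} = bigram_score(decode($coding, $bytes));
    # Strictly greater: a later candidate never displaces an equal earlier one.
    $best = $coding if !defined $best || $score{$coding} > $score{$best};
}

print "$_: $score{$_}\n" for @candidates;
print "best guess: $best\n";
```

Because the lowercase letters of this word have the same codes in windows-1251 and x-mac-cyrillic, both decodings get the same score, and the preference order decides in favor of windows-1251.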

To find out which algorithm was applied, look at the 4th value returned by the function Detect, $Algorithm. It is explained in more detail in the table Algorithm codes explanation.

REFERENCE INFORMATION

Modern Cyrillic codings and where they are used

    The supported codings are:

    * windows-1251

    This is the most popular Cyrillic coding nowadays, used on nearly 99% of PCs. The full alphabet starts at code 192 (uppercase A), as in most Microsoft character sets for national languages, and ends at code 255 (lowercase ya). It also contains characters for Ukrainian, Belarusian and other languages based on the Cyrillic alphabet. Can be easily sorted etc.

    * koi8-r

    Transliterated coding; a terrible remnant of the old 7-bit world. First another coding, koi7-r, was designed, in which Russian characters took the places of similar English ones, for example Russian A in place of English A, Russian R (er) in place of English R etc. Even if there were no Cyrillic fonts at all, the text still stayed readable. Koi8-r is in essence the same archaic koi7-r with the characters shifted to the extended part of the ASCII table. Koi8-r is still used on Unix-based computers and is therefore the second most popular Russian coding on the Net.

    * koi8-u

    The same as koi8-r, but with Ukrainian characters added.

    * utf-8

    A good textual representation of Unicode text. Basic characters (codes < 128) are represented with one byte each, while Cyrillic letters, like those of most European alphabets, are represented with two-byte sequences. See RFC 2279 'UTF-8, a transformation format of ISO 10646' (since superseded by RFC 3629) for detailed information.

    * iso-8859-5

    Though this coding is approved by ISO, it is used only on some rare Unix systems, for example on Russian Solaris. In all my time on the Net I have met only one or two people working on computers like these.

    * cp866

    Also called 'alternative' coding. Used under DOS and Russian OS/2.

    * x-mac-cyrillic

    Macintosh coding. Its lowercase letters almost completely coincide with windows-1251 (except 2 characters), so in some rare cases x-mac-cyrillic can be confused with windows-1251. On the Internet this coding has almost died out; its share is absolutely insignificant. On the PC platform it is supported by default only under Windows NT+.
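To see where the alphabets of these codings sit, the sketch below prints the bytes of uppercase A and lowercase ya in each of them, and then clears the high bit of a koi8-r string to show the transliteration property described above. It uses the core Encode module and its coding names rather than this package; hex_bytes is a hypothetical helper.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Coding names as understood by the core Encode module.
my @codings = qw(cp1251 koi8-r koi8-u iso-8859-5 cp866 MacCyrillic UTF-8);

# Hypothetical helper: hex bytes of a character string in the given coding.
sub hex_bytes {
    my ($coding, $text) = @_;
    my $bytes = encode($coding, $text);
    return join ' ', map { sprintf('%02X', ord $_) } split //, $bytes;
}

# Where uppercase A and lowercase ya land in each coding.
my %upper_a  = map { ($_ => hex_bytes($_, 'А')) } @codings;
my %lower_ya = map { ($_ => hex_bytes($_, 'я')) } @codings;
printf "%-12s A=%-6s ya=%s\n", $_, $upper_a{$_}, $lower_ya{$_} for @codings;

# koi8-r keeps the koi7-r layout: clearing the high bit of every byte leaves
# a Latin transliteration of the word "Privet", with the letter case inverted.
my $koi8  = encode('koi8-r', 'Привет');
my $seven = join '', map { chr(ord($_) & 0x7F) } split //, $koi8;
print "$seven\n";   # pRIWET
```

In windows-1251 the two letters come out as C0 and FF, i.e. codes 192 and 255, while utf-8 needs two bytes for each of them.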

Algorithm codes explanation

HISTORY

December 01, 2002 - Extensive Russian documentation added. Version changed to 0.02.

November 19, 2002 - version 0.01 released.

TODO

The author: Alexei Rudenko, Russia, Moscow. My home phone is (095) 468-95-63

Web-site: http://www.bible.ru/DetectCyrillic/

CPAN address: http://search.cpan.org/author/RUDENKO/

Email: rudenko@bible.ru

Copyright (c) 2002 Alexei Rudenko. All rights reserved.
