
NAME

CCCP::Encode - Perl extension for converting UTF-8 text to any Cyrillic encoding (koi8-r, windows-1251, etc.)

Version 0.03

SYNOPSIS

    use CCCP::Encode;
    
    $CCCP::Encode::ToText = 0; # default
    $CCCP::Encode::Entities = 'xml'; # default    
    my $str = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
    print CCCP::Encode->utf2cyrillic($str,'koi8-r');
    # output in koi8-r:
    # если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО
    
    $str = "Иероглифы: 牡 マ キ グ ナ ル フ";
    print CCCP::Encode->utf2cyrillic($str,'windows-1251');
    # output in windows-1251:
    # Иероглифы: 牡 マ キ グ ナ ル フ 
        
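    # --------------------------
    
    $CCCP::Encode::ToText = 0; # default
    $CCCP::Encode::Entities = 'html';
    # switch back to the first sample string
    $str = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
    print CCCP::Encode->utf2cyrillic($str,'koi8-r');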
    # output in koi8-r:
    # если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО

    $str = "Иероглифы: 牡 マ キ グ ナ ル フ";
    print CCCP::Encode->utf2cyrillic($str,'windows-1251');
    # output in windows-1251:
    # Иероглифы: 牡 マ キ グ ナ ル フ
    
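    # --------------------------
    
    $CCCP::Encode::ToText = 1;
    # switch back to the first sample string
    $str = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
    print CCCP::Encode->utf2cyrillic($str,'koi8-r');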
    # output in koi8-r:
    # если в слове 'хлеб' поменять 4 буквы, то получится -- ПИВО  
    
    $CCCP::Encode::CharMap = {"\x{2014}" => '-'};
    print CCCP::Encode->utf2cyrillic($str,'koi8-r');
    # output in koi8-r:
    # если в слове 'хлеб' поменять 4 буквы, то получится - ПИВО  
    

DESCRIPTION

This module converts a UTF-8 string to a Cyrillic encoding in one of two modes:

  • convert to a Cyrillic string with HTML entities,

  • convert to a Cyrillic string containing only plain-text characters.

By default, unknown characters are replaced using HTML::Entities in entity mode and Text::Unidecode in plain-text mode. You can override the replacement map for any character, and you can override the regular expression that selects which characters are replaced.

INTRODUCTION

Ajax libraries on the frontend send data in UTF-8. If your backend uses koi8-r, windows-1251, or a similar encoding, you have a problem:

    use Encode;
    ...
    my $data = $post->param('any');
    # $data = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
    Encode::from_to($data,'utf-8','koi8-r');
    print $data;
    # output:
    # если в слове 'хлеб' поменять 4 буквы, то получится ? ПИВО

The from_to function from the Encode module replaces unknown characters with '?'. This data is then saved to your database, and you end up writing hacky workaround code to fix the problem. Every developer whose database is not in UTF-8 knows this problem.

Another case:

Fetching data from RSS feeds in UTF-8 and saving it in a Cyrillic database (for example, MySQL with the default charset koi8-r or windows-1251).

CCCP::Encode fixes this problem.
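
The substitution problem described above can be reproduced with core Encode alone. The following is a minimal sketch, not part of CCCP::Encode; the sample string and the variable names are chosen only for illustration.

```perl
use strict;
use warnings;
use utf8;                       # string literals below are UTF-8 source text
use Encode qw(encode decode);

# Simulate the UTF-8 octets a frontend would send.
my $data = encode('UTF-8', "получится \x{2014} ПИВО");

# Convert in place; U+2014 (em dash) has no koi8-r code point.
Encode::from_to($data, 'utf-8', 'koi8-r');

# Decode back to characters only to inspect the result.
my $seen = decode('koi8-r', $data);
my $lost = ($seen =~ tr/?//);   # count the substituted characters
print "substituted characters: $lost\n";   # substituted characters: 1
```

Once the '?' is stored, the original character cannot be recovered, which is why the replacement has to happen at conversion time.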

METHODS

utf2cyrillic($str,$to)

$str is the string to convert. $to is the target encoding name, analogous to $to in Encode::from_to($str,'utf-8',$to).

PACKAGE VARIABLES

$CCCP::Encode::Entities

Ignored if $CCCP::Encode::ToText is true. The default value is 'xml'. In 'xml' mode, every character unknown in the target charset is replaced with a valid XML numeric entity (for example, the numeric entity for the em dash, U+2014). In 'html' mode, every such character is replaced with an HTML entity (for example, the HTML entity for the em dash).
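
To give a feel for what numeric entity replacement looks like, here is a sketch that uses only the fallback constants of core Encode. It is an illustration under assumptions: it does not call CCCP::Encode, and the exact entity form that CCCP::Encode emits in each mode (hexadecimal, decimal, or named) may differ from what Encode produces here.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "A \x{2014} B";   # U+2014 EM DASH has no koi8-r code point

# LEAVE_SRC keeps $text intact; without it a CHECK value lets encode() modify its argument.
my $xml  = encode('koi8-r', $text, Encode::FB_XMLCREF()  | Encode::LEAVE_SRC());
my $html = encode('koi8-r', $text, Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());

print "$xml\n";    # A &#x2014; B
print "$html\n";   # A &#8212; B
```

Both results are plain ASCII here, so they are valid in any Cyrillic single-byte charset and can be stored without loss.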

$CCCP::Encode::ToText

Default is false.

If $CCCP::Encode::ToText is false, utf2cyrillic returns the converted string with unknown characters replaced according to your definition (see $CCCP::Encode::CharMap) or, failing that, with HTML entities from HTML::Entities.

If $CCCP::Encode::ToText is true, utf2cyrillic returns the converted string in plain-text format, with unknown characters replaced according to your definition (see $CCCP::Encode::CharMap) or, failing that, by Text::Unidecode.

$CCCP::Encode::CharMap

Default is empty hashref.

You can define a custom map for any characters. This is very flexible if you need custom replacements (different from those of HTML::Entities or Text::Unidecode). Example:

    $CCCP::Encode::CharMap = {
        "\x{2014}" => '-',
        "\x{2015}" => 'foo'
    };
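
The idea of "custom map first, generic fallback second" can be sketched with core Encode as follows. The helper name encode_with_map is hypothetical and is not part of CCCP::Encode; how the module applies its map internally is not documented here, so treat this only as one possible approach.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper, not part of CCCP::Encode.
sub encode_with_map {
    my ($text, $to, $map) = @_;
    my $enc = find_encoding($to) or die "unknown encoding: $to";

    # Apply the user-defined replacements first, on a private copy.
    my $copy = $text;
    $copy =~ s/(.)/exists $map->{$1} ? $map->{$1} : $1/ges;

    # Anything still unmappable becomes a numeric character reference.
    return $enc->encode($copy, Encode::FB_XMLCREF());
}

my $bytes = encode_with_map(
    "a \x{2014} b \x{2015} c \x{2122}", 'koi8-r',
    { "\x{2014}" => '-', "\x{2015}" => 'foo' },
);
print "$bytes\n";   # a - b foo c &#x2122;
```

U+2122 is not in the map, so it falls through to the generic fallback, while the two mapped characters get their custom replacements.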

$CCCP::Encode::Regexp

The default value is [^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}], which matches (and so replaces) any character outside the Cyrillic and Latin ranges. You can override this expression.

See more at http://www.regular-expressions.info/unicode.html
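
To check which characters the default expression selects, it can be applied directly with a plain Perl match. This sketch copies the pattern from the text above and uses a made-up sample string; it does not load CCCP::Encode.

```perl
use strict;
use warnings;
use utf8;   # the Cyrillic literal below is UTF-8 source text

# Default pattern as documented for $CCCP::Encode::Regexp.
my $re = qr/[^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}]/;

# Cyrillic word, em dash (U+2014), ASCII word, a katakana character (U+30DE).
my $text = "хлеб \x{2014} beer \x{30DE}";

my @hits = $text =~ /($re)/g;
print join(', ', map { sprintf 'U+%04X', ord } @hits), "\n";   # U+2014, U+30DE
```

Only the em dash and the katakana character are selected; Cyrillic letters, ASCII letters, and spaces are left alone.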

OVERHEAD

    CCCP::Encode with $CCCP::Encode::Entities eq "html":  
        2 wallclock secs ( 1.63 usr +  0.01 sys =  1.64 CPU) @ 60975.61/s (n=100000)
    
    CCCP::Encode with $CCCP::Encode::Entities eq "xml":  
        3 wallclock secs ( 2.49 usr +  0.00 sys =  2.49 CPU) @ 40160.64/s (n=100000)
    
    CCCP::Encode with $CCCP::Encode::ToText eq "1":  
        4 wallclock secs ( 3.85 usr +  0.02 sys =  3.87 CPU) @ 25839.79/s (n=100000)
            
    Encode::from_to(...) :  
        2 wallclock secs ( 1.93 usr +  0.01 sys =  1.94 CPU) @ 51546.39/s (n=100000)
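
The script that produced these figures is not shown. As an assumption about how a comparable baseline could be measured, the sketch below times Encode::from_to with the core Benchmark module. The sample string is made up and the iteration count is reduced, so the numbers will not match the table above.

```perl
use strict;
use warnings;
use utf8;
use Benchmark qw(timeit timestr);
use Encode qw(encode);

# UTF-8 octets of a short sample string (illustrative only).
my $octets = encode('UTF-8', "получится \x{2014} ПИВО");
my $n      = 10_000;   # the figures above used n=100000

my $t = timeit($n, sub {
    my $copy = $octets;                         # from_to converts in place
    Encode::from_to($copy, 'utf-8', 'koi8-r');
});
print "Encode::from_to: ", timestr($t), "\n";
```

A call to CCCP::Encode->utf2cyrillic could be timed the same way by swapping the body of the anonymous sub.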

SEE ALSO

  • Encode

  • Text::Unidecode

AUTHOR

Ivan Sivirinov