CCCP::Encode - Perl extension for converting UTF-8 text to any Cyrillic encoding (koi8-r, windows-1251, etc.)
Version 0.03
    use CCCP::Encode;

    $CCCP::Encode::ToText = 0;          # default
    $CCCP::Encode::Entities = 'xml';    # default

    my $str = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
    print CCCP::Encode->utf2cyrillic($str,'koi8-r');
    # output in koi8-r:
    # если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО

    my $han = "Иероглифы: 牡 マ キ グ ナ ル フ";
    print CCCP::Encode->utf2cyrillic($han,'windows-1251');
    # output in windows-1251:
    # Иероглифы: 牡 マ キ グ ナ ル フ

    # --------------------------

    $CCCP::Encode::ToText = 0;          # default
    $CCCP::Encode::Entities = 'html';

    print CCCP::Encode->utf2cyrillic($str,'koi8-r');
    # output in koi8-r:
    # если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО

    print CCCP::Encode->utf2cyrillic($han,'windows-1251');
    # output in windows-1251:
    # Иероглифы: 牡 マ キ グ ナ ル フ

    # --------------------------

    $CCCP::Encode::ToText = 1;

    print CCCP::Encode->utf2cyrillic($str,'koi8-r');
    # output in koi8-r:
    # если в слове 'хлеб' поменять 4 буквы, то получится -- ПИВО

    $CCCP::Encode::CharMap = {"\x{2014}" => '-'};
    print CCCP::Encode->utf2cyrillic($str,'koi8-r');
    # output in koi8-r:
    # если в слове 'хлеб' поменять 4 буквы, то получится - ПИВО
This module converts a UTF-8 string to a Cyrillic encoding in one of two modes:
convert to a Cyrillic string with HTML entities,
convert to a Cyrillic string that contains only plain-text characters.
By default, characters unknown to the target encoding are replaced using HTML::Entities (for HTML entities) or Text::Unidecode (for plain text). You can override the mapping for any character, and you can override the regexp that selects which characters are replaced.
Ajax libraries on the frontend send data in UTF-8. If your backend runs on koi8-r, windows-1251, etc., you have a problem:
    use Encode;
    ...
    my $data = $post->param('any');
    # $data = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
    Encode::from_to($data,'utf-8','koi8-r');
    print $data;
    # output:
    # если в слове 'хлеб' поменять 4 буквы, то получится ? ПИВО
The from_to method of the Encode module replaces every unknown character with '?'. This data is then saved to your database, and you end up writing guano-magic code to fix the problem. Every developer whose database is not in UTF-8 knows about this problem.
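The substitution described above can be reproduced with the core Encode module alone. The following is a minimal sketch, not CCCP::Encode's implementation: it encodes a character string containing an em dash (U+2014) to koi8-r twice, once with the default behaviour and once with Encode's own FB_XMLCREF fallback, which emits numeric character references instead of '?'. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# The tail of the sample sentence as a character string;
# \x{2014} is the em dash, which koi8-r cannot represent.
my $chars = "\x{43F}\x{43E}\x{43B}\x{443}\x{447}\x{438}\x{442}\x{441}\x{44F}"
          . " \x{2014} \x{41F}\x{418}\x{412}\x{41E}";

# Default behaviour: unmappable characters become the substitution character.
my $lossy = Encode::encode('koi8-r', $chars);

# Core fallback: unmappable characters become XML numeric character references.
# LEAVE_SRC keeps $chars intact, because a CHECK argument otherwise consumes it.
my $xmlref = Encode::encode(
    'koi8-r', $chars,
    Encode::FB_XMLCREF() | Encode::LEAVE_SRC(),
);

# Decode both results back to characters so they can be inspected.
my $lossy_chars  = Encode::decode('koi8-r', $lossy);
my $xmlref_chars = Encode::decode('koi8-r', $xmlref);

binmode(STDOUT, ':encoding(UTF-8)');
print "$lossy_chars\n";
print "$xmlref_chars\n";
```

The first printed line carries '?' where the dash was, so the original character cannot be recovered later; the second keeps it as a reference that a browser can still render.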
Another case:
Getting data from RSS channels in UTF-8 and saving it in a Cyrillic database (for example, MySQL with a default charset of koi8-r or windows-1251).
CCCP::Encode fixes this problem.
utf2cyrillic($str, $to): $str is the string to convert. $to is the target encoding name, the same as $to in Encode::from_to($str,'utf-8',$to).
$CCCP::Encode::Entities is ignored if $CCCP::Encode::ToText is true. The default value is 'xml'. In 'xml' mode, every character unknown to the target charset is replaced with a valid XML numeric entity (e.g. for —). In 'html' mode, every character unknown to the target charset is replaced with an HTML numeric entity (e.g. for —).
$CCCP::Encode::ToText defaults to false.
If $CCCP::Encode::ToText is false, then utf2cyrillic returns the converted string with unknown characters replaced by your own definitions (see $CCCP::Encode::CharMap) or by HTML entities from HTML::Entities.
If $CCCP::Encode::ToText is true, then utf2cyrillic returns the converted string in plain-text form, with unknown characters replaced by your own definitions (see $CCCP::Encode::CharMap) or by Text::Unidecode.
$CCCP::Encode::CharMap defaults to an empty hashref.
You can define a custom map for any characters. This is very flexible if you need a replacement different from what HTML::Entities or Text::Unidecode produces. Example:
    $CCCP::Encode::CharMap = {
        "\x{2014}" => '-',
        "\x{2015}" => 'foo'
    };
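The lookup order described above (custom map first, generic fallback second) can be sketched in plain Perl. This is an illustration under assumptions, not the module's code: replace_unknown is a hypothetical helper, the fallback here is a hex numeric entity rather than HTML::Entities or Text::Unidecode, and the character class is a simplified stand-in for the module's default.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CCCP::Encode: every character matched
# by $unknown_re is replaced from $char_map if an entry exists, and by a
# hexadecimal numeric entity otherwise.
sub replace_unknown {
    my ($text, $char_map, $unknown_re) = @_;
    $text =~ s{($unknown_re)}{
        exists $char_map->{$1}
            ? $char_map->{$1}
            : sprintf('&#x%X;', ord $1)
    }ge;
    return $text;
}

my $char_map   = { "\x{2014}" => '-', "\x{2015}" => 'foo' };
my $unknown_re = qr/[^\p{Cyrillic}\p{Latin}\p{InBasicLatin}]/;   # assumed class

# U+2014 and U+2015 are in the map; U+7261 is not and falls back.
my $out = replace_unknown("a \x{2014} b \x{2015} c \x{7261}", $char_map, $unknown_re);
print "$out\n";   # a - b foo c &#x7261;
```

Keeping the map lookup ahead of the fallback is what lets a single entry override the generic replacement without touching anything else.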
The default replacement regexp is [^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}], which matches any character that is neither Cyrillic nor Latin. You can override this expression.
See more on http://www.regular-expressions.info/unicode.html
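To see which characters the default character class selects, it can be applied directly with core Perl. This is a small sketch with an illustrative sample string; it only exercises the regexp itself and does not call the module. Note that '|' inside a bracketed class is a literal character rather than alternation, and it is already covered by the Basic Latin block.

```perl
use strict;
use warnings;

# The default class quoted above, copied verbatim.
my $default_re = qr/[^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}]/;

# Sample: three Cyrillic letters, ASCII punctuation, a Han character,
# a katakana character, "caf" plus U+00E9 (Latin script), and an em dash.
my $text = "\x{418}\x{435}\x{440}: \x{7261} \x{30DE} caf\x{E9} \x{2014}";

# Collect the code points of every character the class matches.
my @unknown = map { sprintf('U+%04X', ord($_)) } $text =~ /($default_re)/g;

print join(', ', @unknown), "\n";   # U+7261, U+30DE, U+2014
```

Cyrillic letters, ASCII and accented Latin letters pass through untouched; only the Han, katakana and dash characters are selected for replacement.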
    CCCP::Encode with $CCCP::Encode::Entities eq "html":
        2 wallclock secs ( 1.63 usr + 0.01 sys = 1.64 CPU) @ 60975.61/s (n=100000)
    CCCP::Encode with $CCCP::Encode::Entities eq "xml":
        3 wallclock secs ( 2.49 usr + 0.00 sys = 2.49 CPU) @ 40160.64/s (n=100000)
    CCCP::Encode with $CCCP::Encode::ToText eq "1":
        4 wallclock secs ( 3.85 usr + 0.02 sys = 3.87 CPU) @ 25839.79/s (n=100000)
    Encode::from_to(...):
        2 wallclock secs ( 1.93 usr + 0.01 sys = 1.94 CPU) @ 51546.39/s (n=100000)
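Timings in this format come from Perl's core Benchmark module. The sketch below shows how the Encode::from_to baseline could be measured; the sample string and iteration count are assumptions, and the author's actual benchmark script is not shown in this document. The CCCP::Encode variants could be timed the same way by swapping the body of the code reference.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);
use Encode ();

# UTF-8 bytes of a short Cyrillic sample containing an em dash (assumed input).
my $utf8_bytes = Encode::encode('UTF-8', "\x{43F}\x{438}\x{432}\x{43E} \x{2014} ok");

my $count = 100_000;   # same n as in the figures above

# from_to converts in place, so each iteration works on a fresh copy.
my $t = timeit($count, sub {
    my $copy = $utf8_bytes;
    Encode::from_to($copy, 'utf-8', 'koi8-r');
});

print "Encode::from_to(...): ", timestr($t), "\n";
```

Absolute numbers depend on the machine and Perl version, so only the relative ordering of the variants is meaningful.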
Ivan Sivirinov
To install CCCP::Encode, copy and paste the appropriate command into your terminal.
cpanm
cpanm CCCP::Encode
CPAN shell
perl -MCPAN -e shell
install CCCP::Encode
For more information on module installation, please visit the detailed CPAN module installation guide.