NAME

Encode::Supported -- Encodings supported by Encode

DESCRIPTION

Encoding Names

Encoding names are case insensitive. White space in names is ignored. In addition, an encoding may have aliases. Each encoding has one "canonical" name, chosen from the names of the encoding by picking the first that applies in the following sequence:

       o The MIME name as defined in IETF RFCs.
       o The name in the IANA registry.
       o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case encodings have state, "Encode" uses the encoding object internally once an operation is in progress.
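The name lookup described above can be tried out with find_encoding(), which Encode exports by default and which returns the encoding object. The following is only a minimal sketch: the helper canonical_name() and the sample spellings are made up for this illustration and are not part of Encode.

```perl
use strict;
use warnings;
use Encode;    # exports find_encoding() by default

# canonical_name() is a hypothetical helper for this example: it returns
# the canonical name of an encoding, or undef if the name is unknown.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # encoding object, or undef
    return defined $enc ? $enc->name : undef;
}

# Different spellings of the same encoding resolve to one canonical name.
my @spellings = ('latin1', 'Latin1', 'ISO-8859-1', 'ISO 8859 1');
my %canonical = map { $_ => canonical_name($_) } @spellings;

for my $spelling (@spellings) {
    printf "%-12s => %s\n", $spelling, $canonical{$spelling} // '(unknown)';
}
```

Code that converts a lot of data may prefer to keep the object returned by find_encoding() and call its encode and decode methods directly, so the name is resolved only once.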

Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized. Unless otherwise specified, they are all case insensitive (via alias) and all occurrences of spaces are replaced with '-'. In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules, but in most cases you don't have to use Encode::XX to make them available. Encode.pm automatically loads those modules on demand.
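As a rough illustration of this on-demand loading, the sketch below asks for euc-kr without ever saying "use Encode::KR", and compares the list of loaded encodings before and after. The variable names and the sample character are chosen for this example only.

```perl
use strict;
use warnings;
use Encode;    # deliberately no "use Encode::KR"

# Encode->encodings() lists the encodings that are loaded right now.
my %loaded_before = map { $_ => 1 } Encode->encodings();

# Asking for euc-kr makes Encode load the implementing module itself.
my $bytes = encode('euc-kr', "\x{D55C}");    # U+D55C HANGUL SYLLABLE HAN

my %loaded_after = map { $_ => 1 } Encode->encodings();

print "euc-kr loaded before: ", ($loaded_before{'euc-kr'} ? "yes" : "no"), "\n";
print "euc-kr loaded after:  ", ($loaded_after{'euc-kr'}  ? "yes" : "no"), "\n";
print "encoded length in bytes: ", length($bytes), "\n";

# Encode->encodings(':all') also includes encodings that are not loaded yet.
my @all = Encode->encodings(':all');
print scalar(@all), " encodings known in total\n";
```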

Built-in Encodings

The following encodings are always available.

  Canonical     Aliases
  -----------------------
  iso-8859-1    latin1
  US-ascii      ascii
  UCS-2         ucs2, iso-10646-1
  UCS-2le
  UTF-8         utf8
  -----------------------

Encode::Byte

The following encodings are single-byte encodings implemented as extensions of ASCII. In most cases they use \x80-\xff (the upper half) to map non-ASCII characters.

  -----------------------
  iso-8859-1    latin1
  iso-8859-2    latin2
  iso-8859-3    latin3
  iso-8859-4    latin4
  iso-8859-5
  iso-8859-6
  iso-8859-7
  iso-8859-8
  iso-8859-9    latin5
  iso-8859-10   latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13   latin7
  iso-8859-14   latin8
  iso-8859-15   latin9
  iso-8859-16   latin10

  koi8-f
  koi8-r
  koi8-u

  viscii        # ASCII + Vietnamese

  cp1250        WinLatin2
  cp1251        WinCyrillic
  cp1252        WinLatin1
  cp1253        WinGreek
  cp1254        WinTurkish
  cp1255        WinHebrew
  cp1256        WinArabic
  cp1257        WinBaltic
  cp1258        WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean  
  maccroatian
  macroman
  maccyrillic
  macromanian
  macdingbats       
  macsami
  macgreek 
  macthai
  macicelandic    
  macturkish
  macukraine
  -----------------------
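To make the "upper half" idea concrete, here is a small sketch that encodes the same two characters with three of the single-byte encodings listed above. The choice of sample character and the %hex hash are assumptions of this example. The ASCII letter keeps its byte value everywhere, while the Cyrillic letter becomes a single byte above \x7f whose value depends on the encoding.

```perl
use strict;
use warnings;
use Encode;

my $zhe = "\x{0416}";    # U+0416 CYRILLIC CAPITAL LETTER ZHE

my %hex;
for my $name (qw(iso-8859-5 koi8-r cp1251)) {
    # "A" is plain ASCII; the second character is outside ASCII.
    my $bytes = encode($name, "A$zhe");

    # Show every resulting byte as two hex digits.
    $hex{$name} = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
    printf "%-10s %s\n", $name, $hex{$name};
}
# prints:
#   iso-8859-5 41 B6
#   koi8-r     41 F6
#   cp1251     41 C6
```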

The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset" below. These encodings are implemented in a separate module per language because of size concerns. See the perldoc of each module as well.

Encode::CN -- Continental China

  -----------------------
  cp936      gbk                    
  euc-cn
  gb12345
  gb2312
  hz
  iso-ir-165
  -----------------------

Encode::JP -- Japan

  -----------------------
  7bit-jis        jis
  cp932
  euc-jp          ujis
  iso-2022-jp
  macjapan
  shiftjis        Shift_JIS, sjis
  -----------------------

Encode::KR -- Korea

  -----------------------
  euc-kr
  ksc5601
  cp949
  -----------------------

Encode::TW -- Taiwan

  -----------------------
  big5
  big5-hkscs
  cp950
  -----------------------

Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are distributed separately on CPAN under the name Encode::HanExtra.

  -----------------------
  gb18030
  euc-tw
  big5plus
  -----------------------
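As a brief sketch of how the multibyte encodings listed above behave, the code below encodes one Japanese character with three encodings from Encode::JP and decodes each result again. The sample character and the %length hash are assumptions of this example; only the encodings bundled with Encode are used, so Encode::HanExtra is not required.

```perl
use strict;
use warnings;
use Encode;

my $text = "\x{3042}";    # U+3042 HIRAGANA LETTER A

my %length;
for my $name (qw(euc-jp shiftjis iso-2022-jp)) {
    my $bytes = encode($name, $text);
    $length{$name} = length $bytes;

    # Decoding with the same encoding should give the original text back.
    my $round_trip = decode($name, $bytes);
    printf "%-12s %d bytes, round trip %s\n",
        $name, $length{$name}, ($round_trip eq $text ? 'ok' : 'FAILED');
}
```

The iso-2022-jp result is longer than the other two because that encoding wraps the character in escape sequences that switch charsets.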

Miscellaneous encodings

Encode::EBCDIC

See perlebcdic for details.

  -----------------------
  cp1047
  cp37
  posix-bc
  -----------------------

Encode::Symbol

For symbols and dingbats.

  -----------------------
  symbol
  dingbats
  -----------------------

Encoding vs. Charset

Character encoding (or just "encoding") and character set (or just "charset") are often used interchangeably, but they are different concepts.

A charset determines which characters may be included in a given text.

An encoding maps the characters of one or more charsets to a stream of bits.

Note that a given encoding may contain multiple charsets. For instance, euc-jp contains ASCII, JIS X 0201 (Hankaku Kana), JIS X 0208 (Zenkaku Kana and Kanji) and JIS X 0212 (Extended Kanji) in a single encoding.

As the name suggests, the Encode module supports encodings, not individual charsets.
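The euc-jp case can be sketched as follows: one sample character from each of the four charsets is encoded with the single encoding euc-jp, and the resulting bytes are printed. The sample characters and the %width hash are picked for this illustration only.

```perl
use strict;
use warnings;
use Encode;

# One sample character from each charset carried by euc-jp.
my @samples = (
    [ 'ASCII',      "A"        ],    # U+0041
    [ 'JIS X 0201', "\x{FF71}" ],    # halfwidth katakana A
    [ 'JIS X 0208', "\x{6F22}" ],    # the kanji "kan" as in "kanji"
    [ 'JIS X 0212', "\x{4E02}" ],    # a kanji found only in JIS X 0212
);

my %width;
for my $sample (@samples) {
    my ($charset, $char) = @$sample;
    my $bytes = encode('euc-jp', $char);
    $width{$charset} = length $bytes;

    # Print the code point and the bytes euc-jp assigns to it.
    printf "%-11s U+%04X -> %s\n", $charset, ord($char),
        join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}
```

The number of bytes per character differs by charset, which is how a decoder tells the charsets apart within one byte stream.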

Encoding Classification (by Anton Tagunov)

Encodings

  US-ASCII    UTF-8       KOI8-R      ISO-8859-*
  ISO-2022-CN ISO-2022-JP Big5
  EUC-CN      EUC-JP      EUC-KR

are registered with IANA (<http://www.iana.org/assignments/character-sets>) as preferred MIME names and can probably be used over the Internet. So is

  Shift_JIS

but despite its widespread use it used to bear the label of being Microsoft proprietary. Shift_JIS is now official as of JIS X 0208-1997.

  UTF-16      KOI8-U

are IANA-registered preferred MIME names but should probably be avoided as encodings for web pages due to lack of browser support.

  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
  ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html)
  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345      (only planes 1 and 2 available)
  GB 18030
  CNS 11643

are totally valid encodings but not registered at IANA.

  BIG5PLUS
  EUC-JP-0212   (Encode::lib::Encode::Tcl::Extended)

are somewhat proprietary.

You can find more information on CJK encodings at:

       o A brief description of most of the CJK encodings mentioned here: http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
       o Several years old, but still useful: http://www.oreilly.com/people/authors/lunde/cjk_inf.html
       o Some in-depth reading for the heroes :-) (ECMA-35, equivalent to ISO-2022): http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM

See Also

Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW, Encode::EBCDIC, Encode::Symbol