NAME

Unicode::Normalize - normalized forms of Unicode text

SYNOPSIS

  use Unicode::Normalize;

  $string_NFD  = NFD($raw_string);  # Normalization Form D
  $string_NFC  = NFC($raw_string);  # Normalization Form C
  $string_NFKD = NFKD($raw_string); # Normalization Form KD
  $string_NFKC = NFKC($raw_string); # Normalization Form KC

   or

  use Unicode::Normalize 'normalize';

  $string_NFD  = normalize('D',  $raw_string);  # Normalization Form D
  $string_NFC  = normalize('C',  $raw_string);  # Normalization Form C
  $string_NFKD = normalize('KD', $raw_string);  # Normalization Form KD
  $string_NFKC = normalize('KC', $raw_string);  # Normalization Form KC

DESCRIPTION

Normalization

$string_NFD = NFD($raw_string)

returns the Normalization Form D (formed by canonical decomposition).

$string_NFC = NFC($raw_string)

returns the Normalization Form C (formed by canonical decomposition followed by canonical composition).

$string_NFKD = NFKD($raw_string)

returns the Normalization Form KD (formed by compatibility decomposition).

$string_NFKC = NFKC($raw_string)

returns the Normalization Form KC (formed by compatibility decomposition followed by canonical composition).
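
As an illustration of how the four forms differ, here is a minimal sketch. The sample characters and the codepoints() helper are chosen for this example only and are not part of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize;   # NFD, NFC, NFKD, NFKC are exported by default

# Hypothetical helper (not provided by the module): show a string as code points.
sub codepoints {
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0];
}

my $e_acute = "\x{00E9}";   # LATIN SMALL LETTER E WITH ACUTE (canonical decomposition)
my $fi      = "\x{FB01}";   # LATIN SMALL LIGATURE FI (compatibility decomposition only)

my $nfd_e   = NFD($e_acute);            # canonical decomposition
my $nfc_e   = NFC("e\x{0301}");         # decomposition, then canonical composition
my $nfd_fi  = NFD($fi);                 # left alone by canonical decomposition
my $nfkd_fi = NFKD($fi);                # compatibility decomposition
my $nfkc    = NFKC($fi . "e\x{0301}");  # compatibility decomposition, then composition

print codepoints($nfd_e),   "\n";   # U+0065 U+0301
print codepoints($nfc_e),   "\n";   # U+00E9
print codepoints($nfd_fi),  "\n";   # U+FB01
print codepoints($nfkd_fi), "\n";   # U+0066 U+0069
print codepoints($nfkc),    "\n";   # U+0066 U+0069 U+00E9
```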

$normalized_string = normalize($form_name, $raw_string)

$form_name must be one of the following names.

  'C'  or 'NFC'  for Normalization Form C
  'D'  or 'NFD'  for Normalization Form D
  'KC' or 'NFKC' for Normalization Form KC
  'KD' or 'NFKD' for Normalization Form KD
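
normalize() is convenient when the form is chosen at run time. The following sketch loops over the short form names; the input string and variable names are assumptions made for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize NFKC);   # normalize is exported on request

# LATIN SMALL LIGATURE FI followed by LATIN SMALL LETTER E WITH ACUTE
my $raw = "\x{FB01}\x{00E9}";

# Normalize the same input under each form name.
my %result = map { $_ => normalize($_, $raw) } qw(C D KC KD);

for my $form (qw(C D KC KD)) {
    printf "%-2s => %d character(s)\n", $form, length $result{$form};
}

# The short and long names select the same form, which is also
# what the dedicated function returns.
my $same = normalize('KC', $raw) eq normalize('NFKC', $raw)
        && normalize('KC', $raw) eq NFKC($raw);
```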

Character Data

These functions provide an interface to the character data used internally. If you only want to obtain Unicode normalization forms, you do not need to call them yourself.

$canonical_decomposed = getCanon($codepoint)
$compatibility_decomposed = getCompat($codepoint)

If the character of the specified codepoint has a canonical decomposition (for getCanon) or a compatibility decomposition (for getCompat), including Hangul Syllables, returns the completely decomposed string equivalent to it.

e.g. getCanon(0x1F82) returns the full decomposition "\x{03B1}\x{0313}\x{0300}\x{0345}", not the one-step decomposition "\x{1F02}\x{0345}", where

    1F82; GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
    1F02; GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA
    03B1; GREEK SMALL LETTER ALPHA
    0313; COMBINING COMMA ABOVE
    0300; COMBINING GRAVE ACCENT
    0345; COMBINING GREEK YPOGEGRAMMENI

If it is not decomposable, returns undef.
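
A short sketch of both lookups follows. The code points other than 0x1F82 are picked for this example, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCanon getCompat);   # exported on request

# Full canonical decomposition of U+1F82 (four code points).
my $greek  = getCanon(0x1F82);

# A Hangul syllable decomposes into its conjoining jamo.
my $hangul = getCanon(0xAC00);    # HANGUL SYLLABLE GA

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility decomposition.
my $lig_canon  = getCanon(0xFB01);    # undef
my $lig_compat = getCompat(0xFB01);   # "fi"

# A character with no decomposition at all.
my $plain = getCanon(0x0041);         # undef for LATIN CAPITAL LETTER A

printf "U+1F82 decomposes into %d code points\n", length $greek;
print  "U+FB01 has no canonical decomposition\n" unless defined $lig_canon;
```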

$uv_composite = getComposite($uv_here, $uv_next)

If the two characters here and next (given as codepoints) are composable (including Hangul Jamo/Syllables and Exclusions), returns the codepoint of the composite.

e.g. getComposite(0x0041, 0x0300) returns 0x00C0, where

    00C0; LATIN CAPITAL LETTER A WITH GRAVE
    0041; LATIN CAPITAL LETTER A
    0300; COMBINING GRAVE ACCENT

If they are not composable, returns undef.
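
The sketch below pairs code points and checks the result. The Hangul pairs and the non-composable pair are examples chosen here, not taken from the module's documentation.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getComposite);   # exported on request

# LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT
my $a_grave = getComposite(0x0041, 0x0300);   # 0x00C0

# Hangul: leading consonant + vowel, then the resulting syllable + trailing consonant.
my $lv  = getComposite(0x1100, 0x1161);       # 0xAC00
my $lvt = getComposite($lv, 0x11A8);          # 0xAC01

# Two plain letters do not compose.
my $none = getComposite(0x0041, 0x0042);      # undef

printf "U+%04X\n", $a_grave;   # U+00C0
printf "U+%04X\n", $lvt;       # U+AC01
print  "not composable\n" unless defined $none;
```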

$combining_class = getCombinClass($codepoint)

Returns the combining class of the character as an integer.
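
Combining classes determine the order of combining marks in the normalized forms: marks with a lower non-zero class sort first. A minimal sketch, with code points chosen for this example:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD getCombinClass);   # getCombinClass is exported on request

my $cc_letter = getCombinClass(0x0041);   # LATIN CAPITAL LETTER A: 0 (a starter)
my $cc_grave  = getCombinClass(0x0300);   # COMBINING GRAVE ACCENT: 230 (above)
my $cc_dot    = getCombinClass(0x0323);   # COMBINING DOT BELOW: 220 (below)

# Normalization reorders adjacent marks by combining class, so the
# dot below (220) moves in front of the grave accent (230).
my $reordered = NFD("a\x{0300}\x{0323}");

print "$cc_letter $cc_grave $cc_dot\n";   # 0 230 220
```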

$is_exclusion = isExclusion($codepoint)

Returns a boolean indicating whether the character of the specified codepoint is a composition exclusion.
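
A composition exclusion has a canonical decomposition but is never produced by canonical composition, so it stays decomposed in Normalization Form C. The sketch below uses U+0958 DEVANAGARI LETTER QA, an example chosen here and assumed to be listed among the composition exclusions in the Unicode data the module was built with.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC isExclusion);   # isExclusion is exported on request

my $qa_excluded = isExclusion(0x0958);   # DEVANAGARI LETTER QA: true
my $a_excluded  = isExclusion(0x00C0);   # LATIN CAPITAL LETTER A WITH GRAVE: false

# Because U+0958 is excluded, NFC leaves it as KA + NUKTA.
my $nfc_qa = NFC("\x{0958}");

print "U+0958 is a composition exclusion\n" if $qa_excluded;
printf "NFC of U+0958 has %d code points\n", length $nfc_qa;   # 2
```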

EXPORT

NFC, NFD, NFKC, NFKD: by default.

normalize and some other functions: on request.

AUTHOR

SADAHIRO Tomoyuki, <SADAHIRO@cpan.org>

  http://homepage1.nifty.com/nomenclator/perl/

  Copyright(C) 2001, SADAHIRO Tomoyuki. Japan. All rights reserved.

  This program is free software; you can redistribute it and/or 
  modify it under the same terms as Perl itself.

SEE ALSO

http://www.unicode.org/unicode/reports/tr15/

Unicode Normalization Forms - UAX #15

Lingua::KO::Hangul::Util

utility functions for Hangul Syllables