Unicode::Decompose - Unicode decomposition and normalization


  use Unicode::Decompose qw(normalize order decompose recompose normalize_d);

  $norm   = normalize($string);
  # OR, one stage at a time:

  $decomp  = decompose($string);
  $ordered = order($decomp);
  $norm    = recompose($ordered);


This module implements Unicode normalization forms D and C.

These forms matter when comparing Unicode strings. Consider the two strings "\N{LATIN SMALL LETTER E WITH ACUTE}" and "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}". From one point of view (simply looking at the characters and the bytes in the string) they're different; from another (looking at the meaning of the characters) they're the same.

Normalization is the process, described in Unicode Technical Report #15, by which these two strings are made equal. Two normalization forms particularly concern us: Unicode Normalization Form D is the "weaker" form, which only decomposes and reorders, and Form C is the "stronger" form, which also recomposes.
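The comparison problem can be seen in a few lines. This is a minimal sketch that uses the core Unicode::Normalize module as a stand-in (an assumption; it is a different module from the one documented here, chosen only because its NFD function is a well-known implementation of Form D):

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFD);   # stand-in for this module's Form D

my $precomposed = "\N{LATIN SMALL LETTER E WITH ACUTE}";
my $combining   = "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}";

# Character by character the strings differ: one code point versus two.
my $raw_equal = $precomposed eq $combining ? 1 : 0;

# After normalizing both sides to Form D they compare equal.
my $norm_equal = NFD($precomposed) eq NFD($combining) ? 1 : 0;

print "raw equal: $raw_equal, normalized equal: $norm_equal\n";
```

Normalizing both operands before every comparison is the safe habit; normalizing only one side works only if the other is already known to be in the same form.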

Both forms share two stages. In the first stage, the data is "pulled apart", or decomposed: precomposed characters such as "LATIN SMALL LETTER E WITH ACUTE" are split into a main character and the combining characters that follow it. In the second stage, the combining characters are ordered according to a list of priorities (the canonical combining classes) defined in the Unicode Character Database. After these two stages both of our example strings read "LATIN SMALL LETTER E, COMBINING ACUTE ACCENT", and hence compare equal.
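The two shared stages can be sketched separately. The example below again leans on the core Unicode::Normalize module (an assumption, used as a stand-in; its decompose and reorder functions are not guaranteed to behave identically to this module's decompose and order). The helper code_points is hypothetical and exists only to print the results.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(decompose reorder getCombinClass);

# U+1E0B LATIN SMALL LETTER D WITH DOT ABOVE, then U+0323 COMBINING DOT BELOW
my $string = "\x{1E0B}\x{0323}";

# Hypothetical helper: render a string as a list of code points.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

# Stage 1: pull precomposed characters apart, keeping the original order.
my $decomposed = decompose($string);

# Stage 2: sort each run of combining characters by combining class.
my $ordered = reorder($decomposed);

print code_points($decomposed), "\n";   # U+0064 U+0307 U+0323
print code_points($ordered),    "\n";   # U+0064 U+0323 U+0307

# The "priorities": dot below (class 220) sorts before dot above (class 230).
printf "U+0323 => %d, U+0307 => %d\n",
    getCombinClass(0x0323), getCombinClass(0x0307);
```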

Unicode Normalization Form C then takes the resulting string and "pushes together" the data, recomposing it: characters may be returned to precomposed forms. Because the combining characters have been rearranged, these might not be the same precomposed characters as in the original string.
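A case where recomposition yields a different precomposed character than the one you started with is shown below. As before, this is a sketch built on the core Unicode::Normalize module (an assumption; it stands in for this module's recompose and normalize):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC compose);

# D WITH DOT ABOVE (U+1E0B) followed by COMBINING DOT BELOW (U+0323)
my $original = "\x{1E0B}\x{0323}";

# Decomposed and ordered: d, dot below, dot above
my $ordered = NFD($original);

# Recomposition pairs d with the mark that now comes first, the dot below,
# producing D WITH DOT BELOW (U+1E0D) followed by COMBINING DOT ABOVE (U+0307).
my $recomposed = compose($ordered);

printf "%s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $recomposed;
# U+1E0D U+0307
```

The result is canonically equivalent to the input, but it is not the same sequence of code points, which is why both sides of a comparison need to go through the same form.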

Support for compatibility decomposition, which is considerably more relaxed about how characters decompose, is implemented at a very rough level but is not made available to the end user at this time. If you want it, you should be able to figure it out from the code.
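To illustrate what "more relaxed" means, here is a small sketch of the difference between canonical and compatibility decomposition. It uses NFD and NFKD from the core Unicode::Normalize module (an assumption; this module does not expose an equivalent function):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility decomposition.
my $ligature = "\x{FB01}";

# Canonical decomposition leaves the ligature untouched...
my $canonical = NFD($ligature);

# ...while compatibility decomposition splits it into plain "f" and "i".
my $compat = NFKD($ligature);

printf "canonical length: %d, compatibility form: %s\n",
    length($canonical), $compat;
```

Compatibility decomposition discards formatting distinctions such as ligatures, so it is lossy in a way that the canonical forms are not.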


Creating the initial data structures is slow. One possible fix is to move to Storable, so that the "cooked" data could be installed with the module.
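The Storable idea might look roughly like the following. Everything here is hypothetical except the Storable, File::Spec and File::Temp calls: build_table is a stand-in for the slow parsing step, the file name is invented, and the tiny table only mimics the shape such data could take.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical cache location; an installed module would ship this file.
my $cache_file = File::Spec->catfile(tempdir(CLEANUP => 1), 'decomp.sto');

# Stand-in for the slow step that builds the tables from the Unicode data.
sub build_table {
    return {
        0x00E9 => [ 0x0065, 0x0301 ],   # e with acute => e, combining acute
        0x1E0B => [ 0x0064, 0x0307 ],   # d with dot above => d, combining dot above
    };
}

# Load the "cooked" table if it exists, otherwise build and save it.
sub load_table {
    my ($file) = @_;
    return retrieve($file) if -e $file;
    my $table = build_table();
    store($table, $file);
    return $table;
}

my $first  = load_table($cache_file);   # builds the table and writes the cache
my $second = load_table($cache_file);   # reads the cache back
```

Storable files written with store depend on the byte order of the machine that wrote them; nstore writes a portable format, which would matter if the cooked data were shipped in the distribution rather than generated at install time.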


See the import list in the synopsis.


Simon Cozens,