NAME

docs/pdds/pdd28_character_sets.pod - Strings and character sets

ABSTRACT

This PDD describes the conventions expected for users of Parrot strings, including but not limited to support for multiple character sets, encodings and languages.

VERSION

$Revision: 26242 $

DESCRIPTION

Here is a summary of the design decisions described in this PDD.

  • Parrot supports multiple string formats, and so users of Parrot strings must be aware at all times of string encoding issues and how these relate to the string interface.

  • The native Parrot string format is an array of 32-bit Unicode codepoints in grapheme normalization form (NFG).

  • NFG is defined as a normalization which allocates at most one codepoint to each visible character.

  • An interface is defined for interacting with Parrot strings and converting between character sets and encodings.

Definitions

Here is a brief definition of some of the technical terms in this PDD. For a more detailed description, see any of the links at the bottom of this PDD.

character

A character is "the abstract description of a symbol". It's the smallest chunk of text a computer knows how to deal with. Of course internally to the computer, a character (just like everything else) is a number, so you need to define...

character set

A character set is officially a deprecated term, with Unicode preferring the concepts of character repertoire (a collection of characters) and character code (a mapping which tells you what number represents which character in the repertoire). We still use it, though, to mean the standard which defines both a repertoire and a code.

codepoint

A codepoint is the numeric representation of a character according to a given character set. So in ASCII, the character A has codepoint 0x41.

combining character

A combining character is a Unicode concept. It is a character which modifies the preceding character: for instance accents, lines, circles, and boxes which are not to be displayed on their own, but are to be "composed" with the preceding character.

grapheme

A grapheme is our concept. It is a character followed by all of its combining characters; in other words, one or more characters forming a visible whole when displayed. Parrot must support languages which manipulate strings grapheme-by-grapheme, and since graphemes are the highest-level interpretation of a "character", they're very useful for converting between character sets.

encoding

An encoding determines how a codepoint is represented inside a computer. Simple encodings like ASCII define that the codepoints 0-127 simply live as their numeric equivalents inside an eight-bit byte. Other encodings, such as UTF-16, use wider code units to encode a larger range of codepoints. Variable-width encodings like UTF-8 use one byte for codepoints 0-127, two bytes for codepoints 128-2047, and so on.
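As an illustration of what "variable-width" means in practice, here is a minimal sketch of a UTF-8 encoder. It is not part of Parrot's string API; it simply demonstrates how the byte count grows with the codepoint.

    #include <stdint.h>

    /* Sketch only: encode one Unicode codepoint as UTF-8.  buf must have
     * room for at least 4 bytes; returns the number of bytes written. */
    static int
    utf8_encode(uint32_t cp, unsigned char *buf)
    {
        if (cp < 0x80) {                       /* 1 byte:  0xxxxxxx          */
            buf[0] = (unsigned char)cp;
            return 1;
        }
        if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {                    /* 3 bytes                    */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));   /* 4 bytes, to 0x10FFFF */
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }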

Encoding awareness

Parrot was designed from the outset to support multiple string formats: multiple character sets and multiple encodings. Unlike other such projects, we don't standardize on Unicode internally. This is because, for the majority of use cases, it's still far more efficient to deal with whatever input data the user sends us, which, in the majority of cases, is something like ASCII - or at least some kind of byte-based rather than character-based encoding.

So internally, consumers of Parrot strings have to be aware that there is a plurality of string encodings going on inside Parrot. (Producers of Parrot strings can do whatever is most efficient for them.) The implications of this for the internal API will be detailed in the implementation section below, but to put it in simple terms: if you find yourself writing *s++ or any other C string idioms, you need to stop and think if that's what you really mean. Not everything is byte-based any more.

However, we're going to try to make it as easy for *s++-minded people as possible, and part of that is the declaration of a Parrot native string format. You don't have to use it, but if you do all your dreams will come true.

Native string format

Dealing with variable-byte encodings is not fun; for instance, you need to do a bunch of computations every time you traverse a string. In order to make programming a lot easier, we define a Parrot native string format to be an array of unsigned 32-bit Unicode codepoints. This is equivalent to UCS-4 except for the normalization form semantics described below.

This means that if you've done the necessary checks, and hence you know you're dealing with a Parrot native string, then you can continue to program in the usual C idioms - for the most part. Of course you'll need to be careful with your comparisons, since what you'll be getting back will be a Parrot_Rune instead of a char.
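For illustration, here is a minimal sketch of the kind of straightforward traversal this allows. The string layout shown (a bare array of Runes plus a length) is an assumption made for this example, not the actual Parrot STRING structure.

    #include <stdint.h>

    typedef uint32_t Parrot_Rune;     /* i.e. a Parrot_UInt4 codepoint */

    /* Hypothetical, simplified layout purely for this example. */
    typedef struct example_native_string {
        Parrot_Rune *runes;           /* 32-bit codepoints (or grapheme IDs) */
        uint32_t     len;             /* length in Runes */
    } example_native_string;

    /* Count occurrences of LATIN CAPITAL LETTER A.  Plain array indexing
     * works because every element is exactly one Parrot_Rune wide. */
    static uint32_t
    count_letter_a(const example_native_string *s)
    {
        uint32_t i, n = 0;
        for (i = 0; i < s->len; i++)
            if (s->runes[i] == 0x41)
                n++;
        return n;
    }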

Grapheme normalization form

Unicode characters can be expressed in a number of different ways according to the Unicode Standard. This is partly to do with maintaining compatibility with existing character encodings. For instance, in Serbo-Croatian and Slovenian, there's a letter which looks like an i without the dot but with two grave (`) accents. If you have an especially good POD renderer, you can see it here: ȉ.

There are two ways you can represent this in Unicode. You can use character 0x209, also known as LATIN SMALL LETTER I WITH DOUBLE GRAVE, which does the job all in one go. This is called a "composed" character, as opposed to its equivalent decomposed sequence: LATIN SMALL LETTER I (0x69) followed by COMBINING DOUBLE GRAVE ACCENT (0x30F).

Unicode standardizes, in a number of "normalization forms", which representation you should use. We're using an extension of Normalization Form C, which says, basically: decompose everything, then re-compose as much as you can. So if you see the integer stream 0x69 0x30F, it needs to be replaced by 0x209. This means that Parrot string data structures need to keep track of what normalization form a given string is in, and Parrot must provide functions to convert between normalization forms.
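To make the recomposition step concrete, here is a minimal sketch. The one-entry table is a hand-picked fragment for this example only, not real Unicode composition data, and the function name is hypothetical.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t base, combining, composed; } compose_entry;

    /* Illustrative fragment only: i + COMBINING DOUBLE GRAVE ACCENT
     * recomposes to LATIN SMALL LETTER I WITH DOUBLE GRAVE. */
    static const compose_entry compose_table[] = {
        { 0x69, 0x30F, 0x209 },
    };

    /* Return the precomposed codepoint for a base/combining pair, or 0
     * if no precomposed form exists and the pair stays decomposed. */
    static uint32_t
    compose_pair(uint32_t base, uint32_t combining)
    {
        size_t i;
        for (i = 0; i < sizeof compose_table / sizeof compose_table[0]; i++)
            if (compose_table[i].base == base
                && compose_table[i].combining == combining)
                return compose_table[i].composed;
        return 0;
    }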

Now, Serbo-Croat is sometimes also written with Cyrillic letters rather than Latin letters. The Cyrillic equivalent of the above character is not part of Unicode, but would be specified as the decomposed pair CYRILLIC SMALL LETTER I (0x438) followed by COMBINING DOUBLE GRAVE ACCENT (0x30F). (This PDD does not require Parrot to convert strings between differing political sensibilities.) However, even in NFC it remains two codepoints, yet it is still a single visible character as far as a human reader is concerned.

Hence we introduce the distinction between a "character" and a "grapheme". This is a Parrot distinction - it does not exist in the Unicode Standard.

When a regular expression engine from one of Parrot's target languages wishes to match a grapheme, then NFC is clearly not normalized enough. This is why we have defined a further normalization stage, NFG - Normalization Form for Graphemes.

NFG uses out-of-band signalling in the string to refer the conforming implementation to a decomposition table. UCS-4 specifies an encoding for Unicode codepoints from 0 to 0x7FFFFFFF. In other words, any codepoint with the top bit set is undefined. We define these out-of-band codepoints as indexes into a lookup table, which maps between a temporary ID and its associated decomposition.

In practice, this goes as follows: assuming our Russified Serbo-Croat string is the first string that Parrot sees, when it is converted to Parrot's default format it would be normalized to a single character having the codepoint 0xFFFFFFFF. (In other words, -1; grapheme table entries count backwards so that Parrot can check whether a character is a grapheme using a single sign-comparison operation.) At the same time, Parrot would insert an entry into the global grapheme table, Parrot_grapheme_table, at array index 0, consisting of the codepoint sequence 0x00000438 0x0000030F - that is, the Unicode decomposition of the grapheme.

This has one big advantage: applications which don't care about graphemes can just pass the codepoint around as if it's any other number - uh, character. Only applications which care about the specific properties of Unicode characters need to take on the overhead of peeking inside the grapheme table and reading the decomposition.

Individual languages may need to think carefully about their concept of, for instance, "the length of a string" to determine whether or not they need to visit the lookup table for these strings. At any rate, Parrot should provide both grapheme-aware and codepoint-aware iterators for string traversal.
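A minimal sketch of that scheme follows, assuming the backwards-counting IDs described above; the helper names and the parallel array of decomposition lengths are assumptions made for the example, not Parrot's API.

    #include <stdint.h>

    typedef uint32_t Parrot_Rune;              /* a Parrot_UInt4 */

    static int
    rune_is_grapheme(Parrot_Rune r)
    {
        return (int32_t)r < 0;                 /* the single sign comparison */
    }

    static uint32_t
    rune_to_table_index(Parrot_Rune r)
    {
        return ~r;                             /* 0xFFFFFFFF -> 0, 0xFFFFFFFE -> 1 */
    }

    /* Codepoint-aware length: expand grapheme-table entries.  The
     * grapheme-aware length of an NFG string is simply nrunes. */
    static uint32_t
    codepoint_length(const Parrot_Rune *runes, uint32_t nrunes,
                     const uint32_t *decomposition_lengths)
    {
        uint32_t i, len = 0;
        for (i = 0; i < nrunes; i++)
            len += rune_is_grapheme(runes[i])
                     ? decomposition_lengths[rune_to_table_index(runes[i])]
                     : 1;
        return len;
    }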

The Parrot internal character type

The other advantage of NFG is that it gives us a single value that can unambiguously be passed between any charset/encoding pair: any variable typed as Parrot_Rune is defined to be a Parrot_UInt4 holding a Unicode codepoint, where values >= 0x80000000 are understood to be entries in the global Parrot_grapheme_table array.

Strings in Parrot's native string format will probably be an array of Parrot_Runes. (It is possible that native strings may want to carry around their own grapheme tables rather than use the global one; in which case, their codepoints are better off called Parrot_UInt4s, to reserve the interpretation of the Parrot_Rune type. But let's burn that bridge when we come to it.)

Because Parrot_Runes are a single unambiguous representation of a character at the highest level Parrot will be required to support - that is, in principle, any character from any character set can be represented by at most one Parrot_Rune - they are perfect as an intermediate character representation for converting between string types.

Some useful statistics for optimizers

I took a large sample of highly mixed Unicode data, generated by taking the contents of English Wikipedia and removing 99% of the pure-ASCII content, and looked at the distribution of UTF-8 bytes per character.

    1 byte:  116324899
    2 bytes:  19229506
    3 bytes:  12198853
    4 bytes:      4170

There is a possible methodological problem here in that English Wikipedia is more likely to contain content from languages "close" to English and transliterations into Latin letters using accented forms rather than "exotic" scripts. Nevertheless, as a first approximation it appears that a majority of the world's data still fits in one byte of UTF-8, but once you pass one byte, two or three UTF-8 bytes are (very roughly speaking) equally likely.

IMPLEMENTATION

Division of labour between "charset" and "encoding"

Character sets and encodings are related but separate concepts. An encoding is the lower-level representation of a string's data, whereas the character set determines higher-level semantics. Typically, character set functions will ask a string's encoding functions to retrieve data from the string, and then process the retrieved data.

In the (rare) cases where a character set has a particularly strong link to an encoding (ISO8859-X with a fixed 8-bit encoding being the prime example), the character set functions may, for optimization purposes, contain code which bypasses the encoding functions and handles the string data directly. However, an encoding check MUST still be made, and equivalent code to do things "the slow way" must be included in case the check fails.
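The pattern looks roughly like the sketch below; the STRING fields, the fixed_8_encoding object and the get_codepoint accessor are hypothetical stand-ins, not the real Parrot structures.

    /* Hypothetical sketch of the check-then-fallback pattern described
     * above; none of these names should be taken as the real Parrot API. */
    static INTVAL
    charset_index_of_cr(const STRING *s)
    {
        INTVAL i;

        if (s->encoding == &fixed_8_encoding) {
            /* Fast path: one byte per codepoint, so touch the buffer directly. */
            const unsigned char *p = (const unsigned char *)s->strstart;
            for (i = 0; i < (INTVAL)s->strlen; i++)
                if (p[i] == 0x0D)
                    return i;
        }
        else {
            /* Slow path: identical logic via the encoding's accessors. */
            for (i = 0; i < (INTVAL)s->strlen; i++)
                if (s->encoding->get_codepoint(s, i) == 0x0D)
                    return i;
        }
        return -1;    /* no carriage return found */
    }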

The global grapheme table

When constructing strings in NFG, graphemes not expressible as a single character in Unicode are represented by a grapheme ID which looks up into the global grapheme table. When Parrot comes across a grapheme, it must first determine whether or not the grapheme already has an entry in the grapheme table. Therefore the table cannot strictly be an array, as that would make lookup inefficient. The grapheme table is represented, then, as both an array and a hash structure. The array gives forward-lookup and the hash allows reverse lookup. Converting a string of characters into a grapheme ID can be demonstrated with the following Perl 5 pseudocode, for the grapheme 0x438 0x30F:

   # Reuse the existing ID if this grapheme has been seen before;
   # otherwise add its decomposition to the table and assign a new ID.
   $codepoint = ($grapheme_lookup->{0x438}{0x30F} ||= do {
                   push @grapheme_table, "\x{438}\x{30F}";
                   ~ $#grapheme_table;   # ID is the complement of the index
                });
   push @string, $codepoint;

Implications for string implementation

To provide an area for encoding-specific information, such as the current normalization form, the parrot_string_representation_t is to be re-worked as a void pointer to an encoding-specific representation structure. For instance, one such structure for the Parrot native encoding may look like this:

    typedef Parrot_UInt4 Parrot_Rune;

    typedef struct parrot_native_representation_t {
        normalization_form_t normalization;
        /* ... further encoding-specific fields ... */
    } parrot_native_representation_t;

String access API

It is hard to fully lay out a list of what string functions will be required as this will obviously be determined by use. This PDD assumes for the moment that the current string functions will on the whole be maintained, but will be adapted to use the ideas described here - particularly the use of Parrot_Rune as an intermediary character representation, the addition of normalization form structures and the Parrot_grapheme_table.

Conversions between normalization form, encoding and charset

Given the existence of a single unambiguous character type, there is no need for a large and complicated set of string conversion functions. All conversion will be done with a function called string_slow_copy:

    INTVAL string_slow_copy(STRING* src, STRING* dst)

To convert a string from one format to another, simply create a new empty string with the required attributes, and pass the source string and this new string to string_slow_copy. This function creates two string iterators and calls get_and_advance on the source string's iterator to read each character in turn into a Parrot_Rune, and then calls set_and_advance on the destination string's iterator to append the character. This will intrinsically perform a conversion using Parrot_Runes as the intermediary.

Note that encoding_get_codepoint on strings which are not in NFG may need to read ahead multiple characters in order to turn them into a single grapheme, in order to return a Parrot_Rune.
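Putting the pieces together, string_slow_copy might look something like the sketch below; the iterator type, the initializer, and the exact signatures of get_and_advance and set_and_advance are assumptions made for illustration.

    /* Sketch only: copy src into dst character by character, letting the
     * two iterators handle each string's own charset and encoding. */
    INTVAL
    string_slow_copy(STRING *src, STRING *dst)
    {
        String_iter src_iter, dst_iter;

        encoding_iter_init(src, &src_iter);     /* hypothetical initializer */
        encoding_iter_init(dst, &dst_iter);

        while (src_iter.charpos < src->strlen) {
            /* Read one character from src as a Parrot_Rune ... */
            Parrot_Rune c = src_iter.get_and_advance(&src_iter);
            /* ... and append it to dst, converting on the way in. */
            dst_iter.set_and_advance(&dst_iter, c);
        }
        return 0;
    }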

String programming checklist

  • Are you manipulating the string data directly? It's OK to do this if you know that the string is in Parrot's native format and you're using Parrot_Runes. It's generally not OK to do this otherwise.

  • Related to the above, if you see *s++ in code you've just written, then warning bells should be sounding.

  • Parrot_Runes are guaranteed to be Unicode codepoints up to 0x7FFFFFFF and Parrot graphemes above that. Nothing else is guaranteed to be in Unicode; operations which require understanding the semantics of the string must go through the character set structure.

  • Similarly, operations which require an understanding of the layout of the string must go through the encoding structure.

REFERENCES

http://plan9.bell-labs.com/sys/doc/utf.html - Plan 9's Runes are not dissimilar to Parrot's Runes, and this is a good introduction to the Unicode world. (Also available at http://sirviente.9grid.es/sources/plan9/sys/doc/utf.ps )

http://www.unicode.org/reports/tr15/ - The Unicode Consortium's explanation of different normalization forms.

"Unicode: A Primer", Tony Graham - Arguably the most readable book on how Unicode works.

"Advanced Perl Programming", Chapter 6, "Unicode"