The Parrot String API

This document describes how Parrot abstracts the programmer's interface to string types. All strings used in the Parrot core should use the Parrot STRING structure; Parrot programmers should not deal with char * or other string-like types outside of this abstraction without very good reason.

Interface functions on STRINGs

In fact, programmers should hardly ever even access members of the STRING structure directly. The reason for this is that the interpretation of the data inside the structure will be a function of the data's encoding. The idea is that Parrot's strings are encoding-aware so your functions don't need to be; if you break the abstraction, you suddenly have to start worrying about what the data actually means.

String Constructors

The most basic way of creating a string is through the function string_make:

    STRING* string_make(void *buffer, IV buflen, IV encoding, IV flags, IV type)

Here you pass a pointer to a buffer, the number of bytes in that buffer to examine, the buffer's encoding (see below for the enum which defines the different encodings), and the initial values of the flags and type fields. The flags and type values should usually be zero. In return, you'll get a brand new Parrot string. This string will have its own private copy of the buffer, so you don't need to keep the original.

  • Hint: Nothing stops you doing

        string_make(NULL, 0, ... 

If you already have a string, you can make a copy of it by calling

    STRING* string_copy(STRING* s)

This is itself implemented in terms of string_make.

When a string is done with, it can be destroyed using the destroyer

    void string_destroy(STRING *s)
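
The make/copy/destroy life cycle can be sketched with a toy model. This is not Parrot's implementation: TOY_STRING, the toy_ functions and the IV typedef are stand-ins invented for illustration, and the flags and type arguments are left out. The point is only to show why the caller's buffer can be thrown away once the string exists.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef long IV;  /* assumption: stand-in for Parrot's integer type */

/* Hypothetical, trimmed-down stand-in for STRING. */
typedef struct toy_string {
    void *bufstart;  /* private copy of the caller's bytes */
    IV    buflen;    /* bytes allocated */
    IV    bufused;   /* bytes actually in use */
    IV    encoding;
} TOY_STRING;

/* In the spirit of string_make: copy buflen bytes out of buffer. */
TOY_STRING *toy_string_make(const void *buffer, IV buflen, IV encoding) {
    assert(buflen >= 0);
    TOY_STRING *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->buflen = buflen > 0 ? buflen : 1;  /* never ask malloc for 0 bytes */
    s->bufstart = malloc((size_t)s->buflen);
    if (s->bufstart == NULL) { free(s); return NULL; }
    s->bufused = 0;
    if (buffer != NULL && buflen > 0) {
        memcpy(s->bufstart, buffer, (size_t)buflen);
        s->bufused = buflen;
    }
    s->encoding = encoding;
    return s;
}

/* In the spirit of string_copy: built on top of the constructor. */
TOY_STRING *toy_string_copy(const TOY_STRING *s) {
    return toy_string_make(s->bufstart, s->bufused, s->encoding);
}

/* In the spirit of string_destroy: release the buffer, then the header. */
void toy_string_destroy(TOY_STRING *s) {
    if (s == NULL) return;
    free(s->bufstart);
    free(s);
}
```

Because the constructor copies the bytes, scribbling on the original buffer afterwards leaves the string untouched, and a copy never shares its buffer with the string it was copied from.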

String Manipulation Functions

Unless otherwise stated, all lengths, offsets, and so on, are given in characters; you are not allowed to care about the byte representation of a string, so it doesn't make sense to give the values in bytes.

To find out the length of a string, use

    IV string_length(STRING *s)

You may explicitly use s->strlen for this since it is such a useful operation.

To concatenate two strings - that is, to add the contents of string b to the end of string a, use:

    STRING* string_concat(STRING* a, STRING *b, IV flag)

a is updated, and is also returned as a convenience. If the flag is set to a non-zero value, then b will be transcoded to a's encoding before concatenation if the strings are of different encodings. You almost certainly don't want to stick, say, a UTF-32 string on the end of a Big-5 string.

Chopping n characters off the end of a string is achieved with the unlikely-sounding

    STRING* string_chopn(STRING* s, IV n)
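
To see why lengths are counted in characters rather than bytes, here is a rough sketch of what chopping might involve for a UTF-8 buffer. The helper utf8_chopn_bytes is hypothetical and not part of the Parrot API; it assumes the buffer holds well-formed UTF-8 and simply reports how many bytes remain in use after the chop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given `used` bytes of well-formed UTF-8, return the
 * number of bytes left after removing the last n characters. */
size_t utf8_chopn_bytes(const unsigned char *buf, size_t used, size_t n) {
    assert(buf != NULL || used == 0);
    while (n > 0 && used > 0) {
        used--;  /* step back onto the last byte of the final character */
        /* Continuation bytes look like 10xxxxxx; keep stepping back until
         * we land on the byte that starts the character. */
        while (used > 0 && (buf[used] & 0xC0) == 0x80)
            used--;
        n--;
    }
    return used;
}
```

Chopping one character off a three-byte buffer holding "a" followed by a two-byte character leaves one byte, not two, which is exactly the detail callers of string_chopn are spared.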

Not implemented: To retrieve a substring of the string, call

    STRING* string_substr(STRING* src, IV offset, IV length, STRING** dest)

The result will be placed in dest. (Passing in dest avoids allocating a new string at runtime. If *dest is a null pointer, a new string structure is created with the same encoding as src.)

Not implemented: To format output into a string, use

    STRING* string_nprintf(STRING* dest, IV len, char* format, ...) 

dest may be a null pointer, in which case a new native string will be created. If len is zero, the behaviour becomes more sprintfish than snprintf-like.

Elements of the STRING structure

Those implementing the STRING API will obviously need to know about how the STRING structure works. You can find the definition of this structure in string.h:

    struct parrot_string {
      void *bufstart;
      IV buflen;
      IV bufused;
      IV flags;
      IV strlen;
      IV encoding;
      IV type;
      IV unused;
    };

Let's look at each element of this structure in turn.

bufstart

This pointer points to the buffer which holds the string, encoded in whatever the string's specified encoding is. Because of this, you should not make any assumptions about what's in the buffer, and hence you shouldn't try to access it directly.

buflen

This is used for memory allocation; it tells you the currently allocated size of the buffer in bytes.

bufused

bufused, on the other hand, contains the number of bytes out of the allocated buffer which are actually in use. This, together with buflen, is used by the buffer growing algorithm to determine when and by how much to grow the allocation buffer.

flags

This is a general holding area for string flags. The exact flags required have not yet been determined.

strlen

This is the length of the string in characters, as you would expect to find from length $string in Perl. Again, because string buffers may be in one of a number of encodings, this must be computed by the appropriate encoding function. string_compute_strlen(STRING) updates this value, calling the compute_strlen function in the STRING's vtable.

encoding

This specifies the encoding of the buffer, from the following enum:

    enum {
        enc_native,
        enc_utf8,
        enc_utf16,
        enc_utf32,
        enc_foreign,
        enc_max
    };

The "native" string type is whatever happens when you set LANG=C in your shell; on English-speaking machines it's usually ISO-8859-1. A character equals a byte equals eight bits. No shifts, no wide characters, nothing.

UTF8, UTF16, and UTF32 are what they sound like. UTF16 and UTF32 should use the native endianness of the machine.

enc_foreign is there to allow for expansion; foreign strings will call functions from a user-defined string vtable instead of the Parrot built-in ones.

enc_max isn't an encoding. These aren't the droids you're looking for. It's just there to help know how big to make arrays.

type

XXX I don't know what this is for.

unused

This field is, as its name suggests, unused; however, it can be used to hold a pointer to the correct vtable for foreign strings.

String Vtable Functions

The "String Manipulation Functions" above are implemented in terms of string vtables to create encoding abstraction; here's an example of one:

    STRING*
    string_concat(STRING* a, STRING* b, IV flags) {
        return (ENC_VTABLE(a).concat)(a, b, flags);
    }

ENC_VTABLE(a) is shorthand for:

    Parrot_string_vtable[a->encoding]

The Parrot_string_vtable is a static array of virtual tables, defined in string.c. Each encoding has its own vtable; to call the concatenation function for a, we look up its encoding and retrieve the concat entry from that encoding's vtable. This produces a function pointer we can throw the arguments at.
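
The same lookup can be sketched end to end with a cut-down table. Everything prefixed toy_ or TOY_ below is invented for illustration, the enum is trimmed to two encodings, and the vtable holds a single entry; the real table in string.c will look different. It also shows the job enc_max does in sizing the array.

```c
#include <assert.h>
#include <stddef.h>

typedef long IV;  /* assumption: stand-in for Parrot's integer type */

enum { enc_native, enc_utf8, enc_max };  /* trimmed for the sketch */

typedef struct toy_string {
    const unsigned char *bufstart;
    IV bufused;
    IV strlen;
    IV encoding;
} TOY_STRING;

typedef IV (*toy_string_to_iv_t)(TOY_STRING *);

/* One vtable per encoding; this toy version has a single slot. */
typedef struct {
    toy_string_to_iv_t compute_strlen;
} TOY_VTABLE;

/* Native: one byte per character. */
static IV toy_native_compute_strlen(TOY_STRING *s) {
    s->strlen = s->bufused;
    return s->strlen;
}

/* UTF-8: count every byte that is not a 10xxxxxx continuation byte. */
static IV toy_utf8_compute_strlen(TOY_STRING *s) {
    IV chars = 0;
    for (IV i = 0; i < s->bufused; i++)
        if ((s->bufstart[i] & 0xC0) != 0x80)
            chars++;
    s->strlen = chars;
    return chars;
}

/* enc_max is not an encoding; it just sizes the array. */
static const TOY_VTABLE toy_string_vtable[enc_max] = {
    [enc_native] = { toy_native_compute_strlen },
    [enc_utf8]   = { toy_utf8_compute_strlen },
};

#define TOY_ENC_VTABLE(s) toy_string_vtable[(s)->encoding]

/* Encoding-agnostic entry point: look up the vtable, then dispatch. */
IV toy_string_compute_strlen(TOY_STRING *s) {
    assert(s != NULL && s->encoding >= 0 && s->encoding < enc_max);
    return (TOY_ENC_VTABLE(s).compute_strlen)(s);
}
```

The same five bytes come out as five characters when tagged native and four when tagged UTF-8, and the caller never has to know which routine ran.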

Most of the string vtable functions are self-explanatory as they are thin wrappers around the functions given above. Some of them, however, are for internal use only, to help implement other functions. You'll find them under "Non-user-visible String Manipulation Functions" below.

How to add new vtable functions

The first thing to note is that if what you're doing isn't remotely encoding-specific, you don't need to add a vtable function; you can just add a function in string.c (don't forget to add the function prototype to string.h) and you don't need any more of this section. However, most things that people do with strings depend on the encoding of the string data, so if you need to add anything slightly complex, read on.

Currently, the construction of the vtables is not automated; it's hoped that soon someone will automate this and fix this section. However, for the time being, this is what you need to do when you implement a new vtable function:

  1. Check to see whether or not the function's type has a typedef in string.h: for instance, if you have a function that takes a string and an IV and returns a string, use string_iv_to_string_t; otherwise, add your own type.

  2. Add the unqualified name of the function (frobnicate), together with your type, to string_vtable in string.h.

  3. Create a function string_frobnicate in string.c which is a wrapper around frobnicate. This function must take a STRING* parameter, so that the encoding can be extracted and the relevant encoding vtable be found and despatched. It should look something like this:

        yadda
        string_frobnicate(STRING *s, ...) {
            return (ENC_VTABLE(s).frobnicate)(s, ...);
        }

  4. Create functions string_XXX_frobnicate for all values of XXX in the encoding table (or better still, get other people to write them for you). string_native_frobnicate should go in strnative.c, string_utf8_frobnicate should go in strutf8.c, and so on.

  5. Add string_XXX_frobnicate to the end of each vtable returned by string_XXX_vtable.
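
The five steps above might look something like this when squeezed into a single file. This is only a sketch under loud assumptions: frobnicate is given a made-up meaning here (keep at most the first n characters), the toy_ and TOY_ names are invented, and the real thing would be spread across string.h, string.c, strnative.c and strutf8.c.

```c
#include <assert.h>
#include <stddef.h>

typedef long IV;  /* assumption: stand-in for Parrot's integer type */
enum { enc_native, enc_utf8, enc_max };  /* trimmed for the sketch */

typedef struct toy_string {
    const unsigned char *bufstart;
    IV bufused;
    IV encoding;
} TOY_STRING;

/* Step 1: a typedef for "takes a string and an IV, returns a string". */
typedef TOY_STRING *(*toy_string_iv_to_string_t)(TOY_STRING *, IV);

/* Step 2: the unqualified name goes into the vtable structure. */
typedef struct {
    toy_string_iv_to_string_t frobnicate;
} TOY_VTABLE;

/* Step 4: one implementation per encoding. */
static TOY_STRING *toy_native_frobnicate(TOY_STRING *s, IV n) {
    if (n < s->bufused)
        s->bufused = n;  /* one byte per character */
    return s;
}

static TOY_STRING *toy_utf8_frobnicate(TOY_STRING *s, IV n) {
    IV i = 0;
    while (i < s->bufused && n > 0) {
        i++;  /* consume the lead byte */
        while (i < s->bufused && (s->bufstart[i] & 0xC0) == 0x80)
            i++;  /* and any continuation bytes that follow it */
        n--;
    }
    s->bufused = i;
    return s;
}

/* Step 5: each encoding's vtable gains the new entry. */
static const TOY_VTABLE toy_string_vtable[enc_max] = {
    [enc_native] = { toy_native_frobnicate },
    [enc_utf8]   = { toy_utf8_frobnicate },
};

#define TOY_ENC_VTABLE(s) toy_string_vtable[(s)->encoding]

/* Step 3: the public wrapper extracts the encoding and dispatches. */
TOY_STRING *toy_string_frobnicate(TOY_STRING *s, IV n) {
    assert(s != NULL && n >= 0);
    assert(s->encoding >= 0 && s->encoding < enc_max);
    return (TOY_ENC_VTABLE(s).frobnicate)(s, n);
}
```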

Non-user-visible String Manipulation Functions

If you've read this far, I hope you're a Parrot implementor. If you're not helping construct the Parrot core itself, you probably want to look away now.

The first two functions to note are

    IV string_compute_strlen(STRING* s)

and

    IV string_max_bytes(STRING *s, IV iv)

The first updates the contents of s->strlen by contemplating the buffer bufstart and working out how many characters it contains. The second is given a number of characters which we assume are going to be added into the string at some point; it returns the maximum number of bytes that need to be allocated to admit that number of characters. For fixed-width encodings, this is trivial - the "native" encoding, for instance, encodes one byte per character, so string_native_max_bytes simply returns the IV it is passed; string_utf8_max_bytes, on the other hand, returns three times the value that it is passed because a UTF8 character may occupy up to three bytes.

To grow a string to a specified size, use

    void string_grow(STRING *s, IV newsize)

The size is given in characters; string_max_bytes is called to turn this into a size in bytes, and then the buffer is grown to accommodate (at least) that many bytes.
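
How the two functions fit together can be sketched as follows. The toy_ names and TOY_STRING are invented, realloc stands in for whatever allocator Parrot really uses, and unlike string_grow this version returns a status code so that allocation failure is visible. The factor of three for UTF-8 simply mirrors the text above.

```c
#include <assert.h>
#include <stdlib.h>

typedef long IV;  /* assumption: stand-in for Parrot's integer type */
enum { enc_native, enc_utf8 };  /* trimmed for the sketch */

typedef struct toy_string {
    void *bufstart;
    IV    buflen;   /* bytes allocated */
    IV    bufused;  /* bytes in use */
    IV    encoding;
} TOY_STRING;

/* Worst-case byte count for `chars` characters. The factor 3 follows the
 * description above; UTF-8 as standardized today can need 4 bytes for
 * characters outside the Basic Multilingual Plane. */
IV toy_string_max_bytes(const TOY_STRING *s, IV chars) {
    return s->encoding == enc_utf8 ? 3 * chars : chars;
}

/* Grow the buffer so it can hold newsize characters.
 * Returns 0 on success, -1 if the allocation fails (an assumption of this
 * sketch; the documented string_grow returns void). */
int toy_string_grow(TOY_STRING *s, IV newsize) {
    assert(s != NULL && newsize >= 0);
    IV need = toy_string_max_bytes(s, newsize);
    if (need <= s->buflen)
        return 0;  /* already big enough; never shrink */
    void *p = realloc(s->bufstart, (size_t)need);
    if (p == NULL)
        return -1;  /* the old buffer is still valid */
    s->bufstart = p;
    s->buflen = need;
    return 0;
}
```

Note that growing only changes buflen; bufused stays put until something actually writes into the new space.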

Transcoding

The fact that Parrot strings are encoding-abstracted really has to bottom out at some point, and it's usually when two strings of different encodings interact. When we try to append one type of string to another, we have the option of turning the latter string into a string that matches the first string's encoding. This process, translating a string from one encoding into another, is called "transcoding".

In Parrot, transcoding is implemented by the two-dimensional array

    Parrot_transcode_table[enc_from][enc_to]

Each entry in this table is a function pointer which takes two parameters:

    string_utf32_to_utf8(STRING* from, STRING* to)

(If to is a null pointer, a new STRING* will be allocated. As before, it's all about avoiding memory allocation at runtime.)

A null pointer in the table should signify that no transcoding is necessary; Parrot_transcode_table[x][x] should always be NULL.

Note that Parrot_transcode_table[enc_native][enc_utf8] isn't NULL. Don't be tempted to treat native strings as if they were already UTF-8, because "native" doesn't necessarily mean ISO-8859-1.
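
A table of this shape can be sketched with one real entry. All toy_ names are invented, the enum is trimmed to two encodings, and the converter makes a deliberately narrow assumption: it only accepts native bytes below 0x80, precisely because "native" cannot be assumed to be any particular character set. Entries left NULL off the diagonal are merely unimplemented in this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef long IV;  /* assumption: stand-in for Parrot's integer type */
enum { enc_native, enc_utf32, enc_max };  /* trimmed for the sketch */

typedef struct toy_string {
    unsigned char *bufstart;
    IV bufused;
    IV encoding;
} TOY_STRING;

typedef TOY_STRING *(*toy_transcode_t)(const TOY_STRING *from, TOY_STRING *to);

/* Native to UTF-32 in machine byte order. If `to` is NULL a new string is
 * allocated; otherwise the caller's string is reused. ASCII-only sketch. */
static TOY_STRING *toy_native_to_utf32(const TOY_STRING *from, TOY_STRING *to) {
    size_t bytes = (size_t)from->bufused * sizeof(uint32_t);
    unsigned char *buf = malloc(bytes ? bytes : 1);
    if (buf == NULL) return NULL;
    if (to == NULL) {
        to = calloc(1, sizeof *to);
        if (to == NULL) { free(buf); return NULL; }
    }
    for (IV i = 0; i < from->bufused; i++) {
        assert(from->bufstart[i] < 0x80);  /* sketch handles ASCII only */
        uint32_t cp = from->bufstart[i];
        memcpy(buf + (size_t)i * sizeof cp, &cp, sizeof cp);
    }
    free(to->bufstart);  /* drop whatever the destination held before */
    to->bufstart = buf;
    to->bufused = (IV)bytes;
    to->encoding = enc_utf32;
    return to;
}

/* table[from][to]; the diagonal stays NULL because no work is needed. */
static const toy_transcode_t toy_transcode_table[enc_max][enc_max] = {
    [enc_native][enc_utf32] = toy_native_to_utf32,
};
```

A caller looks up toy_transcode_table[src->encoding][wanted], skips the call if the entry is NULL, and otherwise passes either an existing destination string or NULL to have one allocated.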

Foreign Encodings

Fill this in later; if anyone wants to implement new encodings at this stage they must be mad.