NAME

Locale::MakePhrase - Language translation facility

SYNOPSIS

This group of modules is used to translate application text strings, which may or may not include values that also need to be translated, into the preferred language of the end-user.

Example:

  use Locale::MakePhrase::BackingStore::Directory;
  use Locale::MakePhrase;
  my $bs = Locale::MakePhrase::BackingStore::Directory->new(
    directory => '/some/path/to/language/files',
  );
  my $mp = Locale::MakePhrase->new(
    language => 'en_AU',
    backing_store => $bs,
  );
  ...
  my $color_count = 1;
  print $mp->translate("Please select [_1] colors.", $color_count);

Output:

  Please select a colour.

Notice that a) the word 'color' has been localised to Australian English, and b) that the argument has influenced the resultant output text to take into account the display of the singular version.

DESCRIPTION

The aim of these modules is to implement run-time evaluation of an input phrase, including program arguments, and have it generate a suitable output phrase in the language and encoding specified by the user of the application.

Since this problem has been around for some time, there are a number of sources of useful information available on the web which describe why this problem is hard to solve. The problem with most existing solutions is that each design suffers some form of limitation, often because the designer assumed that there are enough commonalities between some or all languages that these commonalities can be factored into various rules which can be implemented in programming code.

However, each language has its own history and evolution. Thus it is pointless to compare two different languages unless they have a common history and a common character set.

Before continuing to read this document, you really should read the following info on the Locale::Maketext Perl module:

  http://search.cpan.org/~sburke/Locale-Maketext-1.08/lib/Locale/Maketext.pod

and look at the slides presented here:

  http://www.autrijus.org/webl10n/

The Locale::MakePhrase modules are based on a design similar to the Locale::Maketext module, except that this new implementation takes a different approach:

Since it is possible (and quite likely) that the application will need to be able to understand the language rules of any specific language, we want to use a run-time evaluation of the rules that a linguist would use to convert one language to another. Thus we have coined the term linguistic rules as a means to describe this technique. These rules are used to decide which piece of text is displayed, for a given input text and arguments.

REQUIREMENTS

The Locale::MakePhrase module was initially designed to meet the requirements of a web application (as opposed to a desktop application), which may display many languages in the HTML form at any given instance.

Its design is modelled on the use of language lexicons, as found in the existing Locale::Maketext Perl module. We built a new module because:

  • We wanted to completely abstract the language rule capability, to be programming language agnostic so that we could re-implement this module in other programming languages.

  • We needed run-time evaluation of the rules, since the translations may be updated at any time; new rules may be added whenever there is some ambiguity in the existing phrase. Also, we didn't want to restart the application whenever we updated a rule.

  • We would like to support various types of storage mechanisms for the translations. The original design constraint preferred the use of a PostgreSQL database to hold the translations, whereas most existing language translation systems use flat files.

  • We want to store and manipulate the current text phrase encoded only in UTF-8 (ie: we don't want to store the text in a locale-specific encoding). This allows us to output text to any other character set.

As an example of application usage, it is possible for a Hebrew-speaking user to be logged into a web form which contains Japanese data. In that case:

  • Menus and tooltips are translated into the user's language (ie: Hebrew).

  • Titles are in the language of the dataset (ie: Japanese).

  • Some of the data is in the Latin character set (ie: English).

  • If the user prefers to see the page as RTL rather than LTR, the page is altered to reflect this preference.

BACKGROUND

When implementing any new software, it is necessary to understand the problem domain. In the case of language translation, there are a number of requirements that we can define:

  1. Quite a few people speak multiple languages; we would like the language translation system to use the user's preferred language localisation, or if we don't know which language that is, try to make an approximate guess, based on application capabilities.

    Eg:

    In a web-browser, the user normally sets their preferred language/dialect. The browser normally sends this information to a web-server during the request for a page. The server may choose to show the page contents in the language the user prefers.

  2. Since some people speak multiple languages, the application may not have been localised to their preferred localisation. We should try to fall back to using a language which is similar.

    Eg:

    If there are no Spanish (Spain) translations available, we should fall back to Mexican Spanish, since the two have many words in common.

  3. Some languages support the notion of a dialect for that language. A good example is that the English language is used in many countries, but countries such as the United States, Australia and Great Britain each have their own localised version, ie: the dialect is specified as the country or region. The language translation mechanism needs to be able to use the user's preferred dialect when looking up the text to display. If no translation is found, then it should fall back to the parent language.

    Eg:

    The language/dialect of Australia is defined as 'en_AU'. When we look up a text translation and fail to find one, we should then try to look up the 'en' translation.

  4. Some languages are written using a script which displays its output as right-to-left text (as used by Arabic, Hebrew, etc), rather than left-to-right text (as used by English, Latin, Greek, etc). The language translation mechanism should allow the text display mechanism to change the text direction if that is a requirement (which is another reason for mandating the use of UTF-8).

  5. The string to be translated should support the ability to re-order the wording of the text.

    Eg:

    In English we would normally say something like "Please enter your name"; in Japanese the equivalent translation would be something like "Enter your name, please" (although it would be in Japanese, not English).

  6. The text translation mechanism should support the ability to show arguments supplied to the string (by the application), within the correct context of the meaning of the string.

    Eg:

    We could say something like "You selected 4 balls" (where the number 4 is program dependent); in another language you may want to say the equivalent of "4 balls selected".

    Notice that the numeric position has moved from being the third mnemonic, to being the first mnemonic. The requirement is that we would like to be able to rearrange the order/placement of any mnemonic (including any program arguments).

  7. We would like to be able to support an arbitrary number of argument replacements. We shouldn't be limited in the number of replacements that can occur, for any given number of program arguments.

    Eg:

    We want to have an unlimited number of placeholders as exemplified by the string "Select __ balls, __ bats, __ wickets, plus choose __ people, ___ ..." and so on.

  8. Most program arguments that are given to strings are in numeric format (i.e. they are a number). We would also like to support arguments which are text strings, which themselves should be open to language translation (but only after rule evaluation). The purpose being that the output phrase should make sense within the current context of the application.

  9. Many languages have the concept of singular and plural. Other languages have no such concept, while others still have the concept of duality. There is also the concept that a phrase can be descriptive when discussing the zero of something. Thus we want to display a specific phrase, depending on the value of an argument.

    Eg:

    In English, the following text "Selected __ files" has multiple possible outputs, depending on the program value; we can have:

     0 case: "No files selected" - no numeric value
     1 case: "One file selected" - 'files' is singular
     2 case: "Selected two files" - the '__' is a text value, not a number
     more than 2 case: "Lots of selections" - no direct comparison to the original text

    ...as we can see, this is just for translating a single text string, from English to English.

    To counter this problem, the translation system needs to be able to apply linguistic rules to the original text, so that it can evaluate which piece of text should be displayed, given the current context and program argument.

  10. When updating a specific phrase for language translation, the next screen re-draw should show the new translation text. Thus translations need to be dynamically changeable, and run-time configurable.
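Requirements 2 and 3 above amount to walking a list of language tags from the most specific to the least specific. The following is a minimal sketch in plain Perl, assuming a hypothetical in-memory lexicon (%LEXICON) and a hypothetical helper (lookup_phrase); neither is part of Locale::MakePhrase, where a backing store would supply the entries.

```perl
use strict;
use warnings;

# Hypothetical in-memory lexicon keyed by language tag; a real application
# would fetch these entries from a backing store instead.
my %LEXICON = (
    en_au => { 'Pick a color' => 'Pick a colour' },
    en    => { 'Pick a color' => 'Pick a color', 'Log out' => 'Sign out' },
);

# Try each tag in order (dialect first, then parent language) and return
# the first translation found; fall back to the key itself if none exists.
sub lookup_phrase {
    my ($key, @tags) = @_;
    for my $tag (@tags) {
        my $entry = $LEXICON{$tag} or next;
        return $entry->{$key} if exists $entry->{$key};
    }
    return $key;
}

print lookup_phrase('Pick a color', 'en_au', 'en'), "\n";   # found in the dialect
print lookup_phrase('Log out',      'en_au', 'en'), "\n";   # found in the parent language
```

Returning the key when nothing matches keeps the application usable even when a translation is missing.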

INTERNAL TEXT ENCODING

This module uses UTF-8 text encoding internally, thus it requires a minimum of Perl 5.8. So, for any given application string and user language combination, we require the backing store to look up the combination, then return a list of Locale::MakePhrase::LanguageRule objects, which must be created with the key and translated strings stored in the UTF-8 encoding.

Thus, to simplify the string-load functionality, we recommend loading and storing the translated strings as UTF-8 encoded strings. See Locale::MakePhrase::BackingStore for more information.

eg:

The PostgreSQL backing store assumes that the database instance stores strings in the UNICODE encoding (rather than, say, ASCII); this avoids the need to translate every string when we load it.

OUTPUT TEXT ENCODING

Locale::MakePhrase uses UTF-8 encoding internally, as described above. This is also the default output encoding. You can choose to have a different output encoding, such as ISO-8859-1.

Normally, if the output display mechanism can display UNICODE (encoded as UTF-8), then text will be rendered in the correct language and correct text direction (ie. left-to-right or right-to-left).

By supplying the encoding as a constructor argument, Locale::MakePhrase will convert the translated text from UTF-8 into your output-specific encoding (using the Encode module). This is useful in cases where font support within an application hasn't yet evolved to the same level as a language-specific font.

See the Encode module for a list of available output encodings.

Default output character set encoding: UTF-8
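The conversion described above can be illustrated with the core Encode module on its own. This is only a sketch of the general technique (decode UTF-8 octets, then encode to the output character set); the variable names are illustrative and it does not show how Locale::MakePhrase calls Encode internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "cafe" with an acute accent on the e, as UTF-8 octets (for example, as
# read from a UTF-8 translation file).
my $utf8_octets = "caf\xc3\xa9";

# Decode into Perl's internal character representation...
my $text = decode('UTF-8', $utf8_octets);

# ...then encode into the output character set chosen by the application.
my $latin1_octets = encode('ISO-8859-1', $text);

printf "%d characters, %d UTF-8 octets, %d ISO-8859-1 octets\n",
    length($text), length($utf8_octets), length($latin1_octets);
```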

WHAT ARE LINGUISTIC RULES?

Since the concept of a linguistic rule is at the heart of this translation module, its documentation is located in Locale::MakePhrase::RuleManager. It explains the syntax of the rule expressions, how rules are sorted and selected, as well as the operators and functions that are available within the expressions. You should read that information, before continuing.

Available operators:

==, !=, <, >, <=, >=, eq, ne

Available functions:

defined(x), length(x), int(x), abs(n), lc(s), uc(s), left(s,n), right(s,n), substr(s,n), substr(s,n,r)

Object API

The following methods are part of the Locale::MakePhrase object API:

new()

Constructs a new instance of a Locale::MakePhrase object. Takes the following named parameters (ie: via a hash or hashref):

language
languages

Specify one or more languages which are used for locating the correct language string (all forms are supported; first found is used).

They take either a string (eg 'en'), a comma-separated list (eg 'en_AU, en_GB') or an array of strings (eg ['en_AU','en_GB']).

The order specified is the order in which phrases are looked up. These strings go through a manipulation process (using the Perl module I18N::LangTags) of:

  1. The strings are converted to RFC3066 language tags; these become the primary tags.

  2. Superordinate tags are retrieved for each primary tag.

  3. Alternates of the primary tags are then retrieved.

  4. Panic language tags are retrieved for each primary tag (if enabled).

  5. The fallback language is retrieved (see 'fallback language').

  6. Duplicate language tags are removed.

  7. All tags are converted to lowercase, and '-' are changed to '_'.

This leaves us with a list of at least the fallback language.
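The steps above can be sketched with functions that I18N::LangTags exports. The helper name expand_language_tags and its arguments are hypothetical, and the exact ordering that Locale::MakePhrase produces may differ; this sketch also normalises case before removing duplicates, so that 'en-AU' and 'en-au' count as the same tag.

```perl
use strict;
use warnings;
use I18N::LangTags qw(locale2language_tag super_languages
                      alternate_language_tags panic_languages);

# Hypothetical helper mirroring the listed steps; it is not the module's
# own code. $fallback and $panic stand in for the fallback language and
# the panic_language_lookup option.
sub expand_language_tags {
    my ($requested, $fallback, $panic) = @_;

    # 1. Convert locale-style strings (en_AU) to language tags (en-AU).
    my @primary = grep { defined } map { locale2language_tag($_) } @$requested;

    my @tags = @primary;
    push @tags, map { super_languages($_) }         @primary;   # 2.
    push @tags, map { alternate_language_tags($_) } @primary;   # 3.
    push @tags, panic_languages(@primary) if $panic;            # 4.
    push @tags, $fallback;                                      # 5.

    # 6 and 7. Lowercase, change '-' to '_', and drop duplicates while
    # keeping the first occurrence of each tag.
    my (%seen, @result);
    for my $tag (@tags) {
        (my $norm = lc $tag) =~ tr/-/_/;
        push @result, $norm unless $seen{$norm}++;
    }
    return @result;
}

print join(', ', expand_language_tags(['en_AU'], 'en', 0)), "\n";
```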

charset
encoding

This option (both forms are supported; first found is used) allows you to change the output character set encoding, to something other than UTF-8, such as ISO-8859-1.

See OUTPUT TEXT ENCODING for more information.

backing_store

Takes either a reference to a backing store instance, or a string which can be used to dynamically construct the instance.

The final backing store instance must have a type of Locale::MakePhrase::BackingStore.

Default: use a Locale::MakePhrase::BackingStore

rule_manager

Takes either a reference to a rule manager instance, or a string which can be used to dynamically construct the instance.

The final manager instance must have a type of Locale::MakePhrase::RuleManager.

Default: use a Locale::MakePhrase::RuleManager

malformed_character_mode

Perl normally outputs \x{HH} for malformed characters (or \x{HHHH}, \x{HHHHHH}, etc. for wide characters). Setting this value changes the behaviour to output alternative character entity formats.

Note that if you are using Locale::MakePhrase to generate strings used within web pages / HTML, you should set this parameter to Locale::MakePhrase->MALFORMED_MODE_HTML.
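To show what alternative character entity formats can look like, the sketch below uses the fallback constants of the core Encode module directly. Whether the Locale::MakePhrase modes map one-to-one onto these Encode constants is an assumption; the sketch only demonstrates the Encode behaviour for a character that the target encoding cannot represent.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+4E2D (a CJK ideograph) has no ISO-8859-1 representation.
my $text = "A\x{4e2d}";

# Encode may modify its source argument when a CHECK value is given,
# so each call works on a fresh copy of the text.
my ($copy1, $copy2, $copy3) = ($text, $text, $text);

my $perl_style = encode('ISO-8859-1', $copy1, Encode::FB_PERLQQ);    # \x{...} escape
my $html_style = encode('ISO-8859-1', $copy2, Encode::FB_HTMLCREF);  # decimal entity
my $xml_style  = encode('ISO-8859-1', $copy3, Encode::FB_XMLCREF);   # hexadecimal entity

print "$perl_style\n$html_style\n$xml_style\n";
```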

numeric_format

This option allows the user to control how numbers are output. You can set the output to be one of a number of forms of stringification defined in Locale::MakePhrase::Numeric, eg:

'.', ',', '(', ')'

Place comma separators between each group of three digits; use brackets for negative numbers, as in: (10,000,000.1)

This takes either a string format or an array reference containing the format.

Default: don't format; show decimal as full-stop
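The effect of such a format can be sketched in plain Perl. The function name format_number_sketch is hypothetical, and the order of the four elements (decimal point, thousands separator, negative prefix, negative suffix) is an assumption read from the example above; see Locale::MakePhrase::Numeric for the authoritative definitions.

```perl
use strict;
use warnings;

# Hypothetical formatter; $format is an array reference holding, in
# assumed order: decimal point, thousands separator, negative prefix,
# negative suffix.
sub format_number_sketch {
    my ($number, $format) = @_;
    my ($decimal, $separator, $neg_open, $neg_close) = @$format;

    my $negative = $number < 0;
    my ($int, $frac) = split /\./, abs($number), 2;

    # Insert the separator between each group of three digits, working
    # from the right-hand end of the integer part.
    1 while $int =~ s/^(\d+)(\d{3})/$1$separator$2/;

    my $out = defined $frac ? "$int$decimal$frac" : $int;
    return $negative ? "$neg_open$out$neg_close" : $out;
}

print format_number_sketch(-10000000.1, ['.', ',', '(', ')']), "\n";
print format_number_sketch(1234,        ['.', ',', '(', ')']), "\n";
```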

die_on_bad_translation

Set this option to true to make Locale::MakePhrase die if the translated string is incorrectly formatted (eg: too many argument place holders are specified) or the expression is not valid. The alternative is to output the phrase <INVALID TRANSLATION> or <INVALID EXPRESSION>.

Dying here means that translations have the ability to abort your code. If you don't have control over the quality of the phrases added to your dictionary, you should probably use the default behaviour.

Note that an invalid expression or translation generates a warning to STDERR.

Default: don't die; output the appropriate error phrase

translate_arguments

Set this option to false to stop Locale::MakePhrase from translating the supplied arguments before applying them to the output of the engine. Leaving it enabled saves you from having to call translate() for each argument within your own code.

Default: do translate arguments

add_newline

Set this option to true to make Locale::MakePhrase automatically add newline characters to the end of every translated string. The reason for having this is to allow your translation-key to not require the OS-dependent newline character(s), and to not require newline character(s) on the target-translation.

Note that the API provides alternate method calls so as to allow you to add newline character(s) as necessary.

Default: don't add any newline characters

panic_language_lookup

Set this option to true to make Locale::MakePhrase load 'panic' languages as defined by "panic_languages" in I18N::LangTags. Basically it provides a mechanism to allow the engine to return a language string from languages which have a similar heritage to the primary language(s), if a translation from the primary language hasn't been found.

eg: Spanish has a similar heritage to Italian, thus if no translations are found in Italian, then Spanish translations will be used.

Default: don't look up panic languages

Notes:

If the arguments aren't a hash or hashref, then we assume that the arguments are language tags.

If you don't supply any language, the fallback language will be used.

Default language: en

$self init([...])

Gives a sub-class a chance to control construction of the object. You must return a reference to $self, to 'allow' the construction to complete.

At this point of construction you can call $self->options() which returns a reference to the current constructor options. This allows you to add/modify any existing options; for example you may want to inject something specific...

$string context_translate($context, $string [, ...])

[ $context is either a text string or an object reference (which then gets stringified into its class name). ]

This is a primary entry point; call this with your application context, your string and any program arguments which need to be translated. Note however that in most cases you will most likely want to call the translate function instead; see below.

In some cases you will find that you use the same text phrase in separate parts of your application, but the meaning of the phrase is different (due to the different application context); supplying a context allows your backing store to use the extra context information to return the correct language rules.

The steps involved in a string translation are:

  1. Fetch all possible translation rules for all language tags (including alternates and the fallbacks), from the backing store. The store will return a list reference of LanguageRule objects.

  2. Sort the list based on the implementation defined in the Locale::MakePhrase::RuleManager module.

  3. Select the rule instance for which the rule-expression evaluates to true for the supplied program arguments (if there is no expression, the rule is always true).

  4. If no rules have been selected, then make a rule from the input string.

  5. Apply the program arguments to the rule's translated text. If the argument is a text phrase, it (optionally) undergoes the language translation procedure. If the argument is numeric, it is formatted by one of your language sub-classes, or the Locale::MakePhrase::Numeric module.

  6. We apply the output character set encoding to convert the text from UTF-8 into the preferred character set. If the output encoding is UTF-8 (thus matching the internal encoding), this step does nothing.
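Steps 3 to 5 can be sketched in plain Perl. Everything in this sketch is hypothetical: the rule list, the use of code references in place of parsed rule expressions, and the function translate_sketch. It assumes the rules have already been fetched and sorted (steps 1 and 2), and it skips argument translation and output encoding.

```perl
use strict;
use warnings;

# Hypothetical rules for the key "Selected [_1] files". Each rule pairs a
# test (a code reference standing in for a parsed rule expression) with
# the text to use when the test is true. A rule without a test is always true.
my @rules = (
    { test => sub { $_[0] == 0 }, text => 'No files selected' },
    { test => sub { $_[0] == 1 }, text => 'One file selected' },
    { test => sub { $_[0] == 2 }, text => 'Selected two files' },
    {                             text => 'Selected [_1] files' },
);

sub translate_sketch {
    my ($key, @args) = @_;

    # Step 3: pick the first rule whose test passes.
    my ($rule) = grep { !$_->{test} || $_->{test}->(@args) } @rules;

    # Step 4: with no matching rule, build one from the input string.
    $rule ||= { text => $key };

    # Step 5: substitute [_1], [_2], ... with the program arguments.
    (my $out = $rule->{text}) =~ s/\[_(\d+)\]/$args[$1 - 1]/g;
    return $out;
}

for my $count (0, 1, 2, 7) {
    print translate_sketch('Selected [_1] files', $count), "\n";
}
```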

$string translate($string [, ...])

This is a primary entry point; call this with your string and any program arguments which need to be translated.

This function is a wrapper around the context_translate function, where the context is set to undef (which is usually what you want).

$string context_translate_ln($context, $string [, ...])

This is a primary entry point; call this with your context, string and any program arguments which need to be translated.

This function is a wrapper around the context_translate function, but this adds newline character(s) to the output.

$string translate_ln($string [, ...])

This is a primary entry point; call this with your string and any program arguments which need to be translated.

As above, this function is a wrapper around the context_translate function, where the context is set to undef, but this adds newline character(s) to the output.

$string format_number($number,$options)

This method implements the numbers-specific formatting, by calling into Locale::MakePhrase::Numeric's stringify_number method.

To provide custom handling of number formatting, you can do one of:

$backing_store fallback_backing_store()

Backing store to use, if not specified on construction. You can override this in a sub-class.

$string fallback_language()

Language to fall back to, if all others fail (this defaults to 'en'). You can override this method in a sub-class.

Usually this will be the language in which you are writing your application code (eg: you may be coding using German rather than English).

Note that this must return an RFC 3066 compliant language tag.

$string_array language_classes()

This method returns a list of possible class names (which must be sub-classes of Locale::MakePhrase::Language) which can get prepended to the language tags for this instance. Locale::MakePhrase will then try to dynamically load these modules during construction.

The idea being that you simply need to put your language-specific module in the same directory as your sub-class, thus we will find the custom modules.

Alternatively, you can override this method in a sub-class, to return the correct class hierarchy name.

$format numeric_format($format)

This method allows you to set and/or get the format that is being used for numeric formatting. You can supply an array, an array ref, or a string.

Accessor methods

$hash options()

Returns the options that were supplied to the constructor.

$string_array languages()

Returns a list of the language tags that are in use.

$object_list language_modules()

Returns a list of the loaded language modules.

$object backing_store()

Returns the loaded backing store instance.

$object rule_manager()

Returns the loaded rule manager instance.

$string encoding()

Returns the output character set encoding.

$int malformed_character_mode()

Returns the current UTF-8 malformed character output mode.

$bool die_on_bad_translation()

Returns the current state of 'die_on_bad_translation'.

$bool translate_arguments()

Returns the current state of 'translate_arguments'.

$bool add_newline()

Returns the current state of 'add_newline'.

$bool panic_language_lookup()

Returns the current state of 'panic_language_lookup'.

Function API

The following items are helper functions, which can be used to simplify the usage of Locale::MakePhrase objects.

$string mp($string [, ...])

This is a helper function to the translate() function call. It will use the last-constructed instance of Locale::MakePhrase to invoke the translate function on. eg:

  print mp("This is test no: [_1]",$test_no);

could produce:

  This is the first test.

$string __ $string [, ...]

This function is the same as the previous helper function, except that it makes your code easier to read and easier to write. eg:

  print __"This is test no: [_1]",$test_no;

could produce:

  This is test no: 4

Note that we use double-underscore as this makes search-n-replace tasks easier than if we used a single-underscore.

NOTE

The previous functions use a reference to an internal variable. If you are using this module from within Apache (say under mod_perl), make sure that you construct a new instance of a Locale::MakePhrase object, in the child Apache processes.

SUB-CLASSING

These modules can be used standalone, or they can be sub-classed so as to control certain aspects of their behaviour. Each individual module from this group is capable of being sub-classed; refer to each module's specific documentation for more details.

In particular the Locale::MakePhrase::Language module is designed to be sub-classed, so as to support, say, language-specific keyboard input handling.

Construction control

Due to the magic of inheritance, there are two primary ways to control construction of any of these modules:

  1. Override the new() method

    • Implement the new() method in your sub-class

    • call SUPER::new() so as to execute the parent class constructor

    • re-bless the returned object

    For example:

      sub new {
        my $class = shift;
        ...
        my $self = $class->SUPER::new(...sub-class specific arguments...);
        $self = bless $self, $class;
        ...
        return $self;
      }

  2. Override the init() method.

    • implement the init() method in your sub-class

    • return a reference to the current object.

    For example:

      sub init {
        my $self = shift;
        ...
        return $self;
      }

Sub-classing this module

This module (MakePhrase.pm) has a number of methods which can be overridden:

  • init()

  • fallback_backing_store()

  • fallback_language()

  • language_classes()

  • format_number()
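As a sketch of overriding init() and fallback_language(), the example below uses a stand-in parent class (My::BasePhrase) so that it runs without the distribution installed. The stand-in only imitates the behaviour described in this document (new() calls init(), and options() returns a reference to the constructor options); in real code the parent would be Locale::MakePhrase. The class names and the option being changed are illustrative.

```perl
use strict;
use warnings;

# Stand-in for the parent class (hypothetical, only so the sketch runs alone).
package My::BasePhrase;
sub new {
    my ($class, %options) = @_;
    my $self = bless { options => {%options} }, $class;
    return $self->init();              # sub-classes hook in here
}
sub init              { return $_[0] }
sub options           { return $_[0]{options} }
sub fallback_language { return 'en' }

# The sub-class: application code is written in German, so German becomes
# the fallback language, and init() injects an extra constructor option.
package My::GermanApp;
our @ISA = ('My::BasePhrase');

sub init {
    my $self = shift;
    $self->options->{add_newline} = 1;
    return $self;                      # must return the object
}
sub fallback_language { return 'de' }

package main;

my $mp = My::GermanApp->new(language => 'de_AT');
print $mp->fallback_language, "\n";
```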

DEBUGGING

Since this module and framework are relatively new, it is quite likely that a few bugs may still exist. By setting the module-specific DEBUG variable, you can enable debug messages to be sent to STDERR.

Set the value to zero, to disable debug. Setting progressively higher values (up to a maximum value of 9), results in more debug messages being generated.

The following variables can be set:

  $Locale::MakePhrase::DEBUG
  $Locale::MakePhrase::RuleManager::DEBUG
  $Locale::MakePhrase::LanguageRule::DEBUG
  $Locale::MakePhrase::BackingStore::Cached::DEBUG
  $Locale::MakePhrase::BackingStore::File::DEBUG
  $Locale::MakePhrase::BackingStore::Directory::DEBUG
  $Locale::MakePhrase::BackingStore::PostgreSQL::DEBUG

NOTES

Text directionality

This module internally uses UTF-8 character encoding for text storage for a number of reasons, one of them being for the ability to encode the directionality within the text string using Unicode character glyphs.

However it is up to the application drawing mechanism to support the correct interpretation of these Unicode glyphs, before the text can be displayed in the correct direction.

Localised text layout

In some languages there may be a requirement that we lay out the application interface using a different layout scheme than what would normally be available. This requirement is known as layout localisation. An example might be that Chinese text should prefer to be laid out top-to-bottom left-to-right (rather than left-to-right top-to-bottom).

This module doesn't provide this facility, as that is up to the application layout mechanism to handle the differences in layout. eg: A web-browser uses HTML as a formatting language; web-browsers do not implement top-to-bottom text layout.

SEE ALSO

Locale::MakePhrase is made up of a number of modules, for which there is POD documentation for each module. Refer to:

. Locale::MakePhrase::Language
. Locale::MakePhrase::Language::en
. Locale::MakePhrase::LanguageRule
. Locale::MakePhrase::RuleManager
. Locale::MakePhrase::BackingStore
. Locale::MakePhrase::BackingStore::File
. Locale::MakePhrase::BackingStore::Directory
. Locale::MakePhrase::BackingStore::PostgreSQL
. Locale::MakePhrase::Utils
. Locale::MakePhrase::Numeric
. Locale::MakePhrase::Print

It also uses the following modules internally:

. Encode
. Encode::Alias
. I18N::LangTags

You can (and should) read the documentation provided by the Locale::Maketext module.

BUGS

Multiple levels of quoting

The rule expression parser cannot handle multiple levels of quoting. It needs modification to support this (however, this may make the parser slower).

Expression parsing failure

The rule expression parser splits the rule into sub-expressions by chunking on ' && '. This means it will fail to parse a text evaluation containing these characters. For example this will fail to parse:

  _1 eq ' && '

Since the ' && ' is not a common text expression, this bug will probably never be fixed.
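The failure is easy to reproduce with a plain split. This sketch assumes only what is described above (that the rule is chunked on the literal string ' && '); the function chunk_rule is hypothetical and is not the parser's actual code.

```perl
use strict;
use warnings;

# Assumption: the parser chunks a rule on the literal string ' && ',
# as described above. This split reproduces that behaviour.
sub chunk_rule { return split / && /, $_[0] }

my @ok     = chunk_rule(q{_1 > 0 && _1 < 10});
my @broken = chunk_rule(q{_1 eq ' && '});

print scalar(@ok),     " sub-expressions: ", join(' | ', @ok),     "\n";
print scalar(@broken), " sub-expressions: ", join(' | ', @broken), "\n";
```

The second rule is a single comparison, yet it is cut into two fragments, neither of which is a valid expression on its own.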

TODO

Need to add support for male / female context of phrase. This could be implemented using a context specific translation, however the better way would be to add native support for gender.

CREDITS

This module was written for NetRatings, Inc.; they paid for part of my time to develop this module.

Various suggestions and bug fixes were also provided by:

Brendon Oliver
John Griffin

LICENSE

This module was written by Mathew Robertson mailto:mathew@users.sf.net for NetRatings, Inc. http://www.netratings.com. Copyright (C) 2006

This module is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License version 2 (or at your option, any later version) as published by the Free Software Foundation http://www.fsf.org.

This module is distributed WITHOUT ANY WARRANTY WHATSOEVER, in the hope that it will be useful to others.
