NAME

Locale::MakePhrase - Language translation facility

SYNOPSIS

This group of modules is used to translate application text strings, which may or may not include values which also need to be translated, into the preferred language of the end-user.

Example:

  use Locale::MakePhrase::BackingStore::Directory;
  use Locale::MakePhrase;
  my $bs = Locale::MakePhrase::BackingStore::Directory->new(
    directory => '/some/path/to/language/files',
  );
  my $mp = Locale::MakePhrase->new(
    language => 'en_AU',
    backing_store => $bs,
  );
  ...
  my $file_count = 1;
  print $mp->translate("Please select [_1] colors.",$file_count);

Output:

  Please select a colour.

Notice that a) the word 'color' has been localised to Australian English, and b) that the argument has influenced the resultant output text to take into account the display of the singular version.

DESCRIPTION

The aim of these modules is to implement run-time evaluation of an input phrase, including program arguments, and have it generate a suitable output phrase, in the language and encoding specified by the user of the application.

Since this problem has been around for some time, there are a number of sources of useful information available on the web, which describe why this problem is hard to solve. The problem with most existing solutions is that each design suffers some form of limitation, often due to the designer thinking that there are enough commonalities between all/some languages that these commonalities can be factored into various rules which can be implemented in programming code.

However, each language has its own history and evolution. Thus it is pointless to compare two different languages unless they have a common history and a common character set.

Before continuing to read this document, you really should read the following information on the Locale::Maketext Perl module:

  http://search.cpan.org/~sburke/Locale-Maketext-1.08/lib/Locale/Maketext.pod

and look at the slides presented here:

  http://www.autrijus.org/webl10n/

The Locale::MakePhrase modules are based on a design similar to the Locale::Maketext module, except that this new implementation has taken a different approach, that being...

Since it is possible (and quite likely) that the application will need to be able to understand the language rules of any specific language, we want to use a run-time evaluation of the rules that a linguist would use to convert one language to another. Thus we have coined the term linguistic rules as a means to describe this technique. These rules are used to decide which piece of text is displayed, for a given input text and arguments.

REQUIREMENTS

The Locale::MakePhrase module was initially designed to meet the requirements of a web application (as opposed to a desktop application), which may display many languages in the HTML form at any given time.

Its design is modelled on the language lexicons used by the existing Locale::Maketext Perl module. We built a new module because:

-

We wanted to completely abstract the language rule capability, to be programming language agnostic so that we could re-implement this module in other programming languages.

-

We needed run-time evaluation of the rules, since the translations may be updated at any time; new rules may be added whenever there is some ambiguity in the existing phrase. Also, we didn't want to re-start the application whenever we updated a rule.

-

We would like to support various types of storage mechanisms for the translations. The original design constraint preferred the use of a PostgreSQL database to hold the translations - most existing language translation systems use flat files.

-

We want to store/manipulate the current text phrase, only encoded in UTF-8 (ie we don't want to store the text in a locale-specific encoding). This allows us to output text to any other character set.

As an example of application usage, it is possible for a Hebrew speaking user to be logged into a web-form which contains Japanese data. In that case:

a)

Menus and tooltips will be translated into the user's language (ie Hebrew).

b)

Titles will be in the language of the dataset (ie Japanese).

c)

Some of the data will be in the Latin character set (ie English).

d)

If the user prefers to see the page as RTL rather than LTR, the page will be altered to reflect this preference.

BACKGROUND

When implementing any new software, it is necessary to understand the problem domain. In the case of language translation, there are a number of requirements that we can define:

a)

Quite a few people speak multiple languages; we would like the language translation system to use the user's preferred language localisation, or if we don't know which language that is, try to make an approximate guess, based on application capabilities.

Eg:

In a web-browser, the user normally sets their preferred language/dialect. The browser normally sends this information to a web-server during the request for a page. The server may choose to show the page contents in the language the user prefers.

b)

Since some people speak multiple languages, the application may not have been localised to their preferred localisation. We should try to fall back to using a language which is similar.

Eg:

If there are no Spanish translations available, we should fall back to Mexican, since Mexican and Spanish have many words in common.

c)

Some languages support the notion of a dialect for that language. A good example is that the English language is used in many countries, but countries such as the United States, Australia and Great Britain each have their own localised version, ie the dialect is specified as the country or region. The language translation mechanism needs to be able to use the user's preferred dialect when looking up the text to display. If no translation is found, then it should fall back to the parent language.

Eg:

The language/dialect of Australia is defined as 'en_AU' - when we look up a text translation, if we fail we should try to look up the 'en' translation.

d)

Some languages are written using a script which displays its output as right-to-left text (as used by Arabic, Hebrew, etc), rather than left-to-right text (as used by English, Latin, Greek, etc). The language translation mechanism should allow the text display mechanism to change the text direction if that is a requirement (which is another reason for mandating the use of UTF-8).

e)

The string to be translated should support the ability to re-order the wording of the text.

Eg:

In English we would normally say something like "Please enter your name"; in Japanese the equivalent translation would be something like "Enter your name, please" (although it would be in Japanese, not English).

f)

The text translation mechanism should support the ability to show arguments supplied to the string (by the application), within the correct context of the meaning of the string.

Eg:

We could say something like "You selected 4 balls" (where the number 4 is program dependent); in another language you may want to say the equivalent of "4 balls selected".

Notice that the numeric position has moved from being the third mnemonic, to being the first mnemonic. The requirement is that we would like to be able to rearrange the order/placement of any mnemonic (including any program arguments).

g)

We would like to be able to support an arbitrary number of argument replacements. We shouldn't be limited in the number of replacements that need to occur, for any given number of program arguments.

Eg:

We want to have an unlimited number of placeholders as exemplified by the string "Select __ balls, __ bats, __ wickets, plus choose __ people, ___ ..." and so on.

h)

Most program arguments that are given to strings are in numeric format (i.e. they are a number). We would also like to support arguments which are text strings, which themselves should be open to language translation (but only after rule evaluation). The purpose being that the output phrase should make sense within the current context of the application.

i)

Many languages have the concept of singular and plural. Other languages have no such concept, while others still have the concept of duality. A phrase can also be descriptive when discussing the zero of something. Thus we want to display a specific phrase, depending on the value of an argument.

Eg:

In English, the following text "Selected __ files" has multiple possible outputs, depending on the program value; we can have:

 0 case: "No files selected" - no numeric value
 1 case: "One file selected" - 'files' is singular
 2 case: "Selected two files" - the '__' is a text value, not a number
 more than 2 case: "Lots of selections" - no direct comparison to the original text

...as we can see, this is just for translating a single text string, from English to English.

To counter this problem, the translation system needs to be able to apply linguistic rules to the original text, so that it can evaluate which piece of text should be displayed, given the current context and program argument.

j)

When updating a specific phrase for language translation, the next screen re-draw should show the new translation text. Thus translations need to be dynamically changeable, and run-time configurable.
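
Requirement i) can be pictured as a list of candidate phrases, each guarded by a test on the program argument. The following is only a minimal sketch of that idea in plain Perl; the rule structure, the priority field and the select_phrase helper are hypothetical, and this is not the Locale::MakePhrase rule syntax (see Locale::MakePhrase::RuleManager for that).

```perl
use strict;
use warnings;

# Hypothetical rule table for the English phrase "Selected __ files".
# Each rule pairs a test on the argument with the text to display.
my @rules = (
    { priority => 3, test => sub { $_[0] == 0 }, text => 'No files selected'  },
    { priority => 2, test => sub { $_[0] == 1 }, text => 'One file selected'  },
    { priority => 1, test => sub { $_[0] == 2 }, text => 'Selected two files' },
    { priority => 0, test => undef,              text => 'Lots of selections' },
);

# Return the text of the highest-priority rule whose test passes.
# A rule without a test always matches, so it acts as the catch-all.
sub select_phrase {
    my ($count, @candidates) = @_;
    for my $rule (sort { $b->{priority} <=> $a->{priority} } @candidates) {
        return $rule->{text} if !$rule->{test} || $rule->{test}->($count);
    }
    return;    # no rule matched at all
}

print select_phrase($_, @rules), "\n" for 0, 1, 2, 7;
```

Because the rules are plain data, they could in principle be reloaded while the application is running, which is the property requirement j) asks for.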

INTERNAL TEXT ENCODING

This module uses UTF-8 text encoding internally, thus it requires a minimum of Perl 5.8. So, for any given application string and user language combination, we require the backing store to look up the combination, then return a list of Locale::MakePhrase::LanguageRule objects, which must be created with the key and translated strings stored in the UTF-8 encoding.

Thus, to simplify the string-load functionality, we recommend loading and storing the translated strings as UTF-8 encoded strings. See Locale::MakePhrase::BackingStore for more information.

Eg:

The PostgreSQL backing store assumes that the database instance stores strings in the UNICODE encoding (rather than, say, ASCII); this avoids the need to translate every string when we load it.

OUTPUT TEXT ENCODING

Locale::MakePhrase uses UTF-8 encoding internally, as described above. This is also the default output encoding. You can choose to have a different output encoding, such as ISO-8859-1.

Normally, if the output display mechanism can display UNICODE (encoded as UTF-8), then text will be rendered in the correct language and correct text direction (ie left-to-right or right-to-left).

By supplying the encoding as a constructor argument, Locale::MakePhrase will convert the translated text from UTF-8 into your output-specific encoding (using the Encode module). This is useful in cases where font support within an application hasn't yet evolved to the same level as a language-specific font.

See the Encode module for a list of available output encodings.

Default output character set encoding: UTF-8
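
The conversion itself can be tried with the Encode module on its own. This is a minimal sketch of the kind of conversion described above, not the code that Locale::MakePhrase runs internally; the sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "colour cafe" with an acute accent on the final e, as UTF-8 bytes,
# as it might arrive from a backing store.
my $utf8_bytes = "colour caf\xC3\xA9";

# Decode to a Perl character string, then re-encode for output.
my $text   = decode('UTF-8', $utf8_bytes);
my $latin1 = encode('ISO-8859-1', $text);

printf "%d characters, %d UTF-8 bytes, %d ISO-8859-1 bytes\n",
    length $text, length $utf8_bytes, length $latin1;
```

The accented character takes two bytes in UTF-8 but only one in ISO-8859-1, so the two byte lengths differ.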

WHAT ARE LINGUISTIC RULES?

Since the concept of a linguistic rule is at the heart of this translation module, its documentation is located in Locale::MakePhrase::RuleManager. It explains the syntax of the rule expressions, how rules are sorted and selected, as well as the operators and functions that are available within the expressions. You should read that information before continuing.

Available operators:

==, !=, <, >, <=, >=, eq, ne

Available functions:

defined(x), length(x), int(x), abs(n), lc(s), uc(s), left(s,n), right(s,n), substr(s,n), substr(s,n,r)

API

The following methods are part of the core Locale::MakePhrase API:

new()

Construct a new instance of the Locale::MakePhrase object. Takes the following named parameters (ie via a hash or hashref):

language
languages

Specify one or more languages which are used for locating the correct language string (all forms are supported; first found is used).

They take either a string (eg 'en'), a comma-separated list (eg 'en_AU, en_GB') or an array of strings (eg ['en_AU','en_GB']).

The order specified is the order in which phrases are looked up. These strings go through a manipulation process (using the Perl module I18N::LangTags) of:

1)

The strings are converted to RFC3066 language tags; these become the primary tags.

2)

Superordinate tags are retrieved for each primary tag.

3)

Alternates of the primary tags are then retrieved.

4)

Panic language tags are retrieved for each primary tag (if enabled).

5)

The fallback language is retrieved (see 'fallback language').

6)

Duplicate language tags are removed.

7)

All tags are converted to lowercase, and '-' are changed to '_'.

This leaves us with a list of at least the fallback language.
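
The steps above can be approximated with I18N::LangTags directly. This is a rough sketch under stated assumptions rather than the module's own code: the expand_language_tags helper is hypothetical, the panic-language step is left out, and tags are normalised before duplicates are removed so that tags differing only in case collapse together.

```perl
use strict;
use warnings;
use I18N::LangTags qw(locale2language_tag super_languages alternate_language_tags);

# Hypothetical helper: expand the requested languages into a lookup list.
sub expand_language_tags {
    my ($fallback, @requested) = @_;

    # Step 1: convert to language tags; unparseable input is dropped.
    my @primary = grep { defined } map { locale2language_tag($_) } @requested;

    my @tags = (
        @primary,
        (map { super_languages($_) } @primary),            # step 2
        (map { alternate_language_tags($_) } @primary),    # step 3
        $fallback,                                         # step 5
    );

    # Step 7, then step 6: lowercase, '-' to '_', then remove duplicates.
    my %seen;
    return grep { !$seen{$_}++ }
           map  { my $tag = lc $_; $tag =~ tr/-/_/; $tag } @tags;
}

print join(', ', expand_language_tags('en', 'en_AU', 'en_GB')), "\n";
```

For the input shown, the list starts with the two dialects and ends with the shared parent language, which is also the fallback.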

charset
encoding

This option (both forms are supported; first found is used) allows you to change the output character set encoding, to something other than UTF-8, such as ISO-8859-1.

See OUTPUT TEXT ENCODING for more information.

backing_store

Takes either a reference to a backing store instance, or a string which can be used to dynamically construct the instance.

The final backing store instance must have a type of Locale::MakePhrase::BackingStore.

Defaults to: Locale::MakePhrase::BackingStore

rule_manager

Takes either a reference to a rule manager instance, or a string which can be used to dynamically construct the instance.

The final manager instance must have a type of Locale::MakePhrase::RuleManager.

Defaults to: Locale::MakePhrase::RuleManager

malformed_character_mode

Perl normally outputs \x{HH} for malformed characters (or \x{HHHH}, \x{HHHHHH}, etc. for wide characters). Setting this value changes the behaviour to output alternative character entity formats.

Note that if you are using Locale::MakePhrase to generate strings used within web pages / HTML, you should set this parameter to MALFORMED_MODE_HTML.

numeric_format

This option allows the user to control how numbers are output. Numeric stringification can be set to use one of the following three formats:

NUMERIC_FORMAT_NONE

Don't do any pretty-printing of the number

NUMERIC_FORMAT_COMMA

Place comma separators before every third digit, as in: 10,000,000.0

NUMERIC_FORMAT_DOT

Place dot (full-stop) separators before every third digit, as in: 10.000.000.000,0

Defaults to: NUMERIC_FORMAT_COMMA
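
To make the difference between the comma and dot formats concrete, here is a small stand-alone sketch of digit grouping. The group_digits function is hypothetical and is not the module's stringify_number implementation; it only shows the kind of output the two formats describe.

```perl
use strict;
use warnings;

# Insert a separator before every third digit of the integer part and
# join the fractional part with the given decimal mark.
sub group_digits {
    my ($number, $thousands, $decimal) = @_;
    my ($sign, $int, $frac) = $number =~ /^([-+]?)(\d+)(?:\.(\d+))?$/
        or return $number;    # leave anything unexpected untouched
    1 while $int =~ s/^(\d+)(\d{3})/$1$thousands$2/;
    return $sign . $int . (defined $frac ? $decimal . $frac : '');
}

# The number is passed as a string so the trailing ".0" is preserved.
print group_digits('10000000.0', ',', '.'), "\n";    # 10,000,000.0
print group_digits('10000000.0', '.', ','), "\n";    # 10.000.000,0
```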

die_on_bad_args

Set this option to true to make Locale::MakePhrase die if there is an error in your usage of this module.

Default: don't die; gracefully handle mis-use by replacing a missing argument with an empty string.

show_bad_args

Set this option to true to make Locale::MakePhrase show which application string arguments are not defined. It does this by using the phrase <UNDEFINED> in place of the undefined value.

Default: don't show undefined arguments; gracefully replace them with an empty string

panic_language_lookup

Set this option to true to make Locale::MakePhrase load 'panic' languages as defined by "panic_languages" in I18N::LangTags. It provides a mechanism to allow the engine to return a language string from languages which have a similar heritage to the primary language(s), if a translation from the primary language hasn't been found.

eg: Spanish has a similar heritage to Italian, thus if no translations are found in Italian, then Spanish translations will be used.

Default: don't look up panic-languages

Notes:

If the arguments aren't a hash or hashref, then we assume that the arguments are language tags.

If you don't supply any language, the fallback language will be used. Default language: en.

$self init([...])

Allows a sub-class to control construction of the object. You must return a reference to $self, to 'allow' the construction to complete.

At this point of construction you can call $self->options() which returns a reference to the current constructor options. This allows you to add/modify any existing options; for example you may want to inject something specific...

$string translate($string [, ...])

This is a primary entry point; call this with your string and any program arguments which need to be translated.

This function is a wrapper around the "context_translate" function, where the context is set to undef.

$string context_translate($context, $string [, ...])

[$context is either a text string or an object reference (which then gets qualified into its class name).]

This is a primary entry point; call this with your application context, your string and any program arguments which need to be translated.

In most cases the context is undef (as a result of being called via the translate function). However, in some cases you will use the same text phrase in two parts of your application where its meaning differs because of the application context. Supplying a context allows your backing store to use the extra context information to return the correct language rules.
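
As an illustration of why a context helps, the sketch below uses a plain hash in place of a backing store. Everything in it is hypothetical (the context_name and lookup helpers, the class name Shop::Hours and the table layout); it only shows how a string or object context could select between two meanings of the same phrase.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Reduce a context argument to a plain string: objects are replaced by
# their class name, anything else is returned unchanged.
sub context_name {
    my ($context) = @_;
    my $class = blessed($context);
    return defined $class ? $class : $context;
}

# Hypothetical French translations of "Open", keyed by context.
my %french = (
    'Menu'        => { 'Open' => 'Ouvrir' },    # verb: open a file
    'Shop::Hours' => { 'Open' => 'Ouvert' },    # adjective: the shop is open
);

sub lookup {
    my ($context, $phrase) = @_;
    my $name       = context_name($context);
    my $by_context = defined $name ? $french{$name} : undef;
    return $by_context && exists $by_context->{$phrase}
        ? $by_context->{$phrase}
        : $phrase;    # nothing found: fall back to the input phrase
}

my $hours = bless {}, 'Shop::Hours';
print lookup('Menu', 'Open'), "\n";    # Ouvrir
print lookup($hours, 'Open'), "\n";    # Ouvert
print lookup(undef,  'Open'), "\n";    # Open
```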

The steps involved in a string translation are:

1)

Fetch all possible translation rules for all language tags (including alternates and the fallbacks), from the backing store. The store will return a list reference of LanguageRule objects.

2)

Sort the list based on the implementation defined in the Locale::MakePhrase::RuleManager module.

3)

Select the rule instance for which the rule-expression evaluates to true for the supplied program arguments (if there is no expression, the rule is always true).

4)

If no rules have been selected, then make a rule from the input string.

5)

Apply the program arguments to the rule's translated text. If the argument is a text phrase, it undergoes the language translation procedure. If the argument is numeric, it is formatted by the format_number method.

6)

We apply the output character set encoding to convert the text from UTF-8 into the preferred character set. (This does nothing if the output encoding is UTF-8.)
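
Step 5 can be sketched in a few lines of plain Perl. The apply_args helper below is hypothetical and much simpler than the real method (it does not translate text arguments or format numbers); it assumes that placeholders beyond [_1] follow the same [_N] pattern, and it shows how placeholders allow a translation to reorder its arguments.

```perl
use strict;
use warnings;

# Replace each [_N] placeholder with the N-th argument (1-based).
# Missing arguments become an empty string, mirroring the default
# behaviour described for die_on_bad_args.
sub apply_args {
    my ($template, @args) = @_;
    $template =~ s{\[_(\d+)\]}{
        my $value = $1 >= 1 ? $args[$1 - 1] : undef;
        defined $value ? $value : '';
    }ge;
    return $template;
}

print apply_args('You selected [_1] [_2]', 4, 'balls'), "\n";
print apply_args('[_1] [_2] selected',     4, 'balls'), "\n";
```

The same two arguments produce "You selected 4 balls" and "4 balls selected", depending only on where the translated text places its placeholders.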

$string format_number($number)

This method implements the number-specific formatting. The default implementation will stringify the number (which usually means that we put in a comma separator) by delegating the call to the stringify_number method.

To provide custom handling of number formatting, you can do one of:

a)

Implement 'per-language' number formatting, by sub-classing the Locale::MakePhrase::Language module, then implementing a format_number method.

b)

Sub-class Locale::MakePhrase, then overload the format_number method.

c)

Set the available Locale::MakePhrase number formatting options; these options affect the stringify_number method.

$string stringify_number($number)

This method implements the stringification of a number to a suitable output format (as defined by the numeric_format constructor argument).

When sub-classing Locale::MakePhrase, by overloading this method you can implement custom numeric stringification.

$backing_store fallback_backing_store()

Backing store to use, if not specified on construction. You can overload this in a sub-class.

$string fallback_language()

Language to fall back to, if all others fail (this defaults to 'en'). You can override this method in a sub-class.

Usually this will be the language in which you are writing your application code (eg you may be coding using German rather than English).

Note that this must return an RFC 3066 compliant language tag.

$string_array language_classes()

This method returns a list of possible class names (which must be sub-classes of Locale::MakePhrase::Language) which can get prepended to the language tags for this instance. Locale::MakePhrase will then try to dynamically load these modules during construction.

The idea is that you simply need to put your language-specific module in the same directory as your sub-class, so that we will find the custom modules.

Alternatively, you can override this method in a sub-class, to return the correct class hierarchy name.

$format numeric_format(<$format>)

This method allows you to set and/or get the format that is being used for numeric formatting.

Accessor methods

$hash options()

Returns the options that were supplied to the constructor.

$string_array languages()

Returns a list of the language tags that are in use.

$object_list language_modules()

Returns a list of the loaded language modules.

$object backing_store()

Returns the loaded backing store instance.

$object rule_manager()

Returns the loaded rule manager instance.

$string encoding()

Returns the output character set encoding.

$int malformed_character_mode()

Returns the current UTF-8 malformed character output mode.

$bool die_on_bad_args()

Returns the current state of 'die_on_bad_args'.

$bool show_bad_args()

Returns the current state of 'show_bad_args'.

$bool panic_language_lookup()

Returns the current state of 'panic_language_lookup'.

SUB-CLASSING

These modules can be used standalone, or they can be sub-classed so as to control certain aspects of their behaviour. Each individual module from this group is capable of being sub-classed; refer to each module's specific documentation for more details.

In particular the Locale::MakePhrase::Language module is designed to be sub-classed, so as to support, say, language-specific keyboard input handling.

Construction control

Due to the magic of inheritance, there are two primary ways to control construction of any of these modules:

a)

Overload the new() method

-

Implement the new() method in your sub-class

-

call SUPER::new() so as to execute the parent class constructor

-

re-bless the returned object

For example:

  sub new {
    my $class = shift;
    ...
    my $self = $class->SUPER::new(...sub-class specific arguments...);
    $self = bless $self, $class;
    ...
    return $self;
  }

b)

Overload the init() method.

-

implement the init() method in your sub-class

-

return a reference to the current object.

For example:

  sub init {
    my $self = shift;
    ...
    return $self;
  }

Sub-classing this module

This module (MakePhrase.pm) has a number of methods which can be overloaded:

-

init()

-

fallback_backing_store()

-

fallback_language()

-

language_classes()

-

format_number()

-

stringify_number()

DEBUGGING

Since this module and framework are relatively new, it is quite likely that a few bugs may still exist. By setting the module-specific DEBUG variable, you can enable debug messages to be sent to STDERR.

Set the value to zero, to disable debug. Setting progressively higher values (up to a maximum value of 9), results in more debug messages being generated.

The following variables can be set:

  $Locale::MakePhrase::DEBUG
  $Locale::MakePhrase::RuleManager::DEBUG
  $Locale::MakePhrase::LanguageRule::DEBUG
  $Locale::MakePhrase::BackingStore::Cached::DEBUG
  $Locale::MakePhrase::BackingStore::File::DEBUG
  $Locale::MakePhrase::BackingStore::Directory::DEBUG
  $Locale::MakePhrase::BackingStore::PostgreSQL::DEBUG
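
For example, the variables can be assigned like any other package variable. This is a sketch of assumed usage: the levels chosen are arbitrary, and assigning after the modules have been loaded is an assumption made here so that a value set at load time cannot overwrite yours.

```perl
use strict;
use warnings;

# Assumed usage: place this after the relevant "use" statements.
{
    no warnings 'once';    # each variable is mentioned only once here
    $Locale::MakePhrase::DEBUG              = 3;    # moderate detail
    $Locale::MakePhrase::RuleManager::DEBUG = 9;    # maximum detail
}
```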

NOTES

Text directionality

This module internally uses UTF-8 character encoding for text storage for a number of reasons, one of them being the ability to encode the directionality within the text string using Unicode character glyphs.

However it is up to the application drawing mechanism to support the correct interpretation of these Unicode glyphs, before the text can be displayed in the correct direction.

Localised text layout

In some languages there may be a requirement that we lay out the application interface using a different layout scheme than what would normally be available. This requirement is known as layout localisation. For example, Chinese text may prefer to be laid out top-to-bottom then left-to-right (rather than left-to-right then top-to-bottom).

This module doesn't provide this facility, as it is up to the application layout mechanism to take differences in layout into account. eg: A web-browser uses HTML as a formatting language; web-browsers do not implement top-to-bottom text layout.

SEE ALSO

Locale::MakePhrase is made up of a number of modules, each of which has its own POD documentation. Refer to:

. Locale::MakePhrase::Language
. Locale::MakePhrase::Language::en
. Locale::MakePhrase::LanguageRule
. Locale::MakePhrase::RuleManager
. Locale::MakePhrase::BackingStore
. Locale::MakePhrase::BackingStore::File
. Locale::MakePhrase::BackingStore::Directory
. Locale::MakePhrase::BackingStore::PostgreSQL
. Locale::MakePhrase::Utils

It also uses the following modules internally:

. Encode
. Encode::Alias
. I18N::LangTags

You can (and should) read the documentation provided by the Locale::Maketext module.

BUGS

Multiple levels of quoting

The rule expression parser cannot handle multiple levels of quoting. It needs modification to support this.

Expression parsing failure

The rule expression parser splits the rule into sub-expressions by chunking on ' && '. This means it will fail to parse a text evaluation containing these characters. For example this will fail to parse:

  _1 eq ' && '

Since the ' && ' is not a common text expression, this bug will probably never be fixed.
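
The failure is easy to reproduce outside the module. The sketch below is not the parser's actual code; chunk_rule is a hypothetical stand-in that only mimics the chunking on ' && ' described above.

```perl
use strict;
use warnings;

# Naive chunking on ' && ', as the bug description suggests.
sub chunk_rule {
    my ($expression) = @_;
    return split / && /, $expression;
}

my @ok     = chunk_rule(q{_1 > 0 && _1 < 10});
my @broken = chunk_rule(q{_1 eq ' && '});

print scalar(@ok),     " sub-expressions: ", join(' | ', @ok),     "\n";
print scalar(@broken), " fragments: ",       join(' | ', @broken), "\n";
```

The second expression is a single comparison, yet it is cut into two fragments, each holding one unbalanced quote.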

TODO

Need to add support for male / female context of phrase. This could be implemented using a context specific translation, however the better way would be to add native support for gender.

CREDITS

This module was written for RedSheriff Limited; they paid for my time to develop this module.

Various suggestions and bug fixes were also provided by:

    Brendon Oliver
    John Griffin

LICENSE

This module was written by Mathew Robertson mailto:mathew@users.sf.net for RedSheriff Limited http://www.redsheriff.com. Copyright (C) 2004

This module is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License version 2 (or at your option, any later version) as published by the Free Software Foundation http://www.fsf.org.

This module is distributed WITHOUT ANY WARRANTY WHATSOEVER, in the hope that it will be useful to others.