
NAME

Lingua::SoundChange - Create regular sound changes

VERSION

This documentation describes version 0.05 of Lingua::SoundChange, released on 2002-05-20.

SYNOPSIS

  use Lingua::SoundChange;

  my $lat2port = Lingua::SoundChange->new($variables, $rules);
  # or
  my $lat2port = Lingua::SoundChange->new($variables, $rules, $options);

  my $original = [ 'first word', 'second word' ];
  my $translation = $lat2port->change($original);
  # changed words are now in $translation->[0]{word}
  #                      and $translation->[1]{word}, respectively.

DESCRIPTION

Introduction

This module is a sound change applier. With it, you can construct objects which will generate consistent sound changes. One way to use this is, for example, to simulate the sound change from one language to another (such as from Latin to Portuguese). It was inspired by Mark Rosenfelder's sound change applier program; see "SEE ALSO" for more information and a URL.

This module has an object-oriented interface. To use it, construct a Lingua::SoundChange object, which you can then use to apply sound changes to words. You can also have several different sound change objects around simultaneously, for example, to show sound change from a parent language to several different daughter languages, each with different sound change rules.

Methods

new(variables, rules [, options])

The constructor new creates a new Lingua::SoundChange object. It takes two or three parameters: a hash ref, an array ref, and another (optional) hash ref.

variables

The first parameter is a hash ref listing zero or more "variables". These are shortcuts for character classes. (Variable names may only be one character long unless you use the longVars option; see the description of that option below.) For example, you could define S to be any stop, or F to be any front vowel. These are useful in the ruleset, described below. If you do not wish to use any variables, pass in a reference to an empty hash as the first parameter of the constructor.

Variables are often given capital letters to distinguish them from the "data" letters used in the rules, which are usually lowercase. This is not a requirement; however, note that if you have a source letter with the same name as a variable, the behaviour is undefined.

The keys of this hash ref are the names of the variables; each value is a string of the letters which make up that variable. This is similar to a character class in Perl's regular expressions (e.g. [aeiou] for a vowel); however, you should not include the brackets in the value.

For example, to make V a list of voiced consonants and U a list of corresponding unvoiced consonants, you could pass something like this to new:

  { V => 'bdg', U => 'ptk' }

If you use the "longVars" option, you can give your variables longer names. The variable names can contain any character except for angle brackets. If you use this option, then variable names must be enclosed in angle brackets in rules to differentiate them from normal letters. The above variables could then be written, for example, as:

  { '-vcd' => 'ptk', '+vcd' => 'bdg' }

Note that you must enclose the variable names in quotes if they contain characters other than letters, digits, and underscores.

rules

The second parameter is an array ref listing zero or more "rules". These rules describe which sound changes to apply in which environments. The sound changes will be applied in the order in which these rules are presented.

For more information on the format of these rules, see "Format of sound change rules".

If you use the option longVars, then variables must be enclosed in angle brackets, for example:

  <-vcd>/<+vcd>/<vowel>_<vowel>

for consonant voicing between vowels, assuming you have variables called "-vcd", "+vcd", and "vowel".
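One way to picture the angle-bracket notation is as a stand-in for a bracketed character class. The following is only a sketch of that idea: expand_long_vars and class_for are hypothetical helpers written for this illustration, the variable values are assumed, and nothing here is claimed to be how the module expands variables internally.

```perl
use strict;
use warnings;

# Assumed variable table using long names, as in the longVars example.
my %vars = (
    '-vcd'  => 'ptk',
    '+vcd'  => 'bdg',
    'vowel' => 'aeiou',
);

# Hypothetical helper: look up one variable and wrap its letters in brackets.
sub class_for {
    my ($name, $vars) = @_;
    die "Unknown variable <$name>\n" unless exists $vars->{$name};
    return "[$vars->{$name}]";
}

# Hypothetical helper: replace every <name> in a string with its class.
sub expand_long_vars {
    my ($text, $vars) = @_;
    $text =~ s{<([^<>]+)>}{ class_for($1, $vars) }ge;
    return $text;
}

my $environment = expand_long_vars('<vowel>_<vowel>', \%vars);
print "$environment\n";    # prints [aeiou]_[aeiou]
```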

NOTE: You should not use characters in the rules or variable names which are special to regular expressions. This includes the following characters: . * + ? [ ] { } ( ). (Exception: the use of parentheses to mark something as optional in an environment.)

options

The third, optional, parameter is a hash ref of options which control what data is output or in which format the translated words are returned. Each key in the hash takes a Boolean value (true or false).

Possible options are:

longVars

If this option is set to a true value, then you can use long variable names (of more than one letter). These must be enclosed in angle brackets in rules.

Default: false

NOTE: The default may change to 'true' in the future, as I feel the 'longVars' behaviour is more convenient. The short variables will then only be supported for reasons of backwards compatibility to earlier versions of this module and for compatibility to Mark Rosenfelder's sounds program, but you will have to ask for the possibility explicitly by setting the longVars option to a false value.

The constructor returns a new Lingua::SoundChange object on success. On failure, the constructor will croak.

change(words)

Once you have constructed a Lingua::SoundChange object, you can use it to apply the sound changes you have described on words.

Pass in an array ref with one 'word' per array element. (Actually, each element can be anything you wish, but transformations are probably most commonly applied to individual words.) The sound changes specified in the constructor will be applied to each word in turn. The result will be an arrayref containing the transformed words. Each individual element of the arrayref will itself be a hashref with the following keys:

word

The transformed word

orig

The original word, the way you passed it into the function

rules

A reference to an array saying which rules applied to the word and at which character position they applied. (The array may be empty if no rules applied to this word.)

Each element of the output will look something like this:

  s-> /_# applies to secundus at 7 

(including a trailing newline). This can be useful while you are debugging your ruleset, for example; you can print out or otherwise examine this list to see how words go from the original form to their final modified form.
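As a small sketch of how the rules list might be examined while debugging, the following walks over a result structure and builds a report. The structure is written by hand here so that the example runs without the module; it only mirrors the documented keys. The first entry carries just the one trace line quoted above, and the second entry is made up to show the empty case.

```perl
use strict;
use warnings;

# Hand-written stand-in for the array ref that change() returns.
my $translation = [
    {
        orig  => 'secundus',
        word  => 'segundo',
        rules => [ "s-> /_# applies to secundus at 7\n" ],
    },
    { orig => 'et', word => 'et', rules => [] },    # made-up entry
];

my $report = '';
for my $entry (@$translation) {
    $report .= "$entry->{orig}:\n";
    if (@{ $entry->{rules} }) {
        # Trace lines already end in a newline, so none is added here.
        $report .= "  $_" for @{ $entry->{rules} };
    }
    else {
        $report .= "  (no rules applied)\n";
    }
}
print $report;
```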

Note that change does not do any splitting of text into words for you; this is left up to you. The reason for this is that the concept of a word is left up to the user of the module. A simple case would be "a sequence of \w characters" or "a sequence of non-space characters".
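The two simple word definitions mentioned above could be sketched as follows. The sample text is made up; either resulting array could then be passed to change as a reference.

```perl
use strict;
use warnings;

my $text = 'lector doctor, focus jocus.';

# "A sequence of \w characters": punctuation is dropped.
my @by_word_chars = $text =~ /(\w+)/g;

# "A sequence of non-space characters": punctuation stays attached.
my @by_whitespace = split ' ', $text;

print join('|', @by_word_chars), "\n";    # lector|doctor|focus|jocus
print join('|', @by_whitespace), "\n";    # lector|doctor,|focus|jocus.
```

Which definition is appropriate depends on your data; with the second one, a rule anchored with # would see the punctuation as part of the word.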

EXPORT

None.

This module only has an object-oriented interface and does not export anything.

LONG EXPLANATION

The following explanation is largely taken from Mark Rosenfelder's own description of his sound change applier program sounds, and modified as appropriate for this module. The I in the following narrative is Mark's, not mine.

Note that all of Mark's examples use the short variable form, where each variable may only be one letter long and is not enclosed in angle brackets in the matching.

Basic operation

Lingua::SoundChange takes words as input, applies a set of sound changes described in variables and rules, and returns a set of modified words.

For instance, Lingua::SoundChange will take the input data, variables, and rules on the left and produce the output on the right:

  Input         Variables               Output

  lector        V => 'aeiou'            leitor
  doctor        C => 'ptcqbdgmnlrhs'    doutor
  focus         F => 'ie'               fogo
  jocus         B => 'ou'               jogo
  districtus    S => 'ptc'              distrito
  civitatem     Z => 'bdg'              cidade
  adoptare                              adotar
  opera         Rules                   obra
  secundus                              segundo
                s//_#
                m//_#
                e//Vr_#
                v//V_V
                u/o/_#
                gn/nh/_
                S/Z/V_V
                c/i/F_t
                c/u/B_t
                p//V_t
                ii/i/_
                e//C_rV

Format of sound change rules

Hopefully, the format of the rules will be familiar to any linguist. For instance, here's one sound change:

  c/g/V_V

This rule says to change c to g between vowels. (We'll see how to generalize this rule below.)

More generally, a sound change looks like this:

  x/y/z

where x is the thing to be changed, y is what it changes to, and z is the environment.

The z part must always contain an underline _, representing the part that changes. That can be all there is, as in

  gn/nh/_

which tells the module to replace gn with nh unconditionally.

The character # represents the beginning or end of the word. So

  u/o/_#

means to replace u with o, but only at the end of the word.

The middle (y) part can be blank, as in

  s//_#

This means that s is deleted when it ends a word.
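To make the x/y/z format concrete, here is a minimal sketch that takes a rule apart and applies it with a Perl substitution. apply_simple_rule is a hypothetical helper written for this illustration; it understands only literal letters and the # marker (no variables, no optional elements), and it is not claimed to be how Lingua::SoundChange itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one rule made of literal letters and '#'.
sub apply_simple_rule {
    my ($rule, $word) = @_;

    # Split "x/y/z" into what changes, what it changes to, and the environment.
    my ($from, $to, $env) = $rule =~ m{^([^/]+)/([^/]*)/(.+)$}
        or die "Malformed rule: $rule\n";

    # The underline separates the left context from the right context.
    my ($left, $right) = $env =~ /^([^_]*)_([^_]*)$/
        or die "Environment needs exactly one underline: $env\n";

    # Turn each context into regex source; '#' becomes an anchor.
    $left  = join '', map { $_ eq '#' ? '^'  : quotemeta($_) } split //, $left;
    $right = join '', map { $_ eq '#' ? '\z' : quotemeta($_) } split //, $right;

    my $pattern = '';
    $pattern .= "(?<=$left)" if length $left;
    $pattern .= quotemeta $from;
    $pattern .= "(?=$right)" if length $right;

    $word =~ s/$pattern/$to/g;
    return $word;
}

my $word = 'secundus';
$word = apply_simple_rule($_, $word) for ('s//_#', 'u/o/_#');
print "$word\n";    # prints secundo
```

Only the final s is deleted, because the first s in secundus is not followed by the end of the word.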

Variables

The environment (the z part) can contain variables, like V above. These are defined in the first parameter to the constructor. I use capital letters for this, though this is not a requirement. Variables can only be one character long (unless the longVars option is set). You can define any variables needed to state your sound changes. E.g. you could define S to be any stop, or K for any coronal, or whatever.

So the variable definition and rule

  F => 'ie'

  c/i/F_t

means that c changes to i after a front vowel and before a t.

You can use variables in the first two parts as well. For instance, suppose you've defined

  S => 'ptc',
  Z => 'bdg'

  S/Z/V_V

This means that the stops ptc change to their voiced equivalents bdg between vowels. In this usage, the variables must correspond one for one--p goes to b, t goes to d, etc. Each character in the replacement variable (here Z) gives the transformed value of each character in the input variable (here S). Make sure the two variable definitions are the same length!

A variable can also be set to a fixed value, or deleted. E.g.

  Z//V_V

says to delete voiced stops between vowels, and

  Z/?/V_V

would translate all voiced stops between vowels to a glottal stop ?.
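The three uses of a variable in the first two parts of a rule can be sketched with plain Perl substitutions. This is an illustration under assumptions (hand-written regular expressions standing in for the rules, and a single left-to-right pass), not the module's own code. It also shows why the two definitions need to be the same length: the replacement is looked up by position.

```perl
use strict;
use warnings;

my %vars = (V => 'aeiou', S => 'ptc', Z => 'bdg');

# S/Z pairs up characters by position, so the definitions must match in length.
length($vars{S}) == length($vars{Z})
    or die "S and Z must be the same length\n";

# Lookup table: p => b, t => d, c => g.
my %voiced;
@voiced{ split //, $vars{S} } = split //, $vars{Z};

my $V = "[$vars{V}]";
my $S = "[$vars{S}]";
my $Z = "[$vars{Z}]";

# S/Z/V_V: each stop becomes its voiced counterpart between vowels.
(my $mapped = 'civitatem') =~ s/(?<=$V)($S)(?=$V)/$voiced{$1}/g;

# Z//V_V: voiced stops are deleted between vowels.
(my $deleted = 'cidade') =~ s/(?<=$V)$Z(?=$V)//g;

# Z/?/V_V: every voiced stop between vowels becomes the same fixed character.
(my $glottal = 'cidade') =~ s/(?<=$V)$Z(?=$V)/?/g;

print "$mapped $deleted $glottal\n";    # prints cividadem ciae ci?a?e
```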

Rule order

Rules apply in the order they're listed. So, with the word opera and the rules

  p/b/V_V
  e//C_rV

the first rule voices the p, resulting in obera; the second deletes an e between a consonant and an intervocalic r, resulting in obra.
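The effect of ordering can be seen by running the same two changes in both orders. In this sketch the rules are written out by hand as Perl substitutions (an assumption made so the example runs on its own), with V and C defined as in the Latin example above.

```perl
use strict;
use warnings;

my $V = '[aeiou]';
my $C = '[ptcqbdgmnlrhs]';

# Hand-written regex equivalents of the two rules, keyed by the rule text.
my %rule = (
    'p/b/V_V' => sub { (my $w = shift) =~ s/(?<=$V)p(?=$V)/b/g; return $w },
    'e//C_rV' => sub { (my $w = shift) =~ s/(?<=$C)e(?=r$V)//g; return $w },
);

# Apply the named rules to a word, one after the other.
sub apply_in_order {
    my ($word, @names) = @_;
    $word = $rule{$_}->($word) for @names;
    return $word;
}

my $listed   = apply_in_order('opera', 'p/b/V_V', 'e//C_rV');
my $reversed = apply_in_order('opera', 'e//C_rV', 'p/b/V_V');

print "$listed $reversed\n";    # prints obra opra
```

In the reversed order the e is deleted first, so the p is no longer between vowels when the voicing rule runs, and it stays unvoiced.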

The rules key in each hash returned by change can assist in debugging rules, since it lists exactly which rules applied to each word.

Optional elements in the environment

One or more elements in the environment can be marked as optional with parentheses. E.g.

  u/ü/_C(C)F

says to change u to ü when it's followed by one or two consonants and then a front vowel.
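In regular expression terms, an optional element behaves like a class followed by a question mark. The sketch below writes the rule out by hand as a substitution, assuming C and F as defined in the Latin example; the test words are made up, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the source contains a literal ü

binmode STDOUT, ':encoding(UTF-8)';

my $C = '[ptcqbdgmnlrhs]';
my $F = '[ie]';

# u/ü/_C(C)F: one consonant, an optional second one, then a front vowel.
sub umlaut {
    my ($word) = @_;
    $word =~ s/u(?=$C$C?$F)/ü/g;
    return $word;
}

# One consonant, two consonants, and three consonants before the i.
print join(' ', map { umlaut($_) } qw(muti musti mustri)), "\n";
```

Only the first two words change; in the third, three consonants separate the u from the front vowel, so the environment does not match.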

How to use it

The module is simple-minded and yet powerful... in fact it's powerful in part because it's simple-minded. You can do a lot with these basic pieces.

Input orthography

For instance, you may wonder whether the input data should be based on spellings or phonemes. It doesn't matter: the program applies its changes to whatever you give it. In my example I used conventional spellings, but I could just as easily have used a phonemic rendering. Similarly, I wrote the rules to output orthographic Portuguese, simply to make for an easy example. It would be better to output a phonetic representation. This would help us realize that we really need a sound change

  k/s/_F

that would handle the change from civitatem with /k/ to cidade with /s/.

The module will handle whatever you put into it, including accented characters. If the language you're working with requires a special font, simply edit the source and output data with an editor, using that font. This would allow you to use (say) an IPA font.

To improve my Latin-to-Portuguese rules, for instance, I would certainly want to handle vowel length and stress. I might use accented vowels for this. Of course the program knows nothing about phonetics, so you have to remember to define the variables to match how you've set up the input data. If you use accented vowels, you will want to change the definition of V.

Using digraphs

Though sound changes can refer to digraphs, variables can't include them. So, for instance, the following rule is intended to delete an i onset following an intervocalic consonant:

  i//VC_V

However, it won't affect (say) achior, because the C will not match the digraph ch. You could write extra rules to handle the digraphs; but it's often more convenient to use an orthography where every phoneme corresponds to a single character.

You can write transformation rules at the beginning of your sound change rules to transform digraphs in the input data:

  ph/f/_
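The achior case can be sketched with hand-written substitutions. Two things here are assumptions made for the illustration: the digraph ch is respelled as the otherwise unused letter x (the equivalent of putting a rule ch/x/_ first), and x is added to the consonant variable.

```perl
use strict;
use warnings;

my $V = '[aeiou]';
my $C = '[ptcqbdgmnlrhs]';

# i//VC_V as is: "ch" is two characters, so the letter before the "h"
# is "c", not a vowel, and the rule does not apply.
(my $unchanged = 'achior') =~ s/(?<=$V$C)i(?=$V)//g;

# Respell the digraph first, then apply the same rule with x counted
# as a consonant.
my $C_with_x = '[ptcqbdgmnlrhsx]';
(my $respelled = 'achior')   =~ s/ch/x/g;
(my $changed   = $respelled) =~ s/(?<=$V$C_with_x)i(?=$V)//g;

print "$unchanged $respelled $changed\n";    # prints achior axior axor
```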

Using Lingua::SoundChange for conlang development

To create a child language from a parent, create some input data containing the vocabulary of the parent, then a list of variables and rules containing the sound changes you want to apply. Now use Lingua::SoundChange to generate the child language's vocabulary.

For example, you can download a vocabulary of Methaiun (ftp://ftp.enteract.com/users/markrose/metaiun.lex) and the sound changes for Kebreni (ftp://ftp.enteract.com/users/markrose/kebreni.sc). You can compare this to the Kebreni grammar (http://www.zompist.com/kebreni.htm) in Virtual Verduria (http://www.zompist.com/virtuver.htm).

For me, there is a peculiar, intense pleasure in creating a daughter language with a particular feel to it, merely by altering the set of sound changes. All I can think of to compare it to is creating new animals indirectly, by mutating their DNA.

What sort of sound changes should you use? You can examine the history of any language family for ideas. Some common changes that can form part of your repertoire (with some sample Lingua::SoundChange rules):

Lenition

Stops become fricatives; unvoiced consonants become voiced; stops erode into glottal stops, or h, or disappear. The intervocalic position is especially prone to change.

  S/Z/V_V

Palatalization

Consonants can palatalize before or after a front vowel i e, perhaps ending up as an affricate or fricative.

  k/ç/_F

Monophthongization

Diphthongs tend to simplify. This rule is fun to apply after letting the vanished sounds affect adjoining consonants.

  i//CV_C

Assimilation

Consonants change to match the place or type of articulation of an adjoining consonant.

  D => 'td'

  m/n/_D

Nasalization

A nasal consonant can disappear, after nasalizing the previous vowel.

  'Â' => 'âêîôû',
  N => 'mn'

  V/Â/_N
  N//Â_

Umlaut

A vowel changes to match the frontness or rounding of the next vowel in the word.

  u/ü/_C(C)i

Vowel shifts

One vowel can migrate into a free area of the vowel space, perhaps dragging others behind it.

  a/&/_
  o/a/_
  u/o/_

Tonogenesis

One way tones can originate is for voiced consonants to induce the next vowel to be pronounced in a low pitch.

  Z => 'bdgzvmnlr',
  V => 'aiu',
  L => 'áíú'

  V/L/Z_

Loss of unstressed syllables

  A => 'áéíóú'

  V//AC(C)_

Loss of final sounds

This can really mess up your carefully worked out inflectional system.

  V//_#

The beauty part of using Lingua::SoundChange is that your language will illustrate the Neo-Grammarian principle: sound changes apply uniformly whenever their conditions are met. You may choose to edit the results by hand, however, to simulate the complications of real languages. Analogy can regularize the grammar; words may be borrowed from another dialect where different changes applied; words may be reborrowed from the parent language by scholars.

I pay particular attention to the havoc the sound changes are likely to wreak on the inflectional system. E.g. if a case distinction is maintained in some words and lost in others, it may spread to the second category by analogy.

Sound changes can also result in homonyms. For instance, if you voice intervocalic consonants, meta and meda will merge. You can simply live with this, but if the merger is particularly awkward, the users of the language are likely to invent a new word to replace one of the homonyms. E.g. Latin American Spanish has innovated cocinar "to cook", since the original cocer has merged with coser "to sew".

Using Lingua::SoundChange to find spelling rules

I've also used sounds to model the spelling rules of English. Here the input file lists the spellings of several thousand English words, and the "sound changes" are rules for turning those spellings into a phonetic representation of how the words sound.

Most people think English spelling hopeless; but in fact the rules predict the correct pronunciation of the word 60% of the time, and make only minor errors (e.g. insufficient vowel reduction) another 35% of the time.

A discussion of the rules, including the input and output files, is at http://www.zompist.com/spell.html .

DIFFERENCES BETWEEN sounds AND Lingua::SoundChange

This section lists the differences between Mark Rosenfelder's sounds program and Lingua::SoundChange, and how to convert from sounds input and instructions to Lingua::SoundChange.

Form of input

sounds takes two input files (xxx.lex and yyy.sc) and produces output on standard output (unless the -f option is given) and to a file yyy.out. xxx.lex is the lexicon of the input language, and yyy.sc contains the variables and sound changes and possibly comments.

Lingua::SoundChange splits these two up; the sound change file yyy.sc is passed to the constructor new while the lexicon xxx.lex is passed to change. Also, variables and rules are passed to new separately.

Variables and rules

yyy.sc, the sound change file accepted by sounds, may contain a mixture of variables (which must precede all rules), rules, and comments. Comments are marked by an asterisk * at the beginning of the line.

Lingua::SoundChange requires these two to be split up, and does not accept comments explicitly. However, if the list of sound changes is inside a Perl script, Perl comments can, of course, be used.

Converting a sound change file on-the-fly

Here's a simple way to convert a yyy.sc file on-the-fly into something which is suitable as input to new.

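  my (%vars, @rules);
  open my $sc, '<', 'port.sc' or die "Can't open port.sc: $!";
  while (my $line = <$sc>) {
    next if $line =~ /^\*/;      # skip comment lines
    next unless $line =~ /\S/;   # skip blank lines
    chomp $line;
    if ($line =~ /^(.)=(.+)$/) {
      $vars{$1} = $2;            # variable definition, e.g. V=aeiou
    } elsif ($line =~ m{^[^/]+/[^/]*/.+$}) {
      push @rules, $line;        # sound change rule, e.g. c/g/V_V
    }
  }
  close $sc;
  # \%vars and \@rules can now be passed to new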

Specifying variables and rules in-line

If you specify variables and rules inside your script, rather than reading them in from some external source, you can use Perl comments in appropriate places if you wish. For example, you could translate

  * Vowels
  V=aeiou
  * Consonants
  C=bcdfghjklmnpqrstvwxyz

to

  {
    # Vowels
    V => 'aeiou',
    # Consonants
    C => 'bcdfghjklmnpqrstvwxyz',
  }

and

  * Lenition
  S/Z/V_V
  * Palatalization
  k/ç/_F

to

  [
    # Lenition
    'S/Z/V_V',
    # Palatalization
    'k/ç/_F',
  ]

.

Splitting up words

sounds assumes that xxx.lex will contain one word per line. It does not attempt to split words according to any rules; everything in one line is treated as one word. Therefore, converting a sounds .lex file to input for Lingua::SoundChange is simple; it could be done like this, for example:

  open my $lex, '<', 'latin.lex' or die "Can't read latin.lex: $!";
  chomp(my @words = <$lex>);   # one word per line, newlines removed
  close $lex;

Now \@words can be passed in to change as a list of words to transform.

Format of output

sounds outputs results like this:

  lector --> leitor

or like this:

  leitor [lector]

if the -b switch was passed. Lingua::SoundChange normally outputs nothing; instead, change returns an array reference of hash references, one per word, each containing (among other keys) orig => 'lector' and word => 'leitor'. It is up to the caller to format the output if this is desired.

Command-line switches

sounds takes several command-line switches:

-p

This tells sounds to print out which rules apply to each word. Use the rules key in the hash returned by change in Lingua::SoundChange for this and print out its contents.

-b

This causes sounds to print the original word in brackets behind the changed word, rather than before the changed word and an arrow.

This switch is not supported directly by Lingua::SoundChange; format the output as you desire.

-l

This switch causes sounds to omit the original word from the output, leaving only transformed words. Again, Lingua::SoundChange leaves it up to you to format the output however you wish; it always returns the original word, the transformed word, and a list of rules which applied.

-f

This switch causes sounds to write its output only to yyy.out and not also to the screen.

This switch is not supported directly by Lingua::SoundChange, since it doesn't output anything either to a file or to the screen; instead, it returns the transformed words from change.
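Since formatting is left to the caller, the output styles of sounds can be reproduced with a few lines of Perl. The following is a sketch only: format_entry and its style names are hypothetical, and the result structure is written by hand (with the rules key left out) so that the example runs without the module.

```perl
use strict;
use warnings;

# Hand-written stand-in for the array ref returned by change().
my $translation = [
    { orig => 'lector', word => 'leitor' },
    { orig => 'doctor', word => 'doutor' },
];

# Hypothetical formatter; the style names are chosen for this sketch.
sub format_entry {
    my ($entry, $style) = @_;
    return "$entry->{word} [$entry->{orig}]" if $style eq 'brackets';    # like -b
    return $entry->{word}                    if $style eq 'bare';        # like -l
    return "$entry->{orig} --> $entry->{word}";                          # default
}

for my $style (qw(arrow brackets bare)) {
    print format_entry($_, $style), "\n" for @$translation;
}
```

Writing the lines to a file instead of the screen, as -f does, is then a matter of printing to a different filehandle.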

SEE ALSO

This module was inspired by Mark Rosenfelder's sound change applier, documented at http://www.zompist.com/sounds.html , and by the sample code he provides there. The interface is slightly similar.

AUTHOR

Philip Newton, <pne@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2001, 2002 Philip Newton. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
