NAME
Lingua::FeatureMatrix - Perl extension for configuring groups of (e.g.) phonemes into feature groups
SYNOPSIS
use Lingua::FeatureMatrix;
# this example uses the module provided in the examples directory of
# the distro; you'll want to create your own 'Eme' subclass or
# modify 'Phone.pm' for yourself:
use lib 'examples';
use Phone;
# construct a new feature-matrix from a dat file (here using the
# same dat file as the example below)
my $matrix =
Lingua::FeatureMatrix->new(eme => 'Phone',
file => 'examples/phonematrix.dat');
if ($matrix->matchesFeatureClass('EE', 'VOW')) {
# EE is a "vow", bless this properly
push @Pope::ISA, 'Catholic';
}
if (not $matrix->matchesFeatureClass('AA', 'AFF')) {
# will be executed
$deadman->walking();
}
if ($matrix->matchesFeatureClass('S', 'VOW')) {
# won't happen
map { $_->fly() } @pigs;
}
# silliness aside, you can also dump a filled-out matrix, with all
# the implications spelled out, after loading it:
print $matrix->dumpToText(), "\n";
# you can also ask for a list of the emes that match a given feature class:
print "the vowels are:\n",
join(' ', $matrix->listFeatureClassMembers('VOW')), "\n";
print "the affricates are:\n",
join(' ', $matrix->listFeatureClassMembers('AFF')), "\n";
DESCRIPTION
Lingua::FeatureMatrix is a class for managing user-defined feature-sets. It provides an implementation of datafile parsing that is generic and useful for anyone defining feature sets of symbols.
If you haven't read the "Motivation" section, you might want to skip down to it.
Featuresets are a common way of describing phonetics problems, e.g. sound change behaviors, but may be useful to people solving other problems as well. (The included Letter class may, for example, be useful in writing ligature rules -- if you find this useful for some other application, please contact the author.)
Users must indicate what type of Eme they are working with. In fact, users will probably want to define their own. To do this, define a subclass of Lingua::FeatureMatrix::Eme and pass its name as the eme parameter to the new() method call.
Creating your own Eme type
You should not have to provide very much to construct your own Lingua::FeatureMatrix::Eme subclass that supports all the features you're interested in.
See Lingua::FeatureMatrix::Eme for details on what's required to properly subclass Lingua::FeatureMatrix::Eme.
If you'd rather not follow through on all the details specified there, you can use one of the two stubby subclasses Phone and Letter provided in the examples/ directory of this distribution as a jumping-off point. They too are documented, and have a loose licensing condition for your unrestricted use (see the README).
Methods
Class methods
- new
-
Takes the following key-value named parameters:
- eme
-
Specifies the desired Lingua::FeatureMatrix::Eme subclass to use with this Lingua::FeatureMatrix.
- eme_opts
- featureclass
- featureclass_opts
- file
Instance methods
TO DO: complete documentation for these methods
- matchesFeatureClass
- listFeatureClassMembers
- findEquivalentEmes
- dumpToText
- add_implicature
- implicature_graph
Tutorial
Vocabulary
To keep this system general, there are several important terms to understand:
TO DO: clarify this vocabulary intro
- eme
-
I use the word eme to describe a single unit (one row of the feature matrix).
(Think phoneme or grapheme.)
- implicature
-
Note these are language-specific. (TO DO: Give example here.)
(Think synchronic rule or feature generalization.)
- feature class
-
(Think composite feature.)
- feature
-
(Think single bit of descriptive information.)
Datafile format
You might want to begin by opening the phonematrix.dat file or the lettermatrix.dat file included in the examples directory of this distribution. These use the feature sets defined by Phone.pm and Letter.pm, sample Eme classes each also included in the same directory.
First, some basic terms that make up the underlying grammar of these datafiles:
- SIGN
-
Either +, -, or *, indicating the values of 1, 0, and undef respectively.
- FEATURE
-
A case-sensitive text string like vow indicating the name of the Eme feature. Always used with SIGN.
- FEATURESET
-
A complex grouping of one or more SIGN FEATURE pairs, surrounded by [], like:
[ +voice +fric +stop ]
- PHONESYMBOL
-
A string of characters matching the /\S+/ regular expression. This is so widely accepting because of the large variety of phonetic representation schemes available. Leaving this agnostic allows users to use, e.g.: (TO DO: include examples here)
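As a rough illustration of the SIGN, FEATURE and FEATURESET terms above, here is a minimal sketch that turns a FEATURESET string into a Perl hash. The helper name parse_featureset and the %VALUE_OF_SIGN table are hypothetical and are not part of Lingua::FeatureMatrix; the sketch also assumes feature names consist of word characters only.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::FeatureMatrix: turn a
# FEATURESET string such as "[ +voice -stop *tense ]" into a hash
# mapping each FEATURE name to 1, 0, or undef.
my %VALUE_OF_SIGN = ( '+' => 1, '-' => 0, '*' => undef );

sub parse_featureset {
    my ($text) = @_;

    # Strip the surrounding brackets, if present.
    (my $body = $text) =~ s/^\s*\[\s*|\s*\]\s*$//g;

    my %features;
    for my $pair (split ' ', $body) {
        # Assumption: a FEATURE name is made of word characters.
        my ($sign, $name) = $pair =~ /^([-+*])(\w+)$/
            or die "not a SIGN FEATURE pair: '$pair'\n";
        $features{$name} = $VALUE_OF_SIGN{$sign};
    }
    return \%features;
}
```

Calling parse_featureset('[ +voice +fric +stop ]') would return a reference to a hash in which voice, fric and stop are all 1. A feature marked with * is stored with an undef value, so exists() rather than defined() tells you whether it was mentioned at all.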
Each line in the datafile should be considered an entire statement. You'll find that the datafiles are made up of four kinds of lines: Comments, Eme descriptions, Implicatures, and Feature classes. Future versions of this module may include more types of lines.
All lines are insensitive to whitespace, except for a Comment line (which isn't a Comment at all unless there is no whitespace before the '#').
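To make the four kinds of line and the whitespace rule concrete, here is a small sketch of a line classifier. It is not the module's parser: classify_line is a hypothetical name, the regular expressions are my own approximations of the forms described in the list that follows, and the 'blank' and 'unknown' results are assumptions about cases this document does not cover.

```perl
use strict;
use warnings;

# Hypothetical classifier, not the module's parser: decide which of the
# four kinds of statement a datafile line is.
sub classify_line {
    my ($line) = @_;

    # A comment must have '#' as the very first character of the line.
    return 'comment' if $line =~ /^#/;

    # Everywhere else, leading and trailing whitespace is irrelevant.
    $line =~ s/^\s+|\s+$//g;

    return 'blank'         if $line eq '';                 # assumption
    return 'implicature'   if $line =~ /^\(.*=>.*\)$/;
    return 'feature class' if $line =~ /^class\s/;
    return 'eme'           if $line =~ /^\S+\s*\[.*\]$/;
    return 'unknown';                                      # assumption
}
```

Note that classify_line("  # indented") does not return 'comment', which mirrors the rule that a '#' preceded by whitespace does not start a comment.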
- Comments
-
Any line beginning with a '#' is a comment, and the entire line is ignored. Note if the '#' is not the first character on a line, it is not ignored. This is the only place that whitespace is considered in this grammar.
- Eme descriptions
-
Any line which takes the form:
PHONESYMBOL FEATURESET
For example,
CH [ +stop +fric -voice ]
J [ +stop +fric +voice ]
S [ +fric +sib +alv -voice ]
Z [ +fric +sib +alv +voice ]
SH [ +fric +sib +pal -voice ]
ZH [ +fric +sib +pal +voice ]
AA [ +low -back -front -tense ]
IY [ +high +front +tense ]
It is acceptable, even encouraged, to "underspecify", that is, to specify only those features which are needed to distinguish each phone from its neighbors. If you do so, you will probably want to include extra implicatures though, since any Eme that does not have all its features specified after the implicatures are processed will invoke a carp, which can get irritating.
- Implicatures
-
Any line which takes the form
( FEATURESET => FEATURESET )
represents an Implicature. The left FEATURESET is called the implier and the right is called the implicant.
As a special case, the FEATURESETs involved may omit the [] if there is only one feature.
Implicatures allow the user to easily encode lots of different Emes by encoding general "common sense" ideas. For example:
( +stop => +cons )
This means that an Eme that is +stop should be marked +cons by implication. (If this isn't an obvious implication, you may need some phonology review, or you may be speaking Czech or Berber, and I can't help you much with either problem.)
Note that more than one feature may imply the same setting, even to the same Eme. This is acceptable:
( +fric => +cons )
( +stop => +cons )
Both of these will apply to the following Eme definition:
CH [ +fric +stop -voice ]
Implicatures are one-way, or else the following wouldn't work:
( -tense => +vow )
( +tense => +vow )
(The two implications above indicate that if tense is specified at all, then vow should be + by implication.)
An implicature need not set only a single feature in the implicant, nor is it restricted to only one feature in the implier.
( +sib => [ -voice +cons ] )
( [ +vow +cons ] => [ +glide ] )
Note that some implicatures can point out that a certain field had better *not* be set (to either plus or minus); here we use the 'ungrammatical' * marker:
( +cons => *tense )
( +vow => [ *stop *fric ] )
The first example above indicates that if cons is true, then it is ungrammatical to specify a boolean value for tense, and the second indicates that if vow is true, then it is ungrammatical to specify stop or fric. Note that the *fric setting may not be correct in languages other than English; that's the point of putting all this in a configuration file.
Sometimes putting "obvious" things into implicatures can help catch silly mistakes in your eme definitions, especially when you can specify ungrammaticality:
# can't be both high and low (though [-high -low] is okay)
# seems obvious here...
([+high] => [*low])
([+low] => [*high])
# 200 lines later, by which point we've forgotten our decision
# about the relationship between high and low...
# the following eme definition croaks with a warning:
EH [ +high +low -tense ]
# should have been:
# EH [ -high -low -tense ]
Using a * value sets that feature of the Eme to undef (rather than 1 or 0), which is Perl's way of indicating "neither false nor true, but the question is meaningless."
Note that for the time being, the implicatures are applied in the order that they are submitted to the system. Future editions may involve automatic ordering of the implicatures (see "Future Improvements").
- Feature classes
-
Lines of the following form define named feature classes:
class AFF => [ +stop +fric ]
class LOW_VOW => [ +low +vow ]
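The implicature behavior described in the list above (rules applied in the order given, several rules allowed to set the same feature, and * marking a feature that must not be set) can be sketched as follows. This is only an illustration under my own assumptions, not the module's code: an eme is represented as a plain hash, an implicature as an [ implier, implicant ] pair of hashes, and the names same_value, implier_matches and apply_implicatures are hypothetical. Conflicts are returned as messages rather than warned about, which is a choice made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the module's code. An eme is a hash of
# feature => 1 (+), 0 (-), or undef (*); a feature that has not been
# specified at all is simply absent from the hash.

# True when two feature values agree (undef only agrees with undef).
sub same_value {
    my ($x, $y) = @_;
    return !defined $y if !defined $x;
    return defined $y && $x == $y;
}

# True when every feature of the implier is present in the eme with
# the same value.
sub implier_matches {
    my ($eme, $implier) = @_;
    for my $feature (keys %$implier) {
        return 0 unless exists $eme->{$feature}
            && same_value($eme->{$feature}, $implier->{$feature});
    }
    return 1;
}

# Apply each [ implier, implicant ] pair once, in the order given.
# Returns a list of conflict messages so that the caller decides how
# loudly to complain.
sub apply_implicatures {
    my ($eme, @implicatures) = @_;
    my @conflicts;
    for my $rule (@implicatures) {
        my ($implier, $implicant) = @$rule;
        next unless implier_matches($eme, $implier);
        for my $feature (sort keys %$implicant) {
            if (!exists $eme->{$feature}) {
                $eme->{$feature} = $implicant->{$feature};
            }
            elsif (!same_value($eme->{$feature}, $implicant->{$feature})) {
                push @conflicts, "conflicting value for '$feature'";
            }
        }
    }
    return @conflicts;
}
```

With this representation, CH [ +fric +stop -voice ] run through ( +fric => +cons ) and ( +stop => +cons ) ends up with cons set to 1 and no conflicts, while EH [ +high +low -tense ] run through the two high/low rules above yields two conflict messages.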
Motivation
I need a tool that constructs objects representing the featureset of a phoneme. The standard linguistic notation for this is (for the 'ch', the 'eh', and the 's' sound in "chess"):
CH [ +stop +fric -voice +palat +cons -vow ]
EH [ +vow -cons -low -high +front -tense ]
S [ +cons +fric -stop +alv -voice ]
Furthermore, I may want to be able to refer to "feature classes", that is, composite features like "affricate":
class AFF => [ +stop +fric ]
(this example would match 'CH' but not 'S' or 'EH').
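One way to picture that matching rule is the sketch below, which checks each of the three "chess" emes against the affricate class. It is a standalone illustration, not the module's matchesFeatureClass implementation; the name matches_class and the plain-hash representation of emes and classes are assumptions of mine.

```perl
use strict;
use warnings;

# Hypothetical sketch: an eme matches a feature class when it carries
# every feature the class lists, with the same value.
sub matches_class {
    my ($eme, $class) = @_;
    for my $feature (keys %$class) {
        my $have = $eme->{$feature};
        return 0 unless defined $have && $have == $class->{$feature};
    }
    return 1;
}

# The three emes from the "chess" example, written as hashes.
my %emes = (
    CH => { stop => 1, fric => 1, voice => 0, palat => 1, cons => 1, vow => 0 },
    EH => { vow => 1, cons => 0, low => 0, high => 0, front => 1, tense => 0 },
    S  => { cons => 1, fric => 1, stop => 0, alv => 1, voice => 0 },
);
my %aff = ( stop => 1, fric => 1 );

for my $symbol (sort keys %emes) {
    printf "%-2s %s\n", $symbol,
        matches_class($emes{$symbol}, \%aff) ? 'matches AFF' : 'does not match AFF';
}
```

EH fails because it says nothing about stop or fric, and S fails because it is explicitly -stop.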
To complicate things further, the list of primitive features is linguistically controversial; the set of relevant classes varies from language to language, even if you agree on the theoretical primitives; and the choice of symbol set to represent the phoneme (IPA, SAMPA, DARPA-bet, etc.) is varied and political.
Thus, in the finest Perl sense, TMTOWTDI. The dimensions of flexibility provided are:
You, the user, define what you want to be the featureset by subclassing Lingua::FeatureMatrix::Eme, distributed with this module. An added side bonus is that you decide whether the base unit is a Phone or a Phoneme (or, for that matter, a SoundUnit or a Letter). That subclass is your module, and the goal is to "[put] the focus not so much onto the problem to be solved, but rather onto the person trying to solve the problem." (see Larry Wall's talk on Perl and postmodernism http://kiev.wall.org/~larry/pm.html).
You, the user, define what the feature set is, and you define how the phones (er, emes) distribute among those features, using the best of Impatience -- use the existing linguistic typographic conventions, and this module takes care of constructing your objects for you. No translating among conventions for us (that wouldn't be Lazy!).
But let's go one step further. Languages include redundancy, and sometimes it's boring (and not Lazy) to have to specify yourself that something that is [+stop] is also [-vow +cons], especially if you have to specify this for every single [+stop] consonant.
So this module also introduces the concept of an implicature -- you can say, in simple, linguistically-familiar format, that
( [+stop] => [-vow +cons] )
and this will apply for all phones in the current dataset (unless I'm speaking Berber, where this isn't necessarily true...). It's also Lazy, because the module also does the work of letting me know whether I have forgotten to specify any of the features of a given phone:
# probably missing a feature or six; would generate a warning.
T [ +cons -vow ]
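A check of that kind might look like the sketch below. It is only a guess at the idea, not the module's code: the name unspecified_features and the short feature list are made up for illustration, and I am assuming that a feature set with * counts as specified.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical check, not the module's code: report every feature that
# is still absent from an eme. A feature set with '*' is stored as
# undef but still counts as specified, hence exists() rather than
# defined().
sub unspecified_features {
    my ($eme, @all_features) = @_;
    return grep { !exists $eme->{$_} } @all_features;
}

my @all_features = qw(cons vow stop fric voice);   # illustrative list
my %t = ( cons => 1, vow => 0 );                   # T [ +cons -vow ]

if (my @missing = unspecified_features(\%t, @all_features)) {
    carp "T is missing features: @missing";
}
```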
Along the way, we pick up some Hubris:
Doesn't apply just to phones anymore -- we can use it for letters and ligatures, if we want.
It should be extensible to use these objects to connect to other linguistics-style programs like Lingua::SoundChange, not to mention homebrew pronunciation algorithms like Lingua::Soundex.
HISTORY
- 0.01
-
Original version; created by h2xs 1.21 with options
-CAX Lingua::FeatureMatrix
- 0.02
-
Now includes lots of error-checking code for handling implicatures better. Lots still remains to do, but the module can now probably be understood by somebody who hasn't read the whole code.
Also includes a lot of documentation, among which is an elaborate "Motivation".
- 0.03
- 0.04
- 0.05
If you find any bugs or need additional features, please inform the author -- and check CPAN; this module is under development and may have recently added the feature you need.
Further reading
For some discussion and ideas about applications of feature matrices:
A phonetics description:
http://www.essex.ac.uk/speech/teaching-01/documents/df-theory.html
Here's an implementation possibility:
I bet Knuth's typography book might lead
TO DO: add more
AUTHOR
Jeremy Kahn, <kahn@cpan.org>
Special thanks to Dr. Kate Davis, who acted as the phonetics-theory sounding board for this project.
Future Improvements
- add testing cases
-
Includes understanding why the limited test cases provided here fail.
- connect to others
-
E.g. Lingua::SoundChange.
- autosort implicatures
-
would require building a graph and toposorting it
- add diachronic/sound change functions
-
But make sure we don't rebuild the wheel.
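For the "autosort implicatures" item above, one possible shape of the graph-and-toposort idea is sketched here. Nothing in it is provided by Lingua::FeatureMatrix: the function name toposort_implicatures, the [ implier, implicant ] hash-pair representation, and the ordering criterion (a rule runs before any rule whose implier tests a feature it sets) are all assumptions of mine. The sort itself is Kahn's algorithm.

```perl
use strict;
use warnings;

# Hypothetical sketch of the "autosort" idea. Each rule is
# [ implier, implicant ], both hashes of feature => value. Rule $i must
# run before rule $j when $i's implicant sets a feature that $j's
# implier tests.
sub toposort_implicatures {
    my (@rules) = @_;

    # $before{$i}{$j} is set when rule $i must run before rule $j;
    # $waiting{$j} counts how many rules must still run before $j.
    my (%before, %waiting);
    $waiting{$_} = 0 for 0 .. $#rules;
    for my $i (0 .. $#rules) {
        for my $j (0 .. $#rules) {
            next if $i == $j;
            my @shared = grep { exists $rules[$j][0]{$_} } keys %{ $rules[$i][1] };
            next unless @shared;
            $before{$i}{$j} = 1;
            $waiting{$j}++;
        }
    }

    # Kahn's algorithm: repeatedly emit a rule nothing is waiting on.
    my @ready = grep { $waiting{$_} == 0 } 0 .. $#rules;
    my @order;
    while (@ready) {
        my $i = shift @ready;
        push @order, $i;
        for my $j (sort { $a <=> $b } keys %{ $before{$i} || {} }) {
            push @ready, $j if --$waiting{$j} == 0;
        }
    }
    die "implicatures depend on each other in a cycle\n" if @order < @rules;
    return @rules[@order];
}
```

Given ( +cons => *tense ) followed by ( +stop => +cons ), this would return the +stop rule first, so that cons is already set by the time the +cons rule is tried. Rules that feed each other in a loop cannot be ordered this way, which is why the sketch dies on a cycle.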
SEE ALSO
perl.