Lingua::EN::Alphabet::Shaw - transliterate English text from the Latin alphabet to the Shavian alphabet
Thomas Thurman <email@example.com>
    use Lingua::EN::Alphabet::Shaw;

    my $shaw = Lingua::EN::Alphabet::Shaw->new();
    print $shaw->transliterate('I live near a live wire.');
The Shaw or Shavian alphabet was commissioned under the will of the playwright George Bernard Shaw and published in the early 1960s as a replacement for the Latin alphabet for representing English. It is designed to have a one-to-one phonemic (not phonetic) mapping with the sounds of English.
Its ISO 15924 code is "Shaw" (numeric code 281).
This module transliterates English text from the Latin alphabet into the Shavian alphabet.
The API has changed since version 0.03 to be object-based.
If you find an error in the translation database, you can change it yourself at http://shavian.org.uk/wiki/ . You may download a current copy of the dataset at http://shavian.org.uk/set/ . If you want to override the database shipped with this module, place the new copy at ~/.cache/shavian/shavian-set.sqlite and it will be used in preference.
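As a rough sketch of the override step described above, the following copies a downloaded dataset into the per-user location. The function name install_shavian_override and the name of the downloaded file are hypothetical; only the destination path comes from the text above.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(make_path);
use File::Spec;

# Copy a downloaded dataset into the per-user override location.
# $source is the file you downloaded (its name is up to you);
# $home defaults to $ENV{HOME}. Returns the path of the installed copy.
sub install_shavian_override {
    my ($source, $home) = @_;
    $home = $ENV{HOME} unless defined $home;

    my $dir  = File::Spec->catdir($home, '.cache', 'shavian');
    my $dest = File::Spec->catfile($dir, 'shavian-set.sqlite');

    make_path($dir);    # creates intermediate directories; no-op if present
    copy($source, $dest) or die "Cannot copy $source to $dest: $!";
    return $dest;
}
```

Calling install_shavian_override('downloaded-set.sqlite') would then place the copy where the module is documented to look for it.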
Constructor. Currently takes no arguments.
Returns the transliteration of the given phrase into the Shavian alphabet. Can handle multi-word phrases. Does a reasonable job resolving homonym ambiguity ("does he like does?").
If you pass multiple arguments, the results will be concatenated, and only the odd-numbered arguments will be transliterated. The state of homonym resolution is maintained. This allows you to embed chunks of text which should not be transliterated into the line, such as XML tags.
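One way to prepare such an argument list is to split marked-up text into alternating chunks of text and tags. The sketch below is only an illustration: the helper name alternate_text_and_tags is hypothetical, and the pattern is deliberately naive (it does not cope with comments, CDATA, or a ">" inside an attribute value).

```perl
use strict;
use warnings;

# Split markup into alternating chunks: text, tag, text, tag, ...
# Because the pattern has capturing parentheses, split() keeps the
# tags in the result. The first chunk is always text (possibly empty),
# so text lands in the odd-numbered positions (1st, 3rd, ...).
sub alternate_text_and_tags {
    my ($markup) = @_;
    return split /(<[^>]*>)/, $markup;
}

my @chunks = alternate_text_and_tags('<p>I live near a <b>live</b> wire.</p>');
# @chunks is ('', '<p>', 'I live near a ', '<b>', 'live', '</b>', ' wire.', '</p>')

# With the module installed, the chunks could then be passed straight through:
#   my $shaw = Lingua::EN::Alphabet::Shaw->new();
#   print $shaw->transliterate(@chunks);
```

The empty first chunk matters: it keeps the text in the odd-numbered positions even when the input begins with a tag.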
If a word is not found in the translation database, the transliteration routines will call a particular handler to find out what to do, with the unknown word as both its first and second arguments. (This is to allow later expansion; see BUGS AND ISSUES, below.) The result of the handler should be a string, which will be inserted into the result of the transliteration routine at the correct place.
This method allows you to set a new handler by passing it as an argument. If you pass no argument, this method returns the current handler.
The default handler simply returns its first argument unchanged. A replacement handler could, for example, attempt to guess the transliteration; it could die, to abort the transliteration process; or it could return its argument but also record the word in a table, so that a list of missing words could later be reported to the user.
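A minimal sketch of the last kind of handler, one that records unknown words, might look like this. The helper name make_recording_handler is hypothetical, and the method name unknown_handler used in the commented-out wiring is an assumption to check against the module's method list.

```perl
use strict;
use warnings;

# Build a handler that returns the unknown word unchanged, as the default
# handler does, but also counts each one in the hash passed by reference.
sub make_recording_handler {
    my ($missing) = @_;
    return sub {
        my ($word) = @_;    # the second argument currently repeats the first
        $missing->{$word}++;
        return $word;
    };
}

my %missing;
my $handler = make_recording_handler(\%missing);

# Assumed wiring (method name not confirmed here):
#   $shaw->unknown_handler($handler);
#   print $shaw->transliterate('...');
#   print "Not in database: $_\n" for sort keys %missing;
```

Using a closure over a hash reference keeps the list of missing words outside the handler, so the caller can inspect it after transliteration finishes.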
There is a quasi-standard mapping of the conventional alphabet onto the Shavian alphabet. This method maps Shavian text into the conventional alphabet and vice versa. It does not transliterate. Think of this as a kind of ASCII-armouring.
Various versions of the standard map the naming dot to "G", "B", and "/". This method does not support "/", but maps both "G" and "B" to the naming dot; in reverse, it maps the naming dot to "G".
The letters "K" and "L" have no mapping to Shavian letters, and are left alone.
Certain letters in the Shavian alphabet are ligatures of pairs of other letters: because of this, these pairs should not exist separately. (For example, the letter YEW is a ligature of YEA and OOZE.) This method replaces these pairs with their ligature equivalents.
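To illustrate the idea for the one pair named above, the sketch below replaces YEA followed by OOZE with YEW, using their Unicode code points. The function name compose_yew is hypothetical, and this is not the module's own implementation; the other ligature pairs are not covered.

```perl
use strict;
use warnings;

# Replace the pair YEA (U+10458) + OOZE (U+10475) with the ligature
# YEW (U+1047F). Only this one pair, the example given above, is handled;
# a complete table of ligature pairs is not reproduced here.
sub compose_yew {
    my ($text) = @_;
    $text =~ s/\x{10458}\x{10475}/\x{1047F}/g;
    return $text;
}
```

The substitution only fires when the two letters appear in that order, so OOZE followed by YEA is left untouched.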
Given a block of text in the conventional alphabet which is formatted as HTML, this will make a reasonable attempt at returning the same text transliterated into the Shavian alphabet. It is aware of which tags commonly break the flow of sentences, and handles homonym resolution accordingly.
BUGS AND ISSUES
There should be a version of the main transliteration method which returned a list of hashes, each of which gave the source and destination forms of a word, part of speech and disambiguation information, and a marking of the source (CMUDict or Shavian Wiki).
It should probably be possible to transliterate in reverse, from Shavian to the conventional alphabet.
It should be possible to handle other alternative scripts, such as Deseret and Tengwar. This shouldn't be very difficult. It would also allow representation in the IPA, which would mean this module could be used for simple text-to-speech processing.
The portion of the database which is taken from CMUdict exhibits unhelpful mergers (notably father/bother). There isn't much that can be done about this except extending the Shavian wiki further. In addition, in some cases it does not use the letters ARRAY and ADO in unstressed syllables where it should; this could and should be fixed.
It would be useful on initialisation to read a text file in a standard location, which gave a local mapping overriding the database for given words.
It would be helpful if there were a callback for any words found from the CMUDict data rather than from the Shavian Wiki data, so that the wiki could be updated.
The HTML transliterator should mark its output as being encoded in UTF-8, whatever the source encoding. (Shavian cannot be represented in any other standard encoding.)
The HTML transliterator should have an option to wrap each word in a span whose title gives the word's spelling in the conventional alphabet, in the manner of translate.google.com.
The HTML transliterator should have an option to rewrite the destinations of links, and to add a target to them, so that it can be used by a web script to link back to itself.
The HTML transliterator should add a "generator" META tag referencing itself, if one is not already present.
The HTML transliterator should ignore sections marked as being written in non-English languages.
The HTML transliterator should have an option to allow loading documents in chunks, as HTML::Parser already does.
The mapping() method should have an extra parameter to cause it to map in one direction only.
Most of these will be implemented before this module reaches version 1.00.
You will need a Shavian Unicode font to use this module. There are several such fonts at http://marnanel.org/shavian/fonts/ . Please be sure to get a Unicode font and not one with the "Latin mapping".
However, the Mac can handle the Shavian alphabet out of the box.
This Perl module is copyright (C) Thomas Thurman, 2009-2010. This is free software, and can be used/modified under the same terms as Perl itself.
The transliteration data is available under various free licences, which are reproduced below.
Androcles and the Lion
Part of the transliteration data was taken from the 1962 Shavian alphabet edition of "Androcles and the Lion"; this data is in the public domain.
Part of the transliteration data was taken from the Shavian Wiki, and this is available under the Creative Commons cc-by-sa licence.
Another part of the transliteration data was taken from CMUdict. Its licence is reproduced below.
Copyright (C) 1993-2008 Carnegie Mellon University. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. The contents of this file are deemed to be source code.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
This work was supported in part by funding from the Defense Advanced Research Projects Agency, the Office of Naval Research and the National Science Foundation of the United States of America, and by member companies of the Carnegie Mellon Sphinx Speech Consortium. We acknowledge the contributions of many volunteers to the expansion and improvement of this dictionary.
THIS SOFTWARE IS PROVIDED BY CARNEGIE MELLON UNIVERSITY ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY NOR ITS EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The part-of-speech data was taken from the Brown tagger. Its licence is also reproduced below; note that the tagger built into this module is not the Brown tagger, so the licence's first sentence is inaccurate here:
This software was written by Eric Brill.
This software is being provided to you, the LICENSEE, by the Massachusetts Institute of Technology (M.I.T.) under the following license. By obtaining, using and/or copying this software, you agree that you have read, understood, and will comply with these terms and conditions:
Permission to [use, copy, modify and distribute, including the right to grant others rights to distribute at any tier, this software and its documentation for any purpose and without fee or royalty] is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software and documentation, including modifications that you make for internal use or for distribution:
Copyright 1993 by the Massachusetts Institute of Technology and the University of Pennsylvania. All rights reserved.
THIS SOFTWARE IS PROVIDED "AS IS", AND M.I.T. MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. By way of example, but not limitation, M.I.T. MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
The name of the Massachusetts Institute of Technology or M.I.T. may NOT be used in advertising or publicity pertaining to distribution of the software. Title to copyright in this software and any associated documentation shall at all times remain with M.I.T., and USER agrees to preserve same.