These notes are cursory and in time I hope they will develop into a full manual.
Toolbox has taken over from Shoebox as the program under active development. It supports Unicode and other new features. Since much of the basic file format is the same between Shoebox and Toolbox, with Toolbox being a superset of Shoebox, the Shoebox utilities will work quite happily with Toolbox files. Therefore, wherever Shoebox is mentioned in this documentation, the reader should also read Toolbox, unless Toolbox is explicitly mentioned in contrast to Shoebox.
Each of the programs is a command line program and a summary help page is available by simply typing the command with no options and pressing return. This will list all the options available and some notes on their use.
The purpose of sh2xml is to convert a Shoebox database into XML. While Shoebox has the ability to export to XML files, the purpose of a separate offline utility is to provide XML that conforms to a DTD, is in Unicode and comes with supporting information.
The sh2xml command is:
sh2xml -s settings_dir infile outfile
The -s option specifies the directory containing the .prj file, i.e. the directory containing all the .lng files referenced by this database.
Given this information, sh2xml creates an XML file based on the hierarchy given in the database type. If fields are missing, sh2xml inserts them so that the output conforms to the DTD.
sh2xml has a number of other command line options that allow for specifying whether formatting information should be output, etc. One particularly useful option is -x, which is used to specify an XSL stylesheet filename to be inserted into the XML file. Then when the file is rendered, it will be processed by the given stylesheet for rendering purposes.
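As a rough illustration of what inserting a stylesheet reference into an XML file can look like, here is a Python sketch. It assumes the standard xml-stylesheet processing instruction is what gets written; these notes do not say exactly what sh2xml emits. The root element name `database` and the filename `lexicon.xsl` are made up for the example.

```python
from xml.dom import minidom


def add_stylesheet(xml_text: str, href: str) -> str:
    """Return xml_text with an xml-stylesheet instruction placed before the root."""
    doc = minidom.parseString(xml_text)
    # Assumption: the stylesheet is referenced with the standard
    # xml-stylesheet processing instruction.
    pi = doc.createProcessingInstruction(
        "xml-stylesheet", f'type="text/xsl" href="{href}"'
    )
    doc.insertBefore(pi, doc.documentElement)
    return doc.toxml()


# Hypothetical root element and stylesheet name.
print(add_stylesheet("<database/>", "lexicon.xsl"))
```

A browser or other XSLT-aware viewer that opens such a file applies the named stylesheet when rendering it.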
Unless all the data in the database is already in Unicode, it is necessary for it to be converted into Unicode ready for creation of the XML file. Information regarding the encoding of particular data is specified in the Language Properties associated with a field. In Toolbox, for example, it is possible to specify that a particular language is stored in Unicode. sh2xml will use this information to know that no data conversion is necessary on this data.
For other encodings, it is necessary to tell sh2xml how to convert the data to Unicode. If no other information is available, sh2xml will assume that the data is stored in the system codepage (or whatever codepage is specified by the -c option). But that is often not the case. There are other ways of converting data, in particular Windows codepages and TECkit.
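To see why the choice of codepage matters, here is a small Python sketch (not sh2xml's own code) that decodes the same bytes under two Windows codepages. The function name `decode_with_codepage` and the default of 1252 are assumptions made for the example.

```python
def decode_with_codepage(raw: bytes, codepage: int = 1252) -> str:
    """Decode legacy bytes to Unicode using a Windows codepage number."""
    return raw.decode(f"cp{codepage}")


sample = b"caf\xe9"
# Under codepage 1252 the byte 0xE9 is LATIN SMALL LETTER E WITH ACUTE.
print(decode_with_codepage(sample))
# Under codepage 1251 the same byte is CYRILLIC SMALL LETTER SHORT I.
print(decode_with_codepage(sample, 1251))
```

The same byte gives a different character under each codepage, which is why assuming the system codepage is often wrong for legacy data.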
Later in this document is a section on Encoding Conversion that describes encrem and an encoding registry. sh2xml interacts with this registry to do data conversion. encrem works on the basis of encodings having names and then giving details of how to convert from such encodings to Unicode. sh2xml therefore needs a name for a particular encoding and then can use the encoding conversion registry to work out how to do the data conversion. So, for each language that needs data conversion, we need to give an encoding name. This is done by storing information in the language properties itself. In the language properties for a particular language there is a tab labelled "Options". This tab has a comment field. We store the encoding name in the comment field by typing:
\codepage encoding_name
on a line by itself in the comment field. The encoding_name may be the name of an encoding in the encoding registry or it may be a number corresponding to a Windows codepage. Notice that when the language properties are saved and reopened, the \codepage entry will be preceded by a space. This is normal and not a problem: line-initial spaces are ignored by sh2xml.
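The lookup described above can be sketched in Python as follows. This is only an illustration, not sh2xml's implementation. It assumes the entry is the marker `\codepage` followed by whitespace and the encoding name; the function names `find_codepage` and `classify` are invented for the example.

```python
from typing import Optional, Tuple, Union


def find_codepage(comment: str) -> Optional[str]:
    """Return the encoding name from a \\codepage line, or None if absent."""
    for line in comment.splitlines():
        stripped = line.lstrip()          # line-initial spaces are ignored
        parts = stripped.split(None, 1)   # marker, then the rest of the line
        if len(parts) == 2 and parts[0] == "\\codepage":
            return parts[1].strip()
    return None


def classify(name: str) -> Tuple[str, Union[int, str]]:
    """Treat an all-digit name as a Windows codepage, anything else as a registry name."""
    if name.isdigit():
        return ("windows-codepage", int(name))
    return ("registry-name", name)


comment = "Legacy font data\n \\codepage 1252\n"
name = find_codepage(comment)
print(name, classify(name))
```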
sh2xml processes interlinear text into its constituents. Thus an interlinear block consisting of text, morpheme breaks, gloss and part of speech will be broken into individual words with their morpheme breaks each with a gloss and part of speech, rather than four lines of text. This allows for easier processing of interlinear text in XML.
sh2xml works out the interlinear structure itself from the database type information in the settings directory.
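The idea of breaking an interlinear block into per-word bundles can be sketched in Python as below. This is a simplification under stated assumptions, not the algorithm sh2xml uses. It assumes the lines are aligned by character column and that the first line is the text line, and it does not consult the database type. The marker names `tx`, `mb` and `ge` are hypothetical.

```python
import re
from typing import Dict, List


def split_block(lines: Dict[str, str]) -> List[Dict[str, List[str]]]:
    """Split column-aligned interlinear lines into one bundle per text word."""
    markers = list(lines)
    baseline = lines[markers[0]]
    # Each word of the first line opens a column that runs to the next word.
    starts = [m.start() for m in re.finditer(r"\S+", baseline)]
    width = max(len(text) for text in lines.values())
    ends = starts[1:] + [width]
    return [
        {mk: lines[mk][begin:end].split() for mk in markers}
        for begin, end in zip(starts, ends)
    ]


# Hypothetical block: text, morpheme breaks and gloss, padded to column 9.
block = {
    "tx": "dogs".ljust(9) + "barked",
    "mb": "dog -s".ljust(9) + "bark -ed",
    "ge": "dog -PL".ljust(9) + "bark -PAST",
}
for word in split_block(block):
    print(word)
```

Each bundle then holds one word with its own morpheme breaks and glosses, which is the shape that is easier to process than whole lines of text.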
There may be problems with processing interlinear text that is stored in Unicode. This is an urgent TODO.
sh2sh is very similar to sh2xml except that rather than outputting XML, it outputs a Unicode Shoebox database file. Thus it converts all fields to or from Unicode according to the encoding information in the language properties.
Since the resulting encodings are different, while the field markers are the same, the languages associated with each field will be different and so, in effect, the database type is different. Therefore, sh2sh removes the database type heading from the file and the output file has to be imported into Toolbox using a different database type, when ready.
shintr does the same interlinear analysis that sh2xml does, but it does no data conversion and is aimed towards producing an intermediate Shoebox file ready for conversion to RTF. The aim of the combination of shintr and sh_rtf is to be able to produce nicely typeset interlinear text for use in Word. It does this by using equation fields. This makes each interlinear block into, effectively, a single character. Moving blocks around (for discourse charting) or having a long phrase wrap at the end of the line are some of the advantages of this approach.
shintr uses styles to control layout. Within the interlinear block the style associated with each line is a text style. By setting the font formatting for a text style to invisible, the appropriate line in the block will disappear.
Setting up to use shintr involves ensuring that two magic fields are available in the interlinear text database type.
This marker name (usually associated with the \_RTF marker) is used to pass RTF code through sh_rtf without it being processed.
This marker (usually given the name Interlinear block) is the style used for the whole interlinear block paragraph. The actual lines in the block are given character styles.
In addition, all the markers in the interlinear block need to be marked as character styles, otherwise sh_rtf will convert them into paragraphs rather than running text.
Since shintr needs to know about the database type information, the command line is of the form:
shintr -s settings_dir infile outfile
sh_rtf emulates the Shoebox RTF export process but with some enhancements:
- It supports non-Roman scripts better by passing character set information through to RTF.
- It supports the RTF ONLY marker name (usually the \_RTF marker) to pass RTF data through unchanged.
- It has the ability to generate two-column text for annotations via command line options.
The primary purpose of this program is for use with shintr, but it can be used for simple conversion from Shoebox files to RTF. But this is probably best done by the Shoebox program itself, unless you are having character set problems.
Line based merging is the process of taking two files and a common ancestor and creating a third file from the three which incorporates both sets of changes the two files have made to the common ancestor. This is a powerful concept when two different people have edited the same file. Such a tool can create a file which is a combination of the edits that the two people have made. If there is a possible clash this is identified and a human has to edit the file to resolve the clash.
The problem with line based merging is that it doesn't take into account the record structure of a Shoebox file. It is possible to really make a mess of a Shoebox database using a line based merge. Instead a merge needs to take into account the record and field structure of a database. In addition it needs to account for there being multiple records with the same record field.
shdiff3 is such a program. Give it a common ancestor file and two databases and it will produce a new database incorporating both sets of changes. If there is a clash, this is marked in the database using a clash marker (
If the files are not Shoebox files, then the normal diff3 program is called. This allows this program to be used within svn for intelligent merging.
This version of merging allows any number of files with a common ancestor to be used and all the changes to be incorporated in one go.
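To make the contrast with line-based merging concrete, here is a minimal Python sketch of a three-way merge that works on whole records. It is not how shdiff3 is implemented. It assumes a record marker of `\lx` and unique record keys, whereas a real merge also has to cope with several records sharing a key and with merging inside a record. Clashes are returned as a list of keys rather than marked in the output.

```python
from typing import Dict, List, Optional, Tuple

Records = Dict[str, List[str]]


def parse_records(text: str, record_marker: str = "\\lx") -> Records:
    """Group lines into records keyed by the record marker's content."""
    records: Records = {}
    key: Optional[str] = None
    for line in text.splitlines():
        if line.startswith(record_marker + " "):
            key = line[len(record_marker) + 1:].strip()
            records[key] = []
        if key is not None:
            records[key].append(line)
    return records


def merge3(base: Records, ours: Records, theirs: Records) -> Tuple[Records, List[str]]:
    """Merge two edited databases against their common ancestor, record by record."""
    merged: Records = {}
    clashes: List[str] = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            result = o            # both sides agree (or both deleted it)
        elif o == b:
            result = t            # only the second file changed this record
        elif t == b:
            result = o            # only the first file changed this record
        else:
            clashes.append(key)   # both changed it differently
            result = o            # keep the first version and report the clash
        if result is not None:
            merged[key] = result
    return merged, clashes


base = parse_records("\\lx cat\n\\ge cat\n\\lx dog\n\\ge dog\n")
ours = parse_records("\\lx cat\n\\ge feline\n\\lx dog\n\\ge dog\n")
theirs = parse_records("\\lx cat\n\\ge cat\n\\lx dog\n\\ge hound\n\\lx emu\n\\ge emu\n")
merged, clashes = merge3(base, ours, theirs)
print(merged, clashes)
```

Because the comparison is done per record, an edit to one entry can never be spliced into the middle of another, which is the failure a line-based merge risks.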
Several of the utilities allow the conversion of data to or from Unicode. The basic principle of data conversion is that the byte encoding is given a name and this is used to look up a mechanism for converting to or from Unicode.
There are a number of different ways of converting data: system codepages, internal Perl encodings, TECkit, etc. What would be nice is if there were one place to look that would tell how to convert from a given encoding to Unicode.
For this system, we use what is called the Encoding Registry, an XML file containing information about encodings and how they are converted, fonts and how they relate to encodings, and how the various mappings are implemented. For the most part, you as a user don't need to know anything about the specifics of the XML format, but you will need to interact with the encoding registry using tools.
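The principle of looking up a conversion by encoding name can be sketched in Python with a plain dictionary. This is only a stand-in for the idea and says nothing about the real XML format of the Encoding Registry. The encoding name `demo-legacy`, its one-entry byte table and the function names are all invented for the example.

```python
from typing import Callable, Dict

Converter = Callable[[bytes], str]


def table_converter(table: Dict[int, str]) -> Converter:
    """Build a converter from a byte-to-character table (other bytes pass through)."""
    return lambda raw: "".join(table.get(b, chr(b)) for b in raw)


# Hypothetical registry: one named legacy encoding where byte 0x80 is the letter eng.
REGISTRY: Dict[str, Converter] = {
    "demo-legacy": table_converter({0x80: "\u014b"}),
}


def convert(name: str, raw: bytes) -> str:
    """Look the encoding name up, falling back to a Windows codepage number."""
    if name in REGISTRY:
        return REGISTRY[name](raw)
    if name.isdigit():
        return raw.decode(f"cp{name}")
    raise LookupError(f"no conversion known for encoding {name!r}")


print(convert("demo-legacy", b"si\x80"))
print(convert("1252", b"\xe9"))
```

Keeping the lookup in one place means the calling code needs only the encoding name, whatever mechanism ends up doing the conversion.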
One important tool is encrem, the encoding registry manager. It is a command line tool that allows you to enter multiple commands in one session (or even to pipe those commands from a text file to do automatic installation, etc.).
encrem looks in the system registry for the encoding registry and, if it can't find it, will use one you specify on the command line:
encrem -r possibly_new.xml
It then tells you which file it is actually using (whether it found it in the system registry or is using the one you specify). If you are sure you have an existing encoding registry, you don't need to use the -r option to encrem. The next step, if needed, is to add an empty template to the registry file, ready for adding new encodings and mappings, and then to register that file with the system registry:
encrem -r possibly_new.xml
encrem: create
encrem: register
encrem: exit
Notice the different command lines. You can get help at any time by typing a command name followed by help, or simply help to get a short list of commands.
Now that we are sure we have an encoding registry file, we can start adding information to it:
encrem
encrem: add-encoding