
TITLE

sh2xml - Convert Shoebox data to XML

SYNOPSIS

    sh2xml [-s settings_dir] [-a attrib] [-c codepage] [-d file]
            [-x stylesheet] [-e encs] [-m] [-i] [-f] infile [outfile]

Converts Shoebox data to XML based on marker hierarchy and interlinear text.

OPTIONS

    -a attrib       Default attribute name (or tag if -m) [value]
    -c codepage     Set system codepage for this process
    -d file         Output file for DTD
    -e enc,enc      Add Encode:: subsets in Perl 5.8.1
    -f              Add formatting information <shoebox-format>
    -h              Print copious help (the manpage)
    -i              Include DTD in data file (overridden by -d)
    -l lang         Pass language identifier to XML output
    -m              MDF style output with character marker support
    -p              Use XPath for output
    -s dir          Directory to find .typ files in [.]
    -x stylesheet   XSL stylesheet filename to reference in the XML file
    

If outfile is missing, it is created as the input file with extension replaced by .xml. This allows a user to drop a data file on a shortcut.
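
The default output name can be pictured with the following short Python sketch. It is only an illustration of the rule stated above, not code from sh2xml; the function name default_outfile is hypothetical, and how sh2xml treats names with no extension is an assumption.

```python
import os.path

def default_outfile(infile: str) -> str:
    """Return the input path with its extension replaced by .xml.

    Assumption: a name with no extension simply has .xml appended.
    """
    stem, _extension = os.path.splitext(infile)
    return stem + ".xml"

print(default_outfile("dictionary.db"))  # dictionary.xml
```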

DESCRIPTION

The aim of sh2xml is to take Shoebox or Toolbox data and to convert it into a consistent XML structure. To do this, it analyses the database type (.typ file) to create an XML structure and then ensures that the data conforms to that structure. Since XML assumes its data is in a single encoding, sh2xml converts legacy encoded data into Unicode. Interlinear text is handled correctly in that the vertical interlinear relationships within a block are broken out into a tree, making it easier for later analysis and conversion. Finally, sh2xml will also embed the formatting information about each field from the type file if so requested.

Using sh2xml involves two aspects: preparing for conversion, by giving information about encoding conversion and even XML template output; and running the program, knowing which command line option does what. This manual is not a tutorial and so we list all the details with little or no indication of relative priority.

Running sh2xml

Here we list the various command line options and give further details on each.

-a

This option specifies the name of the attribute that should be used to store the value of a field when that field has child elements. This approach is used to avoid the creation of XML mixed models. The default value of -a is value. For example, consider the following short record:

  \lx record
  \ge description

which would be converted to:

  <lx value="record">
    <ge>description</ge>
  </lx>

Notice that only the field with a child uses the attribute. This behaviour may be changed using the -m option.
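
The structure above can be reproduced with a short Python sketch using the standard xml.etree.ElementTree module. This illustrates the output shape only; it is not how sh2xml is implemented, and the function name record_to_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

def record_to_xml(parent_field, child_fields, attrib="value"):
    """Build an element for a field that has child fields.

    parent_field and each entry of child_fields are (marker, text) pairs.
    The parent's text goes into the attribute named by `attrib` (the -a
    option); childless fields keep their text as element content.
    """
    marker, text = parent_field
    elem = ET.Element(marker, {attrib: text})
    for child_marker, child_text in child_fields:
        ET.SubElement(elem, child_marker).text = child_text
    return elem

lx = record_to_xml(("lx", "record"), [("ge", "description")])
print(ET.tostring(lx, encoding="unicode"))
# <lx value="record"><ge>description</ge></lx>
```

Passing attrib="form" in this sketch corresponds to running with -a form: the parent becomes <lx form="record">.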

-c

Specifies the default codepage to be used when converting data. In effect it specifies that sh2xml should act as though it were running on a system with the given default codepage. This means that data in languages with no given encoding conversion will be converted using this codepage.
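
The fallback can be pictured with the Python sketch below. It is an illustration only: the function to_unicode and the cp1252 default are assumptions made for the example, since the real default is whatever codepage the system reports.

```python
def to_unicode(raw: bytes, language_encoding=None, default_codepage="cp1252"):
    """Decode legacy bytes into a Unicode string.

    A language with its own encoding conversion uses that; otherwise the
    default codepage (the value given to -c) is used as the fallback.
    """
    encoding = language_encoding or default_codepage
    return raw.decode(encoding)

# The same byte values mean different text under different codepages.
print(to_unicode(b"caf\xe9"))                              # Western European
print(to_unicode(b"\xc4\xe0", default_codepage="cp1251"))  # Cyrillic
```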

-d

Outputs the auto-generated DTD to the given file.

-e

Perl has internal support for a large number of industry standard encodings. This option specifies which sets to pull in apart from the default set. Values include

  Byte - standard ISO 8859 type single byte encodings
  CN   - Continental China encodings including cp 936, GB 12345 and GB 2312
  JP   - Japanese encodings including cp 932 and ISO 2022
  KR   - Korean encodings including cp 949
  TW   - Taiwanese encodings including cp 950
  HanExtra - more Chinese encodings including GB 18030
  JIS2K - More Japanese encodings
  Ebcdic - surely not!
  Symbols - various symbol encodings

See man Encode::Supported or the corresponding module documentation for details of what is supported on your Perl installation.

-f

Add a formatting section before the data records in the output file. The structure of this section is:

  <!ELEMENT shoebox-format (marker)*>
  <!ELEMENT marker (language, font, interlinear?, original-marker?)>
  <!ATTLIST marker 
    name CDATA #REQUIRED
    style (char | par) #REQUIRED>

  <!ELEMENT language (#PCDATA)>

  <!ELEMENT font (#PCDATA)>
  <!ATTLIST font 
        size CDATA #REQUIRED
        style CDATA #IMPLIED
        color CDATA #IMPLIED>
        
  <!ELEMENT interlinear EMPTY>
  <!ATTLIST interlinear level CDATA #IMPLIED>

  <!ELEMENT original-marker (#PCDATA)>

-h

Print out this document.

-i

Rather than export the generated DTD to an external file, this option specifies that the DTD should be included within the generated XML file.

-l

For template output, the value of this option is passed to the initial template (the one for the whole database) so that a language name, for example, can be output in the root element of the document.

When used with -p, it specifies the language tag for the vernacular text. If the tag includes a script part, that script is considered to be the suppress script for the language; otherwise the script tag comes from the XPath specification in the .typ file.

-m

MDF and perhaps other schemas support inline markers of the form |mk{text}, and sh2xml is able to work with these schemes. However, XML cannot include markup in an attribute, so the -m option changes the basic output structure to place the text in its own element as the first child of the field's element. The -a option specifies the name of this inserted element; with -m its default value is _. For example:

  \lx record
  \ge description

is output as:

  <lx>
    <_>record</_>
    <ge>
      <_>description</_>
    </ge>
  </lx>
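
The same structure can be sketched in Python with xml.etree.ElementTree. This is an illustration of the output shape rather than sh2xml's own code; field_to_element is a hypothetical name, and the handling of inline |mk{text} markers is not shown.

```python
import xml.etree.ElementTree as ET

def field_to_element(marker, text, children=(), text_tag="_"):
    """Build a field element in the -m style.

    The field's text is wrapped in its own element, named by `text_tag`
    (the -a option, _ by default), inserted as the first child.
    """
    elem = ET.Element(marker)
    ET.SubElement(elem, text_tag).text = text
    for child in children:
        elem.append(child)
    return elem

ge = field_to_element("ge", "description")
lx = field_to_element("lx", "record", [ge])
print(ET.tostring(lx, encoding="unicode"))
# <lx><_>record</_><ge><_>description</_></ge></lx>
```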

-s

sh2xml requires access to information about the structure of the database and language information. This is held in files in the same directory as the .prj project file used when running Shoebox/Toolbox.

-x

One powerful feature of XML is the ability to specify a default stylesheet that is to be used to render the file as HTML within a browser. This option sets the filename of that XSL stylesheet.

Preparing for Conversion

The basic need is to be able to specify how to convert text in a particular language into Unicode. This can be done by specifying a conversion mapping in each language file. Shoebox and Toolbox do not have a UI for specifying such conversion information, so we add information to the options/description field. The codepage specification takes the form:

  \codepage = value

The specification needs to be on a line on its own. The value can take a number of forms.

name

A mapping name either from the set of names supported by the Perl Encode module, or specified in an SIL Converters repository.

filename.tec

The path and filename of a TECkit binary mapping file. The path is relative to the settings directory.

none

No mapping should be done. The data is assumed to be in UTF-8 encoding.
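
As a rough illustration of the three forms, the Python sketch below finds a codepage line in a description field and classifies its value. The function name codepage_spec and the tolerance for spaces around the equals sign are assumptions for the example; sh2xml's own parsing may differ.

```python
import re

# The specification must be on a line of its own.
CODEPAGE_RE = re.compile(r"^[ \t]*\\codepage[ \t]*=[ \t]*(\S.*?)[ \t]*$",
                         re.MULTILINE)

def codepage_spec(description: str):
    """Return (kind, value) for the first codepage line, or None if absent."""
    match = CODEPAGE_RE.search(description)
    if match is None:
        return None
    value = match.group(1)
    if value == "none":
        return ("none", None)       # data is already UTF-8
    if value.lower().endswith(".tec"):
        return ("teckit", value)    # path relative to the settings directory
    return ("name", value)          # Encode name or SIL Converters mapping

print(codepage_spec("Vernacular language\n\\codepage = cp1252\n"))
```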

When the -f option is in effect, sh2xml outputs the font used for each marker. If the data has been converted, then the font isn't appropriate to that encoding any more. An appropriate font can be specified in the description field using:

  \unicode_font = value

where value is the font name to be used for the Unicode form of the data.

Template XML Generation

sh2xml has the ability to generate XML based on instructions in the database type file. The template for each marker is stored in the description for that field, and the template for the whole file is stored in the description for the whole database. The template takes the form of the XML to be output for the field. Within each template various special strings are replaced with data information:

%V

The value of the field

%M

The field marker

%S(marker)

This looks up the first occurrence of the field specified by marker and outputs its value. The field should already have been output or be encoded in Unicode already.

There are two markers that specify the template to be output.

pre_xml

This specifies what should be output when the field is processed, effectively as the start of an element.

post_xml

This specifies what should be output when the field and all its children have been output.

If there is a pre_xml but no post_xml, nothing is output once the field and all its children have been output. If both are empty then the default XML output is used according to the -m and -a options.

The pre_xml and post_xml markers can be used for the whole database by using them in the database description field. In this case the %M is replaced by the value passed in via the -l option.

For example, consider the following SFM snippet:

  \lx record
  \ge description

with templates for each marker as:

  Marker: lx
  Description:
    \pre_xml <entry><lex><form script="Latn">%V</form></lex>
    \post_xml </entry>

  Marker: ge
  Description:
    \pre_xml <gloss lang="eng">%V</gloss>

Then the output will be:

  <entry><lex><form script="Latn">record</form></lex>
    <gloss lang="eng">description</gloss>
  </entry>
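
The expansion in this example can be sketched in Python as follows. This is a simplified illustration, not sh2xml's implementation: the names expand and render are hypothetical, %S(marker) is not handled, no indentation is produced, and escaping the value for XML is a precaution added here rather than something this manual specifies.

```python
import re
from xml.sax.saxutils import escape

def expand(template, marker, value):
    """Replace %V and %M in one pass so substituted text is not rescanned."""
    subs = {"%V": escape(value), "%M": marker}
    return re.sub(r"%[VM]", lambda m: subs[m.group(0)], template)

def render(field, templates):
    """Render a (marker, value, children) field using its templates.

    templates maps a marker to a (pre_xml, post_xml) pair; pre_xml is
    output first, then the children, then post_xml if there is one.
    """
    marker, value, children = field
    pre, post = templates[marker]
    parts = [expand(pre, marker, value)]
    parts.extend(render(child, templates) for child in children)
    if post:
        parts.append(expand(post, marker, value))
    return "".join(parts)

templates = {
    "lx": ('<entry><lex><form script="Latn">%V</form></lex>', "</entry>"),
    "ge": ('<gloss lang="eng">%V</gloss>', None),
}
record = ("lx", "record", [("ge", "description", [])])
print(render(record, templates))
```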

XPath based Generation

Another approach that is offered by sh2xml is to generate XML using XPath. With this approach each marker has an XPath expression associated with it. This is used by sh2xml to generate the necessary XML elements associated with each field as it is encountered.

XPath was not originally designed to be used for node creation. Instead, it was designed purely for node testing, to see whether a particular node conforms to a particular XPath expression. sh2xml extends it to support node creation, in particular:

.

Any absent child element in an expression is inserted into the output tree.

.

Any absent attribute in an expression is inserted into the output tree.

.

'and' and 'or' have been extended to return nodesets rather than simple booleans. Thus 'X or Y' will return nodeset X if X is non-empty or nodeset Y. Likewise 'X and Y' will return either false if X is empty or return the nodeset Y.

.

The '=' operator has been extended to support assignment. In particular, if there is a nodeset on the left hand side, it will not be tested but all its nodes will have their values set to whatever is on the right hand side of the expression. If this too is a nodeset then only the first node is used.
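
The extensions listed above can be pictured with the Python sketch below, which models a nodeset as a list of xml.etree.ElementTree elements. It is only an illustration of the described semantics, not sh2xml's XPath engine: all four function names are hypothetical, a node's value is simplified to its text, and using an empty string when the right-hand nodeset is empty is an assumption.

```python
import xml.etree.ElementTree as ET

def ensure_child(parent, tag):
    """Return the child named tag, inserting it first if it is absent."""
    child = parent.find(tag)
    if child is None:
        child = ET.SubElement(parent, tag)
    return child

def xp_or(x, y):
    """Extended 'or': nodeset x if it is non-empty, otherwise nodeset y."""
    return x if x else y

def xp_and(x, y):
    """Extended 'and': False if x is empty, otherwise nodeset y."""
    return y if x else False

def xp_assign(nodes, rhs):
    """Extended '=': set every node on the left to the right-hand value.

    If rhs is a nodeset, only its first node supplies the value.
    """
    if isinstance(rhs, list):
        rhs = rhs[0].text if rhs else ""  # empty nodeset: assumption
    for node in nodes:
        node.text = rhs
    return nodes

root = ET.fromstring("<entry><form/><form/><gloss>house</gloss></entry>")
forms, glosses = root.findall("form"), root.findall("gloss")
missing = root.findall("note")

xp_assign(forms, glosses)          # both <form> nodes now hold "house"
ensure_child(root, "note")         # <note/> is created because it was absent
print(ET.tostring(root, encoding="unicode"))
```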

The aim of such an XPath description is to provide a single description that may be used for both XML generation and for conversion from XML back to SFM.

XPath has the concept of variables and this is the mechanism used to pass the information from the field to the XPath expression for use therein.

v

$v contains the value of the field

k

$k contains the value of the key field of the record

rn

$rn contains a unique number for the record. Basically it aims to be a record number.

fn

$fn contains a unique number for the field. It doesn't guarantee to count fields accurately, but it does try to be unique for each XPath called.

If no \xpath entry is given in a field's description, one of three default XPath expressions is used.

.

If -m is used then the XPath used for the lx field will be:

    \xpath lx/_=$v

Obviously if -a changes then the xpath changes too.

.

Otherwise, if the field has children, then the xpath for lx will be:

    \xpath lx[@value=$v]

.

And if there are no children and the text is stored as a text node in the element then the xpath for lx will be:

    \xpath lx=$v