
NAME

Text::TEI::Collate - a collation program for variant manuscript texts

SYNOPSIS

  use Text::TEI::Collate;
  my $aligner = Text::TEI::Collate->new();

  # Read from strings.
  my @manuscripts;
  foreach my $str ( @strings_to_collate ) {
    push( @manuscripts, $aligner->read_source( $str ) );
  }
  $aligner->align( @manuscripts );

  # Read from files.  Also works for XML::LibXML::Document objects.
  @manuscripts = ();
  foreach my $xml_file ( @TEI_files_to_collate ) {
    push( @manuscripts, $aligner->read_source( $xml_file ) );
  }
  $aligner->align( @manuscripts );

  # Read from a JSON input.
  @manuscripts = $aligner->read_source( $JSON_string );
  $aligner->align( @manuscripts );
  

DESCRIPTION

Text::TEI::Collate is an early-stage collation program for multiple (transcribed) manuscript copies of a known text. Its interface is object-oriented, mostly for the convenience of the author and so that settings can be held globally on the object.

The object is the alignment engine, or "aligner". The methods that a user will care about are "read_source" and "align", as well as the various output methods; the other methods in this file are public in case a user needs a subset of this package's functionality.

An aligner takes two or more texts; the texts can be strings, filenames, or XML::LibXML::Document objects. It returns two or more Manuscript objects -- one for each text input -- in which identical and similar words are lined up with each other, via empty-string padding.
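To illustrate what "lined up via empty-string padding" means, here is a minimal sketch. It assumes plain arrays of strings as a stand-in for the Manuscript and Word objects, and the witness sigils and words are invented for the example; it does not call the module.

```perl
use strict;
use warnings;

# Hypothetical aligned output for two witnesses. Every row has the same
# number of columns, and '' marks a position where a witness has no word.
my %aligned = (
    A => [ 'the', 'quick', 'brown', 'fox' ],
    B => [ 'the', '',      'brown', 'fox' ],
);

# Print each witness column by column, showing gaps as '-'.
foreach my $sigil ( sort keys %aligned ) {
    my @cells = map { length($_) ? $_ : '-' } @{ $aligned{$sigil} };
    printf "%s: %s\n", $sigil, join( ' | ', @cells );
}
```

Because every row is padded to the same length, column N of one witness can be compared directly with column N of any other.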

Please see the documentation for Text::TEI::Collate::Manuscript and Text::TEI::Collate::Word for more information about the manuscript and word objects.

METHODS

new

Creates a new aligner object. Takes a hash of options; the available options are listed below.

debug - Default 0. The higher the number (between 0 and 3), the more debugging output is produced.
distance_sub - A reference to a function that calculates a Levenshtein-like distance between two words. Default is Text::WagnerFischer::distance.
fuzziness - The maximum allowable word distance for an approximate match, expressed as a percentage: Levenshtein distance divided by word length.
canonizer - Takes a subroutine ref. The sub should take a string and return a string. If defined, it will be called to produce a canonical form of the string in question. Useful for getting rid of ligatures, un-composing characters, correcting common spelling mistakes, etc.
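
The fuzziness option can be read as a threshold on distance relative to word length. The following is a minimal sketch, assuming plain Levenshtein distance and assuming that "word length" means the length of the longer of the two words (the module may define it differently). The names levenshtein and is_fuzzy_match are illustrative and are not part of Text::TEI::Collate.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Plain Levenshtein distance: insert, delete and substitute each cost 1.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur,
              min( $prev[$j] + 1, $cur[ $j - 1 ] + 1, $prev[ $j - 1 ] + $cost );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Assumption: the distance is measured against the longer word's length.
sub is_fuzzy_match {
    my ( $w1, $w2, $fuzziness ) = @_;
    my $longest = max( length($w1), length($w2) );
    return 1 if $longest == 0;    # two empty strings match trivially
    my $percent = 100 * levenshtein( $w1, $w2 ) / $longest;
    return $percent <= $fuzziness ? 1 : 0;
}

print is_fuzzy_match( 'colour', 'color', 50 ) ? "match\n" : "no match\n";
```

A two-argument sub of this shape is the kind of function the distance_sub option describes; whether the module expects exactly this calling convention should be checked against its source.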

read_source

Pass in a word source (a plaintext file, a TEI XML file, or a JSON structure) and a set of options, and get back one or more manuscript objects that can be collated. Options include:

canonizer - reference to a subroutine that returns the canonized (e.g. spell-corrected) form of the original word.
comparator - reference to a subroutine that returns the normalized comparison string (e.g. all lowercase, no accents) for a word.
encoding - The encoding of the word source if we are reading from a file. Defaults to utf-8.
sigil - The sigil that should be assigned to this manuscript in the collation output. Should be a valid XML attribute value. This can also be read from a TEI XML source.
identifier - A string to identify this manuscript (e.g. library, MS number). Can also be read from a TEI <msdesc/> element.
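
As an illustration of the canonizer and comparator options, the sketch below defines two string-to-string subs using the core Unicode::Normalize module. The sub names canonize and comparison_form are hypothetical, and the normalization choices (NFKC for ligatures, stripping combining marks for accents) are one possible approach rather than what the module does by default.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC NFD);

# Canonizer sketch: NFKC folds compatibility characters, such as the
# U+FB01 "fi" ligature, into their plain equivalents.
sub canonize {
    my ($word) = @_;
    return NFKC($word);
}

# Comparator sketch: decompose, strip combining marks, then lowercase.
sub comparison_form {
    my ($word) = @_;
    my $decomposed = NFD($word);
    $decomposed =~ s/\p{Mn}+//g;
    return lc $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print canonize("\x{FB01}nis"),         "\n";
print comparison_form("\x{160}koda"), "\n";
```

References to these subs (\&canonize and \&comparison_form) could then be supplied as the canonizer and comparator options.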

align

The meat of the program. Takes a list of Text::TEI::Collate::Manuscript objects (created by read_source above). Returns the same objects with their wordlists collated.

OUTPUT METHODS

to_json

Takes a list of aligned manuscripts and returns a data structure suitable for JSON encoding; documented at http://gregor.middell.net/collatex/api/collate

to_tei

Takes a list of aligned Manuscript objects and returns a fairly simple TEI XML document in parallel segmentation format, with the words lexically marked as such. At the moment it returns a single paragraph, with the original div and paragraph breaks for each witness marked as a <witDetail/> in the apparatus.

to_graphml

Takes a list of aligned manuscript objects and returns a GraphML document that represents the collation as a variant graph. Words in the same location with the same canonized form are treated as the same node.
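
The node-merging rule described above can be sketched independently of the module. This example assumes aligned rows of already-canonized strings (with '' for gaps) in place of Manuscript objects; variant_nodes and the sample witnesses are invented for illustration, and the real method produces GraphML rather than a hash.

```perl
use strict;
use warnings;

# Hypothetical aligned rows of canonized forms; '' marks a gap.
my %rows = (
    A => [ 'the', 'quick', 'fox' ],
    B => [ 'the', 'swift', 'fox' ],
    C => [ 'the', '',      'fox' ],
);

# Build one node per distinct (column, form) pair and record which
# witnesses pass through it. Gaps produce no node.
sub variant_nodes {
    my (%aligned) = @_;
    my %nodes;
    for my $sigil ( sort keys %aligned ) {
        my @words = @{ $aligned{$sigil} };
        for my $col ( 0 .. $#words ) {
            next if $words[$col] eq '';
            push @{ $nodes{"$col:$words[$col]"} }, $sigil;
        }
    }
    return \%nodes;
}

my $nodes = variant_nodes(%rows);
for my $id ( sort keys %$nodes ) {
    printf "%s <- %s\n", $id, join( ',', @{ $nodes->{$id} } );
}
```

Witnesses that agree at a column share a node, while a variant reading (here "quick" against "swift") yields separate nodes at the same position.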

to_svg

Takes a list of aligned manuscript objects and returns an SVG representation of the variant graph, as described for the to_graphml method.

to_graph

Base method for graph-based output. Creates the (Graph::Easy) graph that is used to generate GraphML or SVG.

BUGS / TODO

  • Refactor the string matching; currently it's done twice

  • Proper documentation

AUTHOR

Tara L Andrews <aurum@cpan.org>