NAME

Text::SRT::Align - sentence alignment for movie subtitles based on time overlaps

SYNOPSIS

 use Text::SRT::Align qw/:all/;

 # source and target language files in XML format
 # (use srt2xml to convert srt to XML)

 my $srcfile    = "source-language-file.xml";
 my $trgfile    = "target-language-file.xml";
 my @alignments = ();

 # find alignments between sentences in source and target
 # print the result in XCES Align format

 my $score = align( $srcfile, $trgfile, \@alignments );
 print_ces( $srcfile, $trgfile, \@alignments );

 # run a new alignment
 # - use a cognate filter to find synchronization points
 # - cognate filter uses string similarity threshold 0.8
 # - use 'best-align' mode: find the best synchronization points

 $score = align( $srcfile, $trgfile, \@alignments,
                 USE_COGNATES => 0.8,
                 BEST_ALIGN   => 1 );

DESCRIPTION

Text::SRT::Align aligns sentences with the largest time overlap. Time information has to be available in the XML files to be aligned. Use srt2xml to convert movie subtitle files from *.srt format to XML.
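
The idea of linking sentences by their largest time overlap can be illustrated with a minimal sketch. It is not the module's implementation: the frame layout ([ start, end ] in seconds) and the function names frame_overlap and best_overlap_links are assumptions made for this example, and the real aligner also handles synchronization and many-to-many links.

```perl
use strict;
use warnings;

# Hypothetical data layout: each sentence is [ start_seconds, end_seconds ].

# Length of the time span shared by two frames (0 if they do not overlap).
sub frame_overlap {
    my ( $x, $y ) = @_;
    my $start = $x->[0] > $y->[0] ? $x->[0] : $y->[0];
    my $end   = $x->[1] < $y->[1] ? $x->[1] : $y->[1];
    return $end > $start ? $end - $start : 0;
}

# For every source frame, return the index of the target frame with the
# largest overlap, or undef when nothing overlaps (an empty link).
sub best_overlap_links {
    my ( $src, $trg ) = @_;
    my @links;
    for my $s (@$src) {
        my ( $best, $best_len ) = ( undef, 0 );
        for my $i ( 0 .. $#$trg ) {
            my $len = frame_overlap( $s, $trg->[$i] );
            ( $best, $best_len ) = ( $i, $len ) if $len > $best_len;
        }
        push @links, $best;
    }
    return @links;
}

my @src   = ( [ 1.0, 3.0 ], [ 4.0, 6.0 ], [ 10.0, 11.0 ] );
my @trg   = ( [ 0.5, 2.5 ], [ 2.8, 5.9 ] );
my @links = best_overlap_links( \@src, \@trg );

# prints "0,1,-": the third source sentence has no overlapping target
print join( ",", map { defined $_ ? $_ : "-" } @links ), "\n";
```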

Exported Functions

$score = align( $srcfile, $trgfile, \@alignments [,%options] )

Alignments are returned in the @alignments array. Possible options are:

  # verbose output
  VERBOSE => 1

  # specify hard boundaries
  HARD_BOUNDARES => $boundaries

  # use 'best-align' mode:
  # - find lexical matches in the beginning and at the end of each movie
  # - test synchronization with all possible pairs of matched lexical items
  # - use the one that gives the highest score
  #   (proportion between non-empty and empty links)
  BEST_ALIGN => 1

  # lexical matching for synchronization
  # - window size for finding matches (beginning and end in number of sentences)
  # - number of matches to be used in best-align mode
  WINDOW => $window_size
  MAX_MATCHES => $nr_matches

  # use a bilingual dictionary for finding possible translation equivalents
  USE_DICTIONARY => $dictionary_file

  # match strings (possible cognates)
  # - identical tokens
  # - minimum length of tokens to be matched
  # - words starting with upper-case letters only (named entities?)
  # - word frequency heuristics (prefer rare identical words)
  # - character set restrictions (for example, only alphabetic letters: \p{L})
  USE_IDENTICAL => 1
  TOK_LEN => $minimum_token_length
  UPPER_CASE => 1
  USE_WORDFREQ => 1
  CHAR_SET => $character_set

  # string-similarity-based matching using LCSR scores
  # - define matching threshold
  # - minimum token length
  # - use cognates with matching scores 1..$sim_score in combination with BEST_ALIGN
  USE_COGNATES => $cognate_threshold
  MINLENGTH => $minimum_token_length
  COGNATE_RANGE => $sim_score
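
The string-similarity options above refer to LCSR (longest common subsequence ratio) scores. The following is a minimal sketch of that measure, assuming the common definition LCSR = LCS length / length of the longer string. The function names lcs_length, lcsr and is_cognate are hypothetical and not part of this module's interface; the module's own scoring may differ in detail.

```perl
use strict;
use warnings;

# Length of the longest common subsequence of two strings (dynamic programming).
# Strings are compared character by character as Perl sees them, so decode
# non-ASCII input before calling this.
sub lcs_length {
    my ( $s, $t ) = @_;
    my @s    = split //, $s;
    my @t    = split //, $t;
    my @prev = (0) x ( @t + 1 );
    for my $i ( 1 .. @s ) {
        my @curr = (0);
        for my $j ( 1 .. @t ) {
            $curr[$j] = $s[ $i - 1 ] eq $t[ $j - 1 ]
                ? $prev[ $j - 1 ] + 1
                : ( $prev[$j] > $curr[ $j - 1 ] ? $prev[$j] : $curr[ $j - 1 ] );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# LCSR: LCS length divided by the length of the longer string.
sub lcsr {
    my ( $s, $t ) = @_;
    my $max = length($s) > length($t) ? length($s) : length($t);
    return 0 unless $max;
    return lcs_length( $s, $t ) / $max;
}

# Hypothetical filter: accept a token pair as a cognate candidate when both
# tokens are long enough and their LCSR reaches the threshold.
sub is_cognate {
    my ( $s, $t, $threshold, $minlength ) = @_;
    return 0 if length($s) < $minlength || length($t) < $minlength;
    return lcsr( $s, $t ) >= $threshold ? 1 : 0;
}

printf "%.2f\n", lcsr( "president", "presidente" );    # prints 0.90
print is_cognate( "colour", "color", 0.8, 5 ), "\n";   # prints 1
```

With a threshold of 0.8, as in the SYNOPSIS, "colour" and "color" (LCSR 5/6) would count as a match in this sketch.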

initialize_dictionary( $srclang, $trglang )

Load the provided dictionary if it exists for the given language pair. Return 1 if it exists and could be loaded. Return 0 otherwise.

load_lexicon( $dicfile[, $inverse])

Load a lexicon from $dicfile. If the optional $inverse argument is set, the dictionary is inverted (source and target language are swapped).

print_ces( $srcfile, $trgfile, \@alignments )

Print the sentence alignments in XCES Align format.

@newtimeframes = sort_time_frames( \@oldtimeframes )

Sort time frames by their starting time. (This is necessary because some subtitles do not list the frames in chronological order.)
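
A minimal sketch of such a sort, assuming a hypothetical frame layout of [ start_seconds, end_seconds, text ]. The module's internal data structure is not documented here and may differ, and sort_frames_by_start is a name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical frame layout: [ start_seconds, end_seconds, text ].
# Sort by start time; ties are broken by end time.
sub sort_frames_by_start {
    my ($frames) = @_;
    return sort { $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] } @$frames;
}

my @old = (
    [ 12.5, 14.0, "second" ],
    [ 3.0,  5.2,  "first" ],
    [ 20.1, 22.0, "third" ],
);
my @new = sort_frames_by_start( \@old );

# prints "first second third"
print join( " ", map { $_->[2] } @new ), "\n";
```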

time_overlap( \@srcdata, \@trgdata )

Compute the proportion of time overlap between two sets of subtitles. Returns overlap-ratio = common-time / ( common-time + different-time )

This is similar to time_overlap_ratio but uses the time frames from subtitle data structures that may be synchronized using lexical anchors.

time_overlap_ratio( \@timeframes1, \@timeframes2 )

Compute the proportion of time overlap between two sets of time frames. Returns overlap-ratio = common-time / ( common-time + different-time )
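
The ratio can be sketched as follows. This is an illustration under stated assumptions, not the module's code: frames are [ start, end ] pairs in seconds, each set is sorted and free of internal overlaps, and "different-time" is read as the time covered by only one of the two sets. The function names are hypothetical.

```perl
use strict;
use warnings;

# Total duration covered by a set of frames.
sub total_time {
    my ($frames) = @_;
    my $sum = 0;
    $sum += $_->[1] - $_->[0] for @$frames;
    return $sum;
}

# Time covered by both sets, found by walking the two sorted lists in parallel.
sub common_time {
    my ( $f1, $f2 ) = @_;
    my ( $i, $j, $common ) = ( 0, 0, 0 );
    while ( $i < @$f1 && $j < @$f2 ) {
        my $start = $f1->[$i][0] > $f2->[$j][0] ? $f1->[$i][0] : $f2->[$j][0];
        my $end   = $f1->[$i][1] < $f2->[$j][1] ? $f1->[$i][1] : $f2->[$j][1];
        $common += $end - $start if $end > $start;

        # advance whichever frame ends first
        if ( $f1->[$i][1] < $f2->[$j][1] ) { $i++ } else { $j++ }
    }
    return $common;
}

# overlap-ratio = common-time / ( common-time + different-time )
sub overlap_ratio_sketch {
    my ( $f1, $f2 ) = @_;
    my $common    = common_time( $f1, $f2 );
    my $different = total_time($f1) + total_time($f2) - 2 * $common;
    my $total     = $common + $different;
    return $total > 0 ? $common / $total : 0;
}

my @frames1 = ( [ 0, 4 ], [ 6, 10 ] );
my @frames2 = ( [ 2, 8 ] );

# common-time = 4, different-time = 6, so this prints 0.4
print overlap_ratio_sketch( \@frames1, \@frames2 ), "\n";
```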

read_time_frames( $xmlfile, \@timeframes )

Read through a subtitle file and return all time frames.

AUTHOR

Jörg Tiedemann, https://bitbucket.org/tiedemann

TODO

Add length-based option (using time slot length correlations) also in combination with character length.

BUGS AND SUPPORT

Please report any bugs or feature requests to https://bitbucket.org/tiedemann/subalign.

SEE ALSO

LICENSE AND COPYRIGHT

Copyright 2013 Jörg Tiedemann.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
