Text::SRT::Align - sentence alignment for movie subtitles based on time overlaps
    use Text::SRT::Align qw/:all/;

    # source and target language files in XML format
    # (use srt2xml to convert srt to XML)
    my $srcfile = "source-language-file.xml";
    my $trgfile = "target-language-file.xml";

    my @alignments = ();

    # find alignments between sentences in source and target
    # print the result in XCES Align format
    my $score = align( $srcfile, $trgfile, \@alignments );
    print_ces( $srcfile, $trgfile, \@alignments );

    # run a new alignment
    # - use a cognate filter to find synchronization points
    # - cognate filter uses string similarity threshold 0.8
    # - use 'best-align' mode: find the best synchronization points
    $score = align( $srcfile, $trgfile, \@alignments,
                    USE_COGNATES => 0.8,
                    BEST_ALIGN   => 1 );
Text::SRT::Align links the source and target sentences that have the largest time overlap. The XML files to be aligned must therefore contain time information. Use srt2xml to convert movie subtitle files from *.srt format to XML.
$score = align( $srcfile, $trgfile, \@alignments [,%options] )
Alignments are returned in the @alignments array. Possible options are:
    # verbose output
    VERBOSE => 1

    # specify hard boundaries
    HARD_BOUNDARES => $boundaries

    # use 'best-align' mode:
    # - find lexical matches in the beginning and at the end of each movie
    # - test synchronization with all possible pairs of matched lexical items
    # - use the one that gives the highest score
    #   (proportion between non-empty and empty links)
    BEST_ALIGN => 1

    # lexical matching for synchronization
    # - window size for finding matches (beginning and end in number of sentences)
    # - number of matches to be used in best-align mode
    WINDOW      => $window_size
    MAX_MATCHES => $nr_matches

    # use a bilingual dictionary for finding possible translation equivalents
    USE_DICTIONARY => $dictionary_file

    # match strings (possible cognates)
    # - identical tokens
    # - minimum length of tokens to be matched
    # - words starting with upper-case letters only (named entities?)
    # - word frequency heuristics (prefer rare identical words)
    # - character set restrictions (for example, only alphabetic letters: \p{L})
    USE_IDENTICAL => 1
    TOK_LEN       => $minimum_token_length
    UPPER_CASE    => 1
    USE_WORDFREQ  => 1
    CHAR_SET      => $character_set

    # string-similarity-based matching using LCSR scores
    # - define matching threshold
    # - minimum token length
    # - use cognates with matching scores 1..$sim_score in combination with best_align
    USE_COGNATES  => $cognate_threshold
    MINLENGTH     => $minimum_token_length
    COGNATE_RANGE => $sim_score
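The USE_COGNATES option is described above as string-similarity matching based on LCSR scores. LCSR (longest common subsequence ratio) is commonly defined as the length of the longest common subsequence of two strings divided by the length of the longer string. The following is a minimal sketch of that definition only; the function names lcs_length and lcsr are hypothetical and are not part of the Text::SRT::Align API, and the module's own scoring may differ in detail (for example in case handling or tokenization).

```perl
use strict;
use warnings;
use List::Util qw(max);

# Length of the longest common subsequence of two strings,
# computed row by row with dynamic programming.
sub lcs_length {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0) x (scalar(@t) + 1);
    for my $i (1 .. scalar @s) {
        my @curr = (0);
        for my $j (1 .. scalar @t) {
            $curr[$j] = $s[$i - 1] eq $t[$j - 1]
                ? $prev[$j - 1] + 1
                : max($prev[$j], $curr[$j - 1]);
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# LCSR = LCS length / length of the longer string (0 for two empty strings).
sub lcsr {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 0 unless $longest;
    return lcs_length($s, $t) / $longest;
}

# "director" and "direktor" share the subsequence "diretor" (7 of 8 characters)
printf "%.3f\n", lcsr("director", "direktor");   # prints 0.875
```

Under this definition, a threshold of 0.8 as used in the synopsis would accept the pair director/direktor but reject pairs that share fewer than four fifths of the longer token.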
initialize_dictionary( $srclang, $trglang )
Load the provided dictionary if it exists for the given language pair. Return 1 if it exists and could be loaded. Return 0 otherwise.
load_lexicon( $dicfile[, $inverse])
Load a lexicon from $dicfile. If the optional $inverse argument is set, the dictionary is inverted (source and target language are swapped).
print_ces( $srcfile, $trgfile, \@alignments )
Print the sentence alignments in XCES Align format.
@newtimeframes = sort_time_frames( \@oldtimeframes )
Sort time frames by their starting time. (This is necessary because some subtitles do not list the frames in chronological order.)
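As an illustration of sorting by starting time, here is a small sketch. The data layout is an assumption made for the example: each time frame is a hash reference with numeric start and end keys (in seconds). The internal representation used by sort_time_frames is not documented here and may be different, and sort_frames_by_start is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed layout (hypothetical): { start => seconds, end => seconds }
sub sort_frames_by_start {
    my ($frames) = @_;
    # order by start time; break ties by end time
    my @sorted = sort {
        $a->{start} <=> $b->{start} || $a->{end} <=> $b->{end}
    } @$frames;
    return @sorted;
}

my @frames = (
    { start => 12.5, end => 14.0 },
    { start => 3.0,  end => 5.2  },
    { start => 7.1,  end => 9.8  },
);

my @sorted = sort_frames_by_start(\@frames);
print join(' ', map { $_->{start} } @sorted), "\n";
```

The input array is left untouched and a new list is returned, which mirrors the @newtimeframes = sort_time_frames( \@oldtimeframes ) calling convention shown above.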
time_overlap( \@srcdata, \@trgdata )
Compute the proportion of time overlap between two sets of subtitles. Returns overlap-ratio = common-time / ( common-time + different-time )
This is similar to time_overlap_ratio but uses the time frames from subtitle data structures that may be synchronized using lexical anchors.
time_overlap_ratio( \@timeframes1, \@timeframes2 )
Compute the proportion of time overlap between two sets of time frames. Returns overlap-ratio = common-time / ( common-time + different-time )
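The ratio above can be sketched as follows. This is not the module's implementation; it rests on two assumptions made for the example: time frames are given as [start, end] pairs in seconds, and "different-time" is read as the time covered by exactly one of the two sets. The names merge_frames and overlap_ratio_sketch are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Merge overlapping [start, end] pairs so each set covers any moment only once.
sub merge_frames {
    my @sorted = sort { $a->[0] <=> $b->[0] } @_;
    my @merged;
    for my $f (@sorted) {
        if (@merged && $f->[0] <= $merged[-1][1]) {
            $merged[-1][1] = max($merged[-1][1], $f->[1]);
        }
        else {
            push @merged, [@$f];
        }
    }
    return @merged;
}

# overlap-ratio = common-time / ( common-time + different-time )
sub overlap_ratio_sketch {
    my ($frames1, $frames2) = @_;
    my @src = merge_frames(@$frames1);
    my @trg = merge_frames(@$frames2);

    # walk through both sorted lists and add up the intersections
    my ($i, $j, $common) = (0, 0, 0);
    while ($i < @src && $j < @trg) {
        my $lo = max($src[$i][0], $trg[$j][0]);
        my $hi = min($src[$i][1], $trg[$j][1]);
        $common += $hi - $lo if $hi > $lo;
        # advance the frame that ends first
        if ($src[$i][1] < $trg[$j][1]) { $i++ } else { $j++ }
    }

    # time covered by exactly one set = total covered time minus both copies
    # of the common time
    my $covered   = sum(0, map { $_->[1] - $_->[0] } @src, @trg);
    my $different = $covered - 2 * $common;
    my $total     = $common + $different;
    return $total > 0 ? $common / $total : 0;
}

# common time: 2s (2..4) + 1s (10..11) = 3s; different time: 5s
print overlap_ratio_sketch([[0, 4], [10, 12]], [[2, 6], [10, 11]]), "\n";   # prints 0.375
```

With this reading, two identical sets of time frames give a ratio of 1 and two sets that never overlap give 0.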
read_time_frames( $xmlfile, \@timeframes )
Read through a subtitle file and return all time frames in @timeframes.
Jörg Tiedemann, https://bitbucket.org/tiedemann
Add length-based option (using time slot length correlations) also in combination with character length.
Please report any bugs or feature requests to https://bitbucket.org/tiedemann/subalign.
Copyright 2013 Jörg Tiedemann.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.