subalign - script for aligning the OpenSubtitlesXXXX corpora
subalign [OPTIONS] <srcdir> <trgdir>
srcdir and trgdir are the directories in the subtitle corpus that hold the source-language and target-language files. The script creates a corresponding subdirectory for the aligned data. For example:
subalign en/2001/209475 et/2001/209475
aligns the files in the English and Estonian subtitle collections for a movie from 2001 with the ID 209475. The resulting files will be created in en-et/2001/209475.
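The naming of the output directory in the example above can be sketched as follows. This is only an illustration, not the script's own code: the helper name aligned_dir is hypothetical, and it assumes that the first path component is the language ID and that the remaining components (year and movie ID) are identical on both sides.

```python
from pathlib import PurePosixPath


def aligned_dir(srcdir: str, trgdir: str) -> str:
    """Derive the output directory for a source/target directory pair.

    Assumption (not taken from the subalign source): the first path
    component is the language ID, the rest is <year>/<movie ID>.
    """
    src = PurePosixPath(srcdir).parts
    trg = PurePosixPath(trgdir).parts
    if len(src) < 2 or len(trg) < 2:
        raise ValueError("expected paths of the form <lang>/<year>/<id>")
    if src[1:] != trg[1:]:
        raise ValueError("source and target must refer to the same movie")
    # Join the two language IDs with a hyphen and keep the shared remainder.
    return str(PurePosixPath(f"{src[0]}-{trg[0]}", *src[1:]))


print(aligned_dir("en/2001/209475", "et/2001/209475"))  # en-et/2001/209475
```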
Command line arguments for subalign:
-A ................... store alternative alignments in outdir/alt
-a accept-threshold .. accept alternative subtitle pairs > score (default=0.75)
-D duration-thr ...... min duration similarity (default=0.8)
-M max ............... max nr of subtitle file pairs to try
-L ................... skip symbolic links (when looking for files)
-x score-threshold ... threshold for overlap + metascore (before aligning, default = 0.2)
Command line arguments related to srt-alignment:
-S source-lang . source language ID
-T target-lang . target language ID
-c score ....... use cognates with LCSR>=score
-r score-range . use cognates in a certain range 1..score and take best
-l length ...... set minimal length of cognates (if used)
-i len ......... use identical strings with length>=len
-w size ........ set size for sliding window
-d dic ......... use dictionary in file 'dic'
-u ............. cognates/identicals that start with upper case only
-r char_set .... define a set of characters to be used for matching
-q ............. normalize length scores with (current) word frequencies
-b ............. use "best" alignment (least empty alignments)
-p nr .......... stop after <nr> candidates (when using -b)
-m MAX ......... in "best" alignment: use only MAX first & MAX last (default = 10; 0 = all)
-f uplug-conf .. use fallback aligner if necessary
-v ............. verbose output
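The -c and -l options refer to cognates selected by LCSR, the longest common subsequence ratio. As a rough illustration, LCSR is commonly defined as the length of the longest common subsequence of two strings divided by the length of the longer string. The sketch below follows that textbook definition; the function names and the exact way the threshold and minimal length are applied are assumptions and may differ from what the aligner does internally (for example with respect to case folding or the character set).

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length / length of the longer string."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def is_cognate(a: str, b: str, score: float, min_length: int = 1) -> bool:
    """Hypothetical check in the spirit of '-c score' and '-l length'."""
    if min(len(a), len(b)) < min_length:
        return False
    return lcsr(a, b) >= score


print(round(lcsr("colour", "color"), 3))      # 0.833
print(is_cognate("colour", "color", 0.8, 5))  # True
```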
subalign looks at pairs of movie subtitle files in the OpenSubtitles corpora and, among all alternative subtitle files, tries to find the pair whose alignment contains the fewest empty translation units. It ranks candidate pairs with a score that combines the proportion of non-empty alignment units with a score based on metadata. The latter requires meta-information stored in the subtitle files (which is available from OpenSubtitles2016).
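The selection described above can be sketched roughly as follows. The text does not say how the two scores are combined, so the plain sum used here is an assumption, as are the data layout (an alignment as a list of source/target ID lists), the function names, and the candidate file names.

```python
from typing import Dict, List, Sequence, Tuple

# One alignment unit: (source sentence IDs, target sentence IDs).
# A unit counts as empty if either side has no sentences.
Link = Tuple[Sequence[str], Sequence[str]]


def nonempty_ratio(links: List[Link]) -> float:
    """Proportion of alignment units with sentences on both sides."""
    if not links:
        return 0.0
    nonempty = sum(1 for src, trg in links if src and trg)
    return nonempty / len(links)


def combined_score(links: List[Link], meta_score: float) -> float:
    """Assumed combination: simple sum of both scores (not confirmed by the source)."""
    return nonempty_ratio(links) + meta_score


def best_pair(candidates: Dict[str, Tuple[List[Link], float]]) -> str:
    """Return the name of the candidate pair with the highest combined score."""
    return max(candidates, key=lambda name: combined_score(*candidates[name]))


# Hypothetical candidates: alignment links plus a metadata-based score.
candidates = {
    "a": ([(["s1"], ["t1"]), (["s2"], [])], 0.5),      # 0.5 + 0.5
    "b": ([(["s1"], ["t1"]), (["s2"], ["t2"])], 0.3),  # 1.0 + 0.3
}
print(best_pair(candidates))  # b
```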
To install Text::SRT::Align, copy and paste the appropriate command into your terminal.
cpanm:
cpanm Text::SRT::Align
CPAN shell:
perl -MCPAN -e shell
install Text::SRT::Align