NAME

Text::Perfide::PartialAlign - Split large bitexts into smaller files.

VERSION

Version 0.01_03

SYNOPSIS

Perhaps a little code snippet.

    use Text::Perfide::PartialAlign;

    my $foo = Text::Perfide::PartialAlign->new();
    ...

EXPORT

A list of functions that can be exported. You can delete this section if you don't export anything, such as for a purely object-oriented module.

SUBROUTINES/METHODS

build_chain

calc_common_tokens

calc_pairs

subcorpora2files

Writes subcorpora to files.

usage

Prints a short description and usage details.

tokenFreq

Receives an array of lines of a text (each line is an array of words). Calculates the frequency of each word.
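As an illustration only, the counting step could look like the sketch below. It assumes the input is a reference to an array of lines, each line being a reference to an array of words; the name token_freq_sketch is hypothetical and is not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: counts how often each
# word occurs across all lines.
sub token_freq_sketch {
    my ($lines) = @_;    # array ref of array refs of words
    my %freq;
    for my $line (@$lines) {
        $freq{$_}++ for @$line;
    }
    return \%freq;
}

my $freq = token_freq_sketch([ [qw(the cat sat)], [qw(the dog)] ]);
```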

hapaxes

Receives a hash mapping token => freq. Returns a hash containing only the elements with freq == 1.
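A minimal sketch of this filter, assuming the hash is passed and returned as a flat list; hapaxes_sketch is a hypothetical name used for illustration, not the module's function.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: keeps only the tokens
# whose frequency is exactly 1.
sub hapaxes_sketch {
    my (%freq) = @_;
    return map { ($_ => $freq{$_}) } grep { $freq{$_} == 1 } keys %freq;
}

my %hap = hapaxes_sketch(the => 2, cat => 1, dog => 1);
```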

hapaxPositions

Builds a hash with term => positions, where position is the number of the sentence in which the term occurs.
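One possible way to build such a hash is sketched below. It assumes 0-based sentence numbers, a corpus given as an array of arrays, and a hash reference of hapaxes; hapax_positions_sketch is a hypothetical name, and the module may number sentences differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: records the sentence
# index (0-based, an assumption) at which each hapax occurs.
sub hapax_positions_sketch {
    my ($lines, $hapaxes) = @_;
    my %pos;
    for my $i (0 .. $#$lines) {
        for my $word (@{ $lines->[$i] }) {
            $pos{$word} = $i if exists $hapaxes->{$word};
        }
    }
    return \%pos;
}

my $pos = hapax_positions_sketch(
    [ [qw(the cat)], [qw(the dog)] ],
    { cat => 1, dog => 1 },
);
```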

bagSort

...

uniqSort

Sorts an array of pairs and removes duplicated pairs.
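A sketch of the idea, assuming each pair is a reference to a two-element array of numbers and that pairs are ordered by first coordinate, then by second. Both the ordering and the name uniq_sort_sketch are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: drops duplicated pairs,
# then sorts by first coordinate and, on ties, by second coordinate.
sub uniq_sort_sketch {
    my (@pairs) = @_;
    my %seen;
    my @unique = grep { !$seen{"$_->[0],$_->[1]"}++ } @pairs;
    return sort { $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] } @unique;
}

my @sorted = uniq_sort_sketch([3, 4], [1, 2], [3, 4], [1, 1]);
```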

less

Receives two pairs. Checks if both coordinates of the first pair are lower than those of the second pair.

less_relaxed

Receives two pairs...

less_or_equal

Receives two pairs. Checks if both coordinates of the first pair are lower than or equal to those of the second pair.
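The strict and non-strict comparisons (less and less_or_equal) can be sketched as follows, assuming pairs are references to two-element arrays. The _sketch names are hypothetical and only illustrate the descriptions above.

```perl
use strict;
use warnings;

# Hypothetical helpers, not the module's code.

# True if both coordinates of $p are strictly lower than those of $q.
sub less_sketch {
    my ($p, $q) = @_;
    return $p->[0] < $q->[0] && $p->[1] < $q->[1];
}

# True if both coordinates of $p are lower than or equal to those of $q.
sub less_or_equal_sketch {
    my ($p, $q) = @_;
    return $p->[0] <= $q->[0] && $p->[1] <= $q->[1];
}
```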

maximalChain

Receives an array of pairs. Using dynamic programming, selects the maximal chain.
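The sketch below shows one common dynamic programming formulation of this problem: a quadratic-time search for the longest sequence of pairs in which each pair is strictly lower than the next in both coordinates. Whether the module uses this exact ordering relation or recurrence is an assumption, and maximal_chain_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: longest chain of pairs
# that increases strictly in both coordinates, found in O(n^2) time.
sub maximal_chain_sketch {
    my (@pairs) = @_;
    return () unless @pairs;

    # Sort so that any possible predecessor comes before its successor.
    my @p = sort { $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] } @pairs;

    my (@len, @prev);
    for my $i (0 .. $#p) {
        $len[$i]  = 1;     # best chain length ending at pair $i
        $prev[$i] = -1;    # index of the previous pair in that chain
        for my $j (0 .. $i - 1) {
            next unless $p[$j][0] < $p[$i][0] && $p[$j][1] < $p[$i][1];
            if ($len[$j] + 1 > $len[$i]) {
                $len[$i]  = $len[$j] + 1;
                $prev[$i] = $j;
            }
        }
    }

    # Pick the end of the longest chain, then walk backwards.
    my $best = 0;
    for my $i (1 .. $#p) {
        $best = $i if $len[$i] > $len[$best];
    }
    my @chain;
    for (my $i = $best; $i >= 0; $i = $prev[$i]) {
        unshift @chain, $p[$i];
    }
    return @chain;
}

my @chain = maximal_chain_sketch([2, 5], [1, 1], [4, 3], [3, 2], [5, 6]);
```

In this example the pair (2,5) is left out, because keeping it would block (3,2) and (4,3) and yield a shorter chain.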

findCommonHap

Finds unique terms common to both corpora. The notion of equality can be extended with two lists of correspondences.

findCommonHap($l1Hap,$l2Hap)

Returns a reference to a hash containing the elements common to the hashes pointed to by the references $l1Hap and $l2Hap.

findCommonHap($l1Hap,$l2Hap,$l1_to_l2,$l2_to_l1)

$l1_to_l2 and $l2_to_l1 are references to hashes containing correspondences between words in language1 and language2 and vice-versa.
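A rough sketch of how both call forms might behave. The shape of the returned hash is not documented here, so mapping each matched language1 term to its language2 counterpart is an assumption, as is the name find_common_hap_sketch.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the module's code. Assumption: the result
# maps each matched language1 term to its language2 counterpart.
sub find_common_hap_sketch {
    my ($l1Hap, $l2Hap, $l1_to_l2, $l2_to_l1) = @_;
    $l1_to_l2 ||= {};    # correspondences are optional
    $l2_to_l1 ||= {};
    my %common;

    for my $t1 (keys %$l1Hap) {
        if (exists $l2Hap->{$t1}) {
            $common{$t1} = $t1;                  # identical term
        }
        elsif (defined $l1_to_l2->{$t1}
            && exists $l2Hap->{ $l1_to_l2->{$t1} }) {
            $common{$t1} = $l1_to_l2->{$t1};     # matched via l1 -> l2
        }
    }
    for my $t2 (keys %$l2Hap) {
        my $t1 = $l2_to_l1->{$t2};
        next unless defined $t1 && exists $l1Hap->{$t1};
        $common{$t1} //= $t2;                    # matched via l2 -> l1
    }
    return \%common;
}

my $common = find_common_hap_sketch(
    { Paris => 1, dog => 1, house => 1 },
    { Paris => 1, cao => 1, gato => 1, casa => 1 },
    { dog  => 'cao' },
    { casa => 'house' },
);
```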

selectFromChain

Selects a chain, trying to obey the maximalChunkSize constraint.

get_corpus

Given a file name, splits the segments and words into an array of arrays.

Returns: a reference to the array of arrays, a reference to an array of pairs with the start and end offsets of each segment, and a reference to the full text.

strInterval

Given a corpus and start and end positions, returns a string with the contents within the given range.

strInterval($corpus,$first,$last)

Concatenates all the words in the lines comprised in the $first..$last-1 range from corpus.
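The three-argument form can be pictured with the sketch below. Joining the words with a single space is an assumption, and str_interval_sketch is a hypothetical name rather than the module's function.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: joins the words of lines
# $first .. $last-1 of the corpus (an array of arrays of words).
# Assumption: words are separated by a single space.
sub str_interval_sketch {
    my ($corpus, $first, $last) = @_;
    return join ' ', map { @$_ } @{$corpus}[ $first .. $last - 1 ];
}

my $text = str_interval_sketch([ [qw(a b)], [qw(c)], [qw(d e)] ], 0, 2);
```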

strInterval($corpus,$first,$last,$offsets);

Retrieves from the original text the substring from the beginning of the segment $first to the end of the segment $last.

parseCorrespFile

Parses a given file with correspondences between two given languages. The file must follow this DSL:

    file           : header correspondence*
    header         : 'langs:' L1, L2
    correspondence : term (',' term)* '=' term (',' term)*
    term           : word (\s word)*

Does not yet support multi-word terms nor multi-term correspondences!
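To make the supported subset concrete, here is a sketch of a parser for single-word, single-term correspondences. It works on a list of lines rather than a file name, assumes one header or correspondence per line, and returns the two language names plus two lookup hashes; all of these choices, and the name parse_corresp_sketch, are assumptions rather than the module's behaviour.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the module's code. Handles only the
# single-word subset: a 'langs: L1, L2' header and 'word = word' lines.
sub parse_corresp_sketch {
    my (@lines) = @_;
    my ($l1, $l2, %l1_to_l2, %l2_to_l1);
    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip blank lines
        if ($line =~ /^\s*langs:\s*(\w+)\s*,\s*(\w+)\s*$/) {
            ($l1, $l2) = ($1, $2);
        }
        elsif ($line =~ /^\s*([^\s=,]+)\s*=\s*([^\s=,]+)\s*$/) {
            $l1_to_l2{$1} = $2;
            $l2_to_l1{$2} = $1;
        }
    }
    return ($l1, $l2, \%l1_to_l2, \%l2_to_l1);
}

my ($lang1, $lang2, $fwd, $bwd) =
    parse_corresp_sketch('langs: en, pt', 'dog = cao', 'cat=gato');
```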

seg_split

token_split

AUTHOR

Andre Santos, <andrefs at cpan.org>

BUGS

Please report any bugs or feature requests to bug-text-perfide-partialalign at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Perfide-PartialAlign. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Text::Perfide::PartialAlign

ACKNOWLEDGEMENTS

Based on the original script partialAlign.py bundled with hunalign -- http://mokk.bme.hu/resources/hunalign/ .

Thanks to Daniel Varga for helping us to understand how partialAlign.py works.

LICENSE AND COPYRIGHT

Copyright 2012 Andre Santos.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.