Joerg Tiedemann

NAME

Uplug::PreProcess::Tokenizer

SYNOPSIS

 use Uplug::PreProcess::Tokenizer;
 my $tokenizer = Uplug::PreProcess::Tokenizer->new( lang => 'en' );
 my @tokens = $tokenizer->tokenize( 'Mr. Smith says: "What is a text anyway?"' );
 my $text = $tokenizer->detokenize( '" Big improvement ! " says Mr. Smith .' );

IMPLEMENTS

tokenize

Tokenize a given text. Returns a list of tokens.

detokenize

De-tokenize a space-separated text or a list of tokens. Returns plain text.

load_prefixes

Load language-specific abbreviations and other non-breaking prefixes.
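To illustrate why non-breaking prefixes matter, here is a minimal, self-contained sketch. It is not the module's actual code: the prefix list and the function name sketch_tokenize are hypothetical and exist only for this example. The idea is that a trailing period is split off as its own token unless the word before it is a known abbreviation.

```perl
use strict;
use warnings;

# Hypothetical prefix list for illustration only; the real module
# loads language-specific prefixes itself.
my %nonbreaking = map { $_ => 1 } qw(Mr Mrs Dr Prof);

# Split a whitespace-separated text into tokens, detaching a trailing
# period unless the word before it is a known non-breaking prefix.
sub sketch_tokenize {
    my ($text) = @_;
    my @tokens;
    for my $word ( split ' ', $text ) {
        if ( $word =~ /^(.+)\.$/ && !$nonbreaking{$1} ) {
            push @tokens, $1, '.';
        }
        else {
            push @tokens, $word;
        }
    }
    return @tokens;
}

# prints: Mr. Smith met Dr. Jones .
print join( ' ', sketch_tokenize('Mr. Smith met Dr. Jones.') ), "\n";
```

Without the prefix list, "Mr." and "Dr." would each be split into two tokens, which is the behaviour the prefix data is meant to prevent.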

DESCRIPTION

This module relies heavily on the tokenizer and detokenizer implementations from the Moses toolkit for statistical machine translation (SMT). All credit goes to the original authors, Josh Schroeder and Philipp Koehn.