
NAME

Treex::Block::W2A::EN::FixTokenization - fix some issues in the output of the tokenizer

VERSION

version 2.20151102

DESCRIPTION

This block merges some abbreviations that contain periods into one token. For example, "e. g." is a single token in the Penn Treebank (with tag FW). Using only Treex::Block::W2A::EN::Tokenize, we get four tokens (e . g .), which the parser may distribute into different clauses. This is hard to fix afterwards, so the tokens are merged before parsing.
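The merging idea can be illustrated with a small standalone sketch that works on a plain list of token strings. It does not use the Treex API and is not the block's actual implementation: the function name merge_abbreviations and the abbreviation list are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical abbreviation list; the list used by the real block is not
# given in this documentation.
my @ABBREVIATIONS = ('e.g.', 'i.e.');

# Merge token runs such as ('e', '.', 'g', '.') back into one token 'e.g.'.
# Takes a list of token strings and returns a new list.
sub merge_abbreviations {
    my @tokens = @_;
    my @merged;
    my $i = 0;

    TOKEN:
    while ( $i < @tokens ) {
        for my $abbr (@ABBREVIATIONS) {

            # Split the abbreviation the way a period-splitting tokenizer
            # would: 'e.g.' becomes ('e', '.', 'g', '.').
            my @parts = $abbr =~ /([^.]+|\.)/g;
            next if $i + @parts > @tokens;

            # Compare the next tokens with the parts, ignoring case.
            my @window = @tokens[ $i .. $i + $#parts ];
            if ( lc( join ' ', @window ) eq lc( join ' ', @parts ) ) {
                push @merged, join '', @window;
                $i += @parts;
                next TOKEN;
            }
        }

        # No abbreviation starts here; keep the token unchanged.
        push @merged, $tokens[ $i++ ];
    }
    return @merged;
}

my @fixed = merge_abbreviations( 'Use', 'fruit', ',', 'e', '.', 'g', '.', 'apples', '.' );
print join( ' ', @fixed ), "\n";    # prints "Use fruit , e.g. apples ."
```

Note that only the periods belonging to a listed abbreviation are merged; the sentence-final period stays a separate token. The real block operates on a-tree nodes rather than on strings, so it also has to remove the merged nodes from the tree.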

OVERRIDDEN METHODS

from Treex::Core::Block

process_atree

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2009 - 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.