Treex::Tool::Segment::RuleBased - Rule based pseudo language-independent sentence segmenter
Sentence boundaries are detected by regex rules that look for end-sentence punctuation ([.?!]) followed by an uppercase letter. This class is implemented in a pseudo language-independent way, but it can serve as an ancestor for language-specific segmenters, either by wrapping the method that does the segmentation with an around modifier (see Moose::Manual::MethodModifiers) or just by overriding the helper methods described below.
Returns list of sentences
Do the segmentation (handling
- $text = split_at_terminal_punctuation($text)
Adds newlines after terminal punctuation followed by an uppercase letter.
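The rule described here can be sketched in a few lines of core Perl. This is only an illustration under stated assumptions, not the module's actual code: the function name split_at_terminal_punctuation_sketch is hypothetical, and the sketch assumes that the whitespace after the punctuation is replaced by the newline.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the method described above (not the module's code).
sub split_at_terminal_punctuation_sketch {
    my ($text) = @_;

    # After [.?!], replace the following whitespace with a newline,
    # but only when the next character is an uppercase letter.
    $text =~ s/([.?!])\s+(?=\p{Lu})/$1\n/g;
    return $text;
}

print split_at_terminal_punctuation_sketch('It works. Does it? Yes! ok. fine'), "\n";
# prints three lines:
#   It works.
#   Does it?
#   Yes! ok. fine
```

The lookahead keeps the uppercase letter out of the match, so only the whitespace is rewritten and consecutive boundaries are all found in one pass.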
- $text = apply_contextual_rules($text)
Adds unbreakers (<<<DOT>>>) and hard breaks (\n) using the whole context, not just a single word.
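A minimal sketch of how an unbreaker marker can protect a period from the splitting rule. The pattern in $UNBREAKERS, the function name segment_sketch, and the three-step order (mask, split, restore) are assumptions made for illustration; the module's real rules may differ.

```perl
use strict;
use warnings;

# Hypothetical unbreaker pattern: single uppercase initials and a few titles.
my $UNBREAKERS = qr/\p{Lu}|Dr|Prof|Mr|Mrs/;

sub segment_sketch {
    my ($text) = @_;

    # 1. Mask the period after an unbreaker so it cannot trigger a split.
    $text =~ s/\b($UNBREAKERS)\./${1}<<<DOT>>>/g;

    # 2. Break after terminal punctuation followed by an uppercase letter.
    $text =~ s/([.?!])\s+(?=\p{Lu})/$1\n/g;

    # 3. Restore the masked periods.
    $text =~ s/<<<DOT>>>/./g;

    return split /\n/, $text;
}

my @sentences = segment_sketch('Dr. Smith met J. Novak. They talked.');
print "$_\n" for @sentences;
# prints two lines:
#   Dr. Smith met J. Novak.
#   They talked.
```

Without step 1, the same input would be broken after "Dr." and after "J.", which is the error the unbreakers are meant to prevent.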
Returns regex that should match tokens that usually do not end a sentence even if they are followed by a period and a capital letter:
* single uppercase letters serve usually as first name initials
* in language-specific descendants consider adding:
  * period-ending items that never indicate sentence breaks
  * titles before names of persons etc.
Returns string with characters that can appear before the first word of a sentence
Returns string with characters that can appear after period (or other end-sentence symbol)
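The two character strings described above could plug into the splitting rule roughly as follows. This is a sketch under assumptions: the variables $OPENINGS and $CLOSINGS, their ASCII-only contents, and the function name split_with_context_sketch are hypothetical and chosen only to show the idea.

```perl
use strict;
use warnings;

# Hypothetical character sets (ASCII only in this sketch).
my $OPENINGS = q{"'(};   # may appear before the first word of a sentence
my $CLOSINGS = q{"')};   # may appear after the end-sentence symbol

sub split_with_context_sketch {
    my ($text) = @_;

    # Closing characters stay attached to the sentence that ends;
    # opening characters are allowed before the uppercase letter
    # that starts the next sentence.
    $text =~ s/([.?!][\Q$CLOSINGS\E]*)\s+(?=[\Q$OPENINGS\E]*\p{Lu})/$1\n/g;
    return $text;
}

print split_with_context_sketch('He said "Stop." "Why?" (No answer.) ok'), "\n";
# prints three lines:
#   He said "Stop."
#   "Why?"
#   (No answer.) ok
```

Keeping the sets as plain strings lets a language-specific descendant extend them, for example with typographic quotes, without touching the regex itself.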
Martin Popel <email@example.com>
Ondřej Dušek <firstname.lastname@example.org>
Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.