Treex::Block::W2A::Segment - rule-based segmentation into sentences
version 2.20151102
# in scenario
W2A::Segment use_paragraphs=1 use_lines=0
Sentence boundaries are detected by regex rules that look for sentence-final punctuation ([.?!]) followed by an uppercase letter. This class is implemented in a pseudo language-independent way, but it can serve as a base class for language-specific segmentation: override the method get_segments using the around modifier (see Moose::Manual::MethodModifiers). The actual implementation is delegated to Treex::Tool::Segment::RuleBased.
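The core rule described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code in Treex::Tool::Segment::RuleBased; the names BOUNDARY and segment are hypothetical, and the pattern only recognizes ASCII uppercase letters.

```python
import re

# Hypothetical boundary rule: sentence-final punctuation, then whitespace,
# then an uppercase letter. Only ASCII uppercase is handled in this sketch.
BOUNDARY = re.compile(r'(?<=[.?!])\s+(?=[A-Z])')

def segment(text):
    """Split text wherever [.?!] is followed by whitespace and an uppercase letter."""
    return [s for s in BOUNDARY.split(text.strip()) if s]

if __name__ == "__main__":
    for sentence in segment("Hello world. This is it! is it? Yes."):
        print(sentence)
```

A rule this simple also splits after abbreviations such as "Dr. Smith", which is one reason language-specific subclasses are useful.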
Should paragraph boundaries be preserved as sentence boundaries? A paragraph boundary is defined as two or more consecutive newlines.
Should newlines in the text be preserved as sentence boundaries? To detect sentence boundaries based on newlines alone, use W2A::SegmentOnNewlines instead.
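The two newline-related options can be pictured as a pre-splitting step that runs before any punctuation rules. The following Python sketch is an illustration under that assumption, not the module's implementation; split_blocks is a hypothetical helper, and its keyword defaults simply mirror the scenario line use_paragraphs=1 use_lines=0.

```python
import re

def split_blocks(text, use_paragraphs=True, use_lines=False):
    """Pre-split text at hard boundaries (hypothetical helper)."""
    if use_lines:
        # Every newline is a boundary; this also covers paragraph breaks.
        pattern = r'\n+'
    elif use_paragraphs:
        # Only two or more consecutive newlines form a boundary.
        pattern = r'\n{2,}'
    else:
        # No hard boundaries: the whole text stays in one block.
        return [text] if text.strip() else []
    return [block.strip() for block in re.split(pattern, text) if block.strip()]

if __name__ == "__main__":
    sample = "first line\nsecond line\n\nnew paragraph"
    print(split_blocks(sample))                  # paragraph mode
    print(split_blocks(sample, use_lines=True))  # line mode
```

In paragraph mode a single newline stays inside its block, so the sample yields two blocks; in line mode it yields three.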
Should very long segments (longer than the given number of words) be split? The word count is only approximate: it is obtained by counting whitespace, not by full tokenization. Set to zero to disable this function completely (the default is 250, as longer sentences often cause the parser to fail).
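As a rough illustration of the word limit, the Python sketch below counts words by whitespace and cuts an over-long segment into chunks of at most that many words. The function name split_long and the parameter name limit_words are hypothetical, and how the real tool chooses its split points is not described here, so fixed-size chunking is an assumption.

```python
def split_long(segment, limit_words=250):
    """Split a segment into chunks of at most limit_words words (hypothetical helper)."""
    if limit_words <= 0:
        # Zero disables splitting entirely.
        return [segment]
    # Approximate word count: whitespace-separated items, no real tokenization.
    words = segment.split()
    if len(words) <= limit_words:
        return [segment]
    return [
        ' '.join(words[i:i + limit_words])
        for i in range(0, len(words), limit_words)
    ]

if __name__ == "__main__":
    print(split_long("one two three four five", limit_words=2))
```

Because punctuation attached to a word is not counted separately, the result can differ from what a tokenizer would report, which is why the limit is only approximate.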
Minimum number of words on a line needed to switch on the list detection rules; 0 = never, 1 = always (default: 100). Words are counted by whitespace only.
Treex::Tool::Segment::RuleBased
Treex::Block::W2A::EN::Segment
Martin Popel <popel@ufal.mff.cuni.cz>
Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.