NAME

Treex::Tool::Segment::RuleBased - Rule based pseudo language-independent sentence segmenter

VERSION

version 2.20151102

DESCRIPTION

Sentence boundaries are detected by regex rules that look for end-sentence punctuation ([.?!]) followed by an uppercase letter. This class is implemented in a pseudo language-independent way, but it can serve as an ancestor for language-specific segmenters, either by overriding the method segment_text (using around; see Moose::Manual::MethodModifiers) or simply by overriding the methods unbreakers, openings and closings.

See Treex::Block::W2A::EN::Segment
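
The core rule from the description can be illustrated with a minimal, self-contained sketch. It is not the module's actual implementation: the helper name naive_split and the exact regex are assumptions made for illustration only, and the sketch ignores unbreakers, openings and closings.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Treex): put a line break after [.?!]
# whenever the next non-space character is an uppercase letter.
sub naive_split {
    my ($text) = @_;
    $text =~ s/([.?!])\s+(?=\p{Lu})/$1\n/g;
    return split /\n/, $text;
}

my @sentences = naive_split('It works. Does it? Yes! see e.g. this example.');
print "$_\n" for @sentences;
```

Because "see" and "this" start with lowercase letters, the text after "Yes!" and "e.g." stays in one segment. A rule this simple would still break after initials and titles, which is what the unbreakers method is meant to prevent.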

METHODS

get_segments: Returns a list of sentences

METHODS TO OVERRIDE

segment_text: Does the segmentation (handling use_paragraphs and use_lines)
$text = split_at_terminal_punctuation($text): Adds newlines after terminal punctuation followed by an uppercase letter.
$text = apply_contextual_rules($text): Adds unbreakers (<<<DOT>>>) and hard breaks (\n) using the whole context, not just a single word.
unbreakers: Returns a regex that should match tokens that usually do not end a sentence even if they are followed by a period and a capital letter:
  * single uppercase letters, which usually serve as first name initials
  * in language-specific descendants, consider adding:
    * period-ending items that never indicate sentence breaks
    * titles before names of persons etc.
openings: Returns a string with characters that can appear before the first word of a sentence
closings: Returns a string with characters that can appear after a period (or other end-sentence symbol)
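
How unbreakers, openings and closings might interact can be sketched in plain Perl without Treex installed. Everything below is an assumption made for illustration: the variables $UNBREAKERS, $OPENINGS and $CLOSINGS stand in for the return values of the methods above, the sample values are invented, and segment_sketch is a hypothetical function, not the module's code. Only the <<<DOT>>> placeholder is taken from the documentation.

```perl
use strict;
use warnings;

# Illustrative stand-ins for the overridable methods (values are assumptions).
my $UNBREAKERS = qr/\p{Lu}|Dr|Mr|Prof/;   # initials plus a few English titles
my $OPENINGS   = q{"'(};                  # may precede the first word
my $CLOSINGS   = q{"')};                  # may follow the final punctuation

sub segment_sketch {
    my ($text) = @_;

    # 1. Mask the period after an unbreaker so it cannot trigger a break.
    $text =~ s/\b($UNBREAKERS)\./$1<<<DOT>>>/g;

    # 2. Break after terminal punctuation (plus optional closings) when
    #    optional openings and an uppercase letter follow.
    $text =~ s/([.?!][\Q$CLOSINGS\E]*)\s+(?=[\Q$OPENINGS\E]*\p{Lu})/$1\n/g;

    # 3. Restore the masked periods.
    $text =~ s/<<<DOT>>>/./g;

    return split /\n/, $text;
}

my @segments = segment_sketch('Dr. Smith met J. R. Jones. "Who was there?" He asked.');
print "$_\n" for @segments;
```

In this sketch the periods after "Dr", "J" and "R" are masked, so the only breaks fall after "Jones." and after the closing quotation mark. A language-specific descendant would, under the same assumptions, mainly need a different unbreakers regex.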

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

Ondřej Dušek <odusek@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
