NAME

Treex::Tool::EnglishMorpho::Lemmatizer - rule-based lemmatizer for English

VERSION

version 2.20151102

SYNOPSIS

use Treex::Tool::EnglishMorpho::Lemmatizer;
my $lemmatizer    = Treex::Tool::EnglishMorpho::Lemmatizer->new();
my ($word,  $tag) = qw( goes VBZ );
my ($lemma, $neg) = $lemmatizer->lemmatize($word, $tag);
# $lemma = 'go', $neg = 0
($lemma, $neg) = $lemmatizer->lemmatize('unhappy', 'JJ');
# $lemma = 'happy', $neg = 1

METHODS

lemmatize: Accepts a word and its tag. Returns a pair: the lemma and a flag indicating whether the word was a negated form (for example, unhappy yields happy with the flag set to 1).

DESCRIPTION

Covers:

noun -s (dogs -> dog, ponies -> pony,..., mice -> mouse)
verb -s (does -> do,...)
verb -ing
verb -ed, -en
adjective/adverb -er
adjective/adverb -est
cut off negative prefixes (un|in|im|non|dis|il|ir)
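
The following is a minimal sketch of how tag-conditioned rules of this kind can be combined with an exception table. It is not the module's actual rule set: the names toy_lemmatize, %EXCEPTIONS and %KNOWN_BASE are hypothetical, and the rules are deliberately reduced to a few cases from the list above. The prefix rule only fires when the remainder is a known base word, because a bare prefix pattern would also strip "in" from words such as interesting.

```perl
use strict;
use warnings;

# Hypothetical exception table: irregular forms are looked up before any rule.
my %EXCEPTIONS = (
    'mice NNS' => 'mouse',
    'does VBZ' => 'do',
);

# Hypothetical list of base adjectives that may carry a negative prefix.
my %KNOWN_BASE = map { $_ => 1 } qw(happy possible legal);

# Negative prefixes from the list above; the remainder is captured in $1.
my $NEG_PREFIX = qr/^(?:un|in|im|non|dis|il|ir)(.+)$/;

# Toy lemmatizer: takes (word, Penn tag), returns (lemma, negation flag).
sub toy_lemmatize {
    my ($word, $tag) = @_;
    my $lemma = lc $word;
    my $neg   = 0;

    # 1. Exceptions win over rules.
    return ($EXCEPTIONS{"$lemma $tag"}, 0) if exists $EXCEPTIONS{"$lemma $tag"};

    # 2. Suffix rules fire only for the matching tag.
    if ($tag eq 'NNS') {
        $lemma =~ s/ies$/y/ or $lemma =~ s/s$//;    # ponies -> pony, dogs -> dog
    }
    elsif ($tag eq 'VBZ') {
        $lemma =~ s/s$//;
    }
    elsif ($tag eq 'JJ') {
        # 3. Strip a negative prefix only when the remainder is a known base.
        if ($lemma =~ $NEG_PREFIX && exists $KNOWN_BASE{$1}) {
            $lemma = $1;
            $neg   = 1;
        }
    }
    return ($lemma, $neg);
}

for my $pair (['ponies', 'NNS'], ['mice', 'NNS'], ['unhappy', 'JJ'], ['interesting', 'JJ']) {
    my ($lemma, $neg) = toy_lemmatize(@$pair);
    print "$pair->[0]/$pair->[1] -> $lemma (neg=$neg)\n";
}
```

Looking up exceptions first keeps the suffix rules simple, and keying every rule on the tag is what makes the input tagging matter so much.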

Input requirements

Tokenization: "doesn't" should be tokenized as two tokens, does and n't, which are then lemmatized as do and not.
Tagging: Correct Penn-style tagging is crucial for the Lemmatizer. For example, it leaves words tagged NN and NNP unchanged and changes only NNS and NNPS. So (pence, NN) -> pence, but (pence, NNS) -> penny.
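
As an illustration of the tokenization requirement, the sketch below splits n't contractions before any tokens reach the lemmatizer. The helper split_negation is hypothetical and not part of this module; a real pipeline would normally rely on its tokenizer for this step.

```perl
use strict;
use warnings;

# Hypothetical helper: split an n't contraction into two tokens.
# Tokens without the contraction are returned unchanged.
sub split_negation {
    my ($token) = @_;
    return ($1, $2) if $token =~ /^(.+)(n't)$/i;
    return ($token);
}

# "doesn't" becomes the two tokens "does" and "n't".
my @tokens = map { split_negation($_) } qw(It doesn't work);
print join(' ', @tokens), "\n";
```

Each resulting token, together with the tag assigned by a tagger, can then be passed to lemmatize as shown in the SYNOPSIS.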

Differences from the previous implementation

The PEDT::MorphologyAnalysis module uses Morpha (written in Flex) and in some cases produces a different lemmatization.

Adverbs and adjectives: Morpha leaves comparatives and superlatives unchanged. PEDT::MorphologyAnalysis does only basic analysis (later -> lat).
Capitalization of proper names
Changes of NN
Latin words: The declension of words of Latin origin is deliberately not covered by any Lemmatizer rules. A few widely known English words of Latin origin are (or should be) covered by the exception files (e.g. indices NNS -> index). In my opinion, it is better, especially for translation purposes, to leave other Latin words unchanged, since they will mostly have the same form in the target language (biological terms like Spheniscidae). Morpha's Latin fallbacks sometimes produce funny errors: sci-fi -> sci-fus, Mitsubishi -> mitsubishus, Shanghai -> shanghaus, ...

TODO

improve this POD documentation
better list of exceptions
change the exceptions format from TSV to a stored Perl hash

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

COPYRIGHT

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
