
NAME

Lingua::Interset::Tagset::UR::Conll - Driver for the tagset of the Hyderabad Urdu Treebank, as used in the CoNLL data format.

VERSION

version 3.015

SYNOPSIS

  use Lingua::Interset::Tagset::UR::Conll;
  my $driver = Lingua::Interset::Tagset::UR::Conll->new();
  my $fs = $driver->decode("NN\tn\tgen-|num-sg|pers-|case-d|vib-|tam-|voicetype-");

or

  use Lingua::Interset qw(decode);
  my $fs = decode('ur::conll', "NN\tn\tgen-|num-sg|pers-|case-d|vib-|tam-|voicetype-");

DESCRIPTION

Interset driver for the tagset of the Urdu treebank from Hyderabad, as used in the CoNLL data format. CoNLL tagsets in Interset are traditionally three values separated by tabs, coming from the CoNLL columns CPOS, POS and FEAT.
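
To make the three-part layout concrete, here is a minimal sketch in plain Perl that splits such a tag string on tabs. It does not use the driver itself; the helper name split_conll_tag is hypothetical and only illustrates the layout described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Interset): split a tag string
# into its three tab-separated fields. The limit of 3 keeps the third
# field intact even if it were to contain a tab.
sub split_conll_tag {
    my ($tag) = @_;
    my ($first, $second, $third) = split /\t/, $tag, 3;
    return ($first, $second, $third);
}

# The tag from the SYNOPSIS above.
my ($first, $second, $third) =
    split_conll_tag("NN\tn\tgen-|num-sg|pers-|case-d|vib-|tam-|voicetype-");
print "field 1: $first\n";
print "field 2: $second\n";
print "field 3: $third\n";
```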

In the case of Urdu, the CoNLL data had to be filtered before collecting the input tags. The data of the ICON shared tasks were converted to CoNLL from the native Shakti Standard Format (SSF), and the CoNLL CPOS column contained the so-called chunk tag, which we do not want to decode.

The POS column contains the value of the cat feature from the morphological analyzer. It is also a part-of-speech category but the set of tags is different, with different granularity. As these two POS tags come from different sources, there are occasional inconsistencies between their values. Inconsistent combinations may not be decoded correctly by this driver. They have been removed from the driver's list of known tags and they were not used to test the driver.

Finally, the FEAT column contains features and their values. Unlike in other CoNLL tagsets, some of the features in the Urdu treebank must not be considered part of the morphological tag. We have removed the features listed below. They do not have to be removed from the input when the driver is used to decode, because they are simply ignored; however, the driver will not output them when it is used to encode.

lex contains the lemma or stem.

cat contains the same value as the POS column.

chunkId identifies the chunk to which the word belongs.

chunkType is either head or child.

stype pertains to the entire sentence (declarative, imperative or interrogative).
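
The filtering can be sketched in plain Perl as follows. This is only an illustration under stated assumptions, not the code used to prepare the driver's tag list: the helper name filter_feat is hypothetical, the feature values in the sample string are made up, and the name-value layout joined by vertical bars is taken from the SYNOPSIS example.

```perl
use strict;
use warnings;

# The five features that are not treated as part of the morphological tag.
my %NON_MORPH = map { $_ => 1 } qw(lex cat chunkId chunkType stype);

# Hypothetical helper: drop the non-morphological features from a FEAT
# string of the form "name-value|name-value|...".
sub filter_feat {
    my ($feat) = @_;
    my @kept = grep {
        # The feature name is everything before the first hyphen.
        my ($name) = split /-/, $_, 2;
        !$NON_MORPH{$name};
    } split /\|/, $feat;
    return join '|', @kept;
}

# Made-up sample input; the values are for illustration only.
my $raw = "lex-x|cat-n|gen-f|num-sg|case-d|chunkId-NP|chunkType-head|stype-declarative";
print filter_feat($raw), "\n";    # prints "gen-f|num-sg|case-d"
```

Features with an empty value, such as gen- in the SYNOPSIS tag, pass through unchanged because only the feature name is checked.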

A short description of the part-of-speech tags can be found at http://ltrc.iiit.ac.in/nlptools2010/documentation.php. More information is available in the annotators' manual at http://ltrc.iiit.ac.in/MachineTrans/publications/technicalReports/tr031/posguidelines.pdf.

SEE ALSO

Lingua::Interset, Lingua::Interset::Tagset, Lingua::Interset::FeatureStructure

AUTHOR

Dan Zeman <zeman@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

This software is copyright (c) 2019 by Univerzita Karlova (Charles University).

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.