NAME

Lingua::Interset::Tagset::HI::Conll - Driver for the Hindi tagset of the shared tasks at ICON 2009, ICON 2010 and COLING 2012, as used in the CoNLL data format.

VERSION

version 3.012

SYNOPSIS

  use Lingua::Interset::Tagset::HI::Conll;
  my $driver = Lingua::Interset::Tagset::HI::Conll->new();
  my $fs = $driver->decode("NN\tn\tgen-|num-sg|pers-|case-d|vib-|tam-|voicetype-");

or

  use Lingua::Interset qw(decode);
  my $fs = decode('hi::conll', "NN\tn\tgen-|num-sg|pers-|case-d|vib-|tam-|voicetype-");

DESCRIPTION

Interset driver for the Hindi tagset of the shared tasks at ICON 2009, ICON 2010 and COLING 2012, as used in the CoNLL data format. CoNLL tagsets in Interset are traditionally three values separated by tabs, coming from the CoNLL columns CPOS, POS and FEAT.
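The three-part layout can be illustrated with a minimal sketch in plain Perl. The helper split_conll_tag() is hypothetical and not part of Lingua::Interset. The sketch also assumes, based only on the example tag in the SYNOPSIS, that FEAT is a list of name-value pairs separated by vertical bars, with the name ending at the first hyphen.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Interset): splits an Interset-style
# CoNLL tag into CPOS, POS and a hash of the FEAT name-value pairs.
sub split_conll_tag
{
    my $tag = shift;
    # The three values are separated by tabs; the limit of 3 leaves any
    # further tab inside the third (FEAT) part.
    my ($cpos, $pos, $feat) = split(/\t/, $tag, 3);
    my %features;
    # Assumption: features are separated by "|" and the feature name ends
    # at the first hyphen. A feature without a value gets the empty string.
    foreach my $pair (split(/\|/, $feat // ''))
    {
        my ($name, $value) = split(/-/, $pair, 2);
        $features{$name} = $value // '';
    }
    return ($cpos, $pos, \%features);
}

my ($cpos, $pos, $features) = split_conll_tag("NN\tn\tgen-|num-sg|pers-|case-d|vib-|tam-|voicetype-");
print "CPOS=$cpos POS=$pos num=$features->{num} case=$features->{case}\n";
# prints: CPOS=NN POS=n num=sg case=d
```

The driver itself should be handed the whole tab-separated string, as in the SYNOPSIS; the split is shown only to make the structure of the tag visible.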

In the case of Hindi, the CoNLL data had to be filtered before collecting the input tags. The data of the ICON shared tasks were converted to CoNLL from the native Shakti Standard Format (SSF), and the CoNLL CPOS column contained a so-called chunk tag, which we do not want to decode. The conversion procedure was modified for the COLING 2012 shared task, and that data did not contain chunk tags in the CPOS column. We expect the 2012 format, that is:

The CPOS column contains the part-of-speech tag that was previously (during ICON tasks) in the POS column.

The POS column contains the value of the cat feature from the morphological analyzer. It is also a part-of-speech category but the set of tags is different, with different granularity. As these two POS tags come from different sources, there are occasional inconsistencies between their values. Inconsistent combinations may not be decoded correctly by this driver. They have been removed from the driver's list of known tags and they were not used to test the driver.

Finally, the FEAT column contains features and their values. Unlike in other CoNLL tagsets, some of the features in the Hindi treebank must not be considered part of the morphological tag. It is not necessary to remove them before the driver is used to decode, because they will simply be ignored; however, the driver will not output them when it is used to encode. We have removed the following features:

lex contains the lemma or stem.

cat contains the same value as the POS column.

chunkId identifies the chunk to which the word belongs.

chunkType is either head or child.

stype pertains to the entire sentence (declarative, imperative or interrogative).
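For illustration, removing the non-morphological features from a FEAT string could look roughly like the sketch below. The helper strip_non_morphological() is hypothetical and not part of Lingua::Interset, the sample input is invented (only the feature names are taken from this documentation), and the name-value syntax with a hyphen is assumed from the example tag in the SYNOPSIS.

```perl
use strict;
use warnings;

# Feature names that are not considered part of the morphological tag.
my %NON_MORPHOLOGICAL = map { $_ => 1 } qw(lex cat chunkId chunkType stype);

# Hypothetical helper (not part of Lingua::Interset): returns the FEAT string
# without the non-morphological features, preserving the original order.
sub strip_non_morphological
{
    my $feat = shift;
    my @kept;
    foreach my $pair (split(/\|/, $feat))
    {
        # Assumption: the feature name ends at the first hyphen.
        my ($name) = split(/-/, $pair, 2);
        push(@kept, $pair) unless defined($name) && $NON_MORPHOLOGICAL{$name};
    }
    return join('|', @kept);
}

# Invented input, for illustration only.
my $feat = "lex-x|cat-n|gen-m|num-sg|chunkId-NP|chunkType-head|stype-declarative|voicetype-active";
print strip_non_morphological($feat), "\n";
# prints: gen-m|num-sg|voicetype-active
```

Such filtering matters mainly when preparing a list of tags or comparing encoder output with the original data, since the decoder ignores these features anyway.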

A short description of the part-of-speech tags can be found at http://ltrc.iiit.ac.in/nlptools2010/documentation.php. More information is available in the annotators' manual at http://ltrc.iiit.ac.in/MachineTrans/publications/technicalReports/tr031/posguidelines.pdf.

SEE ALSO

Lingua::Interset, Lingua::Interset::Tagset, Lingua::Interset::FeatureStructure

AUTHOR

Dan Zeman <zeman@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

This software is copyright (c) 2017 by Univerzita Karlova (Charles University).

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.