NAME

Lingua::Interset::Tagset::Multext - Common code for drivers of tagsets of the Multext-EAST project.

VERSION

version 2.043

SYNOPSIS

  package Lingua::Interset::Tagset::HR::Multext;
  use Moose; # provides the extends keyword
  extends 'Lingua::Interset::Tagset::Multext';

  # We must redefine the method that returns tagset identification, used by the
  # decode() method for the 'tagset' feature.
  sub get_tagset_id
  {
      # It should correspond to the last two parts in package name, lowercased.
      # Specifically, it should be the two-letter ISO 639-1 language code, followed by '::multext'.
      return 'hr::multext';
  }

  # We may add or redefine atoms for individual surface features.
  sub _create_atoms
  {
      my $self = shift;
      # Most atoms can be inherited but some have to be redefined.
      my $atoms = $self->SUPER::_create_atoms();
      $atoms->{verbform} = $self->create_atom (...);
      return $atoms;
  }

  # We must define the lists of surface features for all surface parts of speech!
  sub _create_feature_map
  {
      my $self = shift;
      my %features =
      (
          'N' => ['pos', 'nountype', 'gender', 'number', 'case', 'animateness'],
          ...
      );
      return \%features;
  }

  # We must define the list() method.
  sub list
  {
      my $self = shift;
      # The quoted terminator keeps the tags literal (no interpolation).
      my $list = <<'end_of_list';
  Ncmsn
  Ncmsg
  Ncmsd
  ...
  end_of_list
      my @list = split(/\r?\n/, $list);
      return \@list;
  }

DESCRIPTION

This module provides common code for drivers of the tagsets of the Multext-EAST project. All Multext-EAST tagsets share the same inventory of parts of speech and the same inventory of features, although not every feature is used in every language. Feature values are individual alphanumeric characters, and they are unified across languages: if a feature value appears in several languages, it is always encoded by the same character.

The tagsets are positional, i.e. the position of a character in the tag determines which feature it is a value of. The interpretation of the positions is defined separately for every language and for every part of speech. An empty value (for an unknown or irrelevant feature) is encoded by a dash ("-") if at least one of the following features has a non-empty value; otherwise it is simply omitted at the end of the tag.
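
The positional scheme can be illustrated with a small standalone sketch. It does not use this module's API: the functions decode_positions() and encode_positions() and the %feature_map variable are hypothetical names introduced only for this example, and the feature list for 'N' is copied from the SYNOPSIS. The sketch maps characters to surface features only; it does not translate them to Interset feature values.

```perl
use strict;
use warnings;

# Hypothetical feature map for illustration only. Real drivers define their
# own per-language lists in _create_feature_map().
my %feature_map =
(
    'N' => ['pos', 'nountype', 'gender', 'number', 'case', 'animateness'],
);

# Split a positional tag into (surface feature => value character) pairs.
# A dash marks an empty value and is skipped. Positions omitted at the end
# of the tag are never reached, so they stay undefined.
sub decode_positions
{
    my ($tag) = @_;
    die "Empty tag" unless defined $tag && length $tag;
    my @chars = split(//, $tag);
    my $features = $feature_map{$chars[0]}
        or die "Unknown part of speech in tag '$tag'";
    my %values;
    for my $i (0 .. $#chars)
    {
        last if $i > $#{$features};
        next if $chars[$i] eq '-';
        $values{$features->[$i]} = $chars[$i];
    }
    return \%values;
}

# Rebuild the tag: empty values become dashes, then trailing dashes are
# dropped because empty values at the end of the tag are omitted.
sub encode_positions
{
    my ($values) = @_;
    my $pos = $values->{pos} // '';
    my $features = $feature_map{$pos}
        or die "Unknown part of speech '$pos'";
    my $tag = join('', map { $values->{$_} // '-' } @{$features});
    $tag =~ s/-+$//;
    return $tag;
}

my $decoded = decode_positions('Nc-sn');
# Prints: case=n, nountype=c, number=s, pos=N
print join(', ', map { "$_=$decoded->{$_}" } sort keys %{$decoded}), "\n";
# Prints: Nc-sn
print encode_positions($decoded), "\n";
```

In the tag 'Nc-sn' the third position (gender) is empty but must be written as a dash because number and case follow it. The sixth position (animateness) is also empty, and since nothing follows it, it is left out.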

SEE ALSO

Lingua::Interset, Lingua::Interset::Tagset, Lingua::Interset::Tagset::CS::Multext, Lingua::Interset::Tagset::HR::Multext, Lingua::Interset::FeatureStructure

AUTHOR

Dan Zeman <zeman@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by Univerzita Karlova v Praze (Charles University in Prague).

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.