NAME

Lingua::Interset::Atom - Atomic driver for a surface feature.

VERSION

version 3.014

SYNOPSIS

  use Lingua::Interset::Atom;

  my $atom = Lingua::Interset::Atom->new
  (
      'surfeature'    => 'gender',
      'decode_map' =>

          { 'M' => ['gender' => 'masc', 'animateness' => 'anim'],
            'I' => ['gender' => 'masc', 'animateness' => 'inan'],
            'F' => ['gender' => 'fem'],
            'N' => ['gender' => 'neut'] },

      'encode_map' =>

          { 'gender' => { 'masc' => { 'animateness' => { 'inan' => 'I',
                                                         '@'    => 'M' }},
                          'fem'  => 'F',
                          '@'    => 'N' }}
  );

DESCRIPTION

Atom is a special case of a tagset driver. As the name suggests, the surface tags are considered atomic, i.e. indivisible. It provides an environment for easy mapping between surface strings and Interset features.

While Atom can be used to implement drivers of tagsets whose tags are not structured (such as en::penn or sv::mamba), it should also provide a means of defining “sub-drivers” for individual surface features within drivers of complex tagsets. For example, the Czech tags in the Prague Dependency Treebank are always strings of 15 characters where the i-th position in the string encodes the i-th surface feature (which may or may not directly correspond to a feature in Interset). A driver for the PDT tagset could internally construct atomic drivers for PDT gender, number, case etc.
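
The positional idea can be sketched in plain Perl without the library. The three-character toy tag, the @position_maps table and the decode_positional() function below are hypothetical and only illustrate the principle; a real PDT driver has 15 positions, many more values, and would build real atoms instead of bare hashes.

```perl
use strict;
use warnings;

# Hypothetical per-position maps for a toy 3-character tag
# (position 0 = gender, 1 = number, 2 = case).
my @position_maps =
(
    { 'M' => ['gender' => 'masc', 'animateness' => 'anim'],
      'F' => ['gender' => 'fem'] },
    { 'S' => ['number' => 'sing'],
      'P' => ['number' => 'plur'] },
    { '1' => ['case' => 'nom'],
      '4' => ['case' => 'acc'] }
);

# Decode each character with the map that belongs to its position.
# Unknown characters and surplus positions are silently skipped.
sub decode_positional
{
    my $tag = shift;
    my %features;
    my @chars = split(//, $tag);
    for my $i (0..$#chars)
    {
        last if $i > $#position_maps;
        my $assignments = $position_maps[$i]{$chars[$i]} or next;
        my %pairs = @{$assignments};
        @features{keys %pairs} = values %pairs;
    }
    return \%features;
}

my $toy_fs = decode_positional('MS4');
```

Here $toy_fs is a plain hash reference, not a Lingua::Interset::FeatureStructure object.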

ATTRIBUTES

surfeature

Name of the surface feature the atom describes. If the atom describes a whole tagset, the tagset id could be stored here. The surface features may be structured differently from Interset, e.g. there might be an agreement feature, which would map to the Interset features of person and number.

decode_map

A compact description of the mapping from the surface tags to the Interset feature values. It is a hash reference. Hash keys are surface tags. Hash values are references to arrays of assignments. Each array must have an even number of elements, and every pair of elements is a feature-value pair.

Example:

  { 'M' => ['gender' => 'masc', 'animateness' => 'anim'],
    'I' => ['gender' => 'masc', 'animateness' => 'inan'],
    'F' => ['gender' => 'fem'],
    'N' => ['gender' => 'neut'] }

Vertical bars may be used to separate multiple values of one feature. The other feature can have a structured value, so you can use standard Perl syntax to describe hash and/or array references.

  { 'name_of_dog' => [ 'pos' => 'noun', 'nountype' => 'prop', 'other' => { 'named_entity_type' => 'dog' } ],
    'wh_word'     => [ 'pos' => 'noun|adj|adv', 'prontype' => 'int|rel' ] }
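
To make the shape of the map concrete, the following sketch reads one entry of such a map as a list of feature-value pairs and splits multi-values on vertical bars. It is plain Perl and does not call the library; the function name expand_assignments() and the choice to represent multi-values as array references are assumptions made for this illustration only.

```perl
use strict;
use warnings;

my $decode_map =
{
    'F'       => ['gender' => 'fem'],
    'wh_word' => ['pos' => 'noun|adj|adv', 'prontype' => 'int|rel']
};

# Turn the flat list of assignments for one tag into a hash of features.
# String values containing vertical bars become array references;
# structured values (e.g. for 'other') are kept as they are.
sub expand_assignments
{
    my ($map, $tag) = @_;
    my @assignments = @{$map->{$tag} // []};
    die "Odd number of elements for tag '$tag'" if @assignments % 2;
    my %features;
    while (my ($feature, $value) = splice(@assignments, 0, 2))
    {
        if (!ref($value) && $value =~ /\|/)
        {
            $value = [split(/\|/, $value)];
        }
        $features{$feature} = $value;
    }
    return \%features;
}

my $wh = expand_assignments($decode_map, 'wh_word');
```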

encode_map

A compact description of the mapping from the Interset feature structure to the surface tags. It is a hash reference, possibly with nested hashes. The top-level hash must always have just one key, which is the name of an Interset feature. (It could be encoded without the hash but I believe that the whole map looks better this way.)

The top-level key leads to a second-level hash, which is indexed by the values of the feature. It is not necessary that all possible values are listed. A special value @, if present, means “everything else”. It is recommended to always mark the default value using @. Even if we list all currently known values of the feature, new values may be introduced to Interset in the future and we do not want to have to go back to all tagsets and update their encoding maps. (On the other hand, if there are values that the decode() method of the current atom does not generate but we still have a preferred output for them, the preference must be made explicit. For instance, if the language does not have the pluperfect tense, the map may still define that it be encoded the same way as the past tense.)

A feature may have a multi-value (several values joined and separated by vertical bars). A value (multi- or not) is always first sought using an exact match. If the search fails, both the current feature value and the keys of the value hash are treated as lists of values and their largest intersection is sought. If no overlap is found, the default @ decision is taken.

Example:

  { 'gender' => { 'masc'      => { 'animateness' => { 'inan' => 'I',
                                                      '@'    => 'M' }},
                  'fem|masc'  => 'T',
                  'fem'       => 'F',
                  '@'         => 'N' }}
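
The lookup order described above (exact match, then largest intersection, then @) can be sketched as a small recursive function over plain hashes. This is not the module's implementation: the function name lookup(), the use of a plain hash instead of a FeatureStructure object, and the tie-breaking rule (keys are visited in alphabetical order and the first largest overlap wins) are all assumptions of this sketch.

```perl
use strict;
use warnings;

my $encode_map =
{ 'gender' => { 'masc'     => { 'animateness' => { 'inan' => 'I',
                                                   '@'    => 'M' }},
                'fem|masc' => 'T',
                'fem'      => 'F',
                '@'        => 'N' }};

sub lookup
{
    my ($map, $features) = @_;
    return $map unless ref($map) eq 'HASH';    # a leaf is the surface tag
    my ($feature) = keys %{$map};              # exactly one key per level
    my $valuehash = $map->{$feature};
    my $value = $features->{$feature} // '';
    my $target;
    if (exists $valuehash->{$value})
    {
        $target = $valuehash->{$value};        # exact match
    }
    else
    {
        # Compare the value and every key as lists of values.
        my %have = map { $_ => 1 } split(/\|/, $value);
        my $best = 0;
        for my $key (sort grep { $_ ne '@' } keys %{$valuehash})
        {
            my $overlap = grep { $have{$_} } split(/\|/, $key);
            if ($overlap > $best)
            {
                $best = $overlap;
                $target = $valuehash->{$key};
            }
        }
        $target = $valuehash->{'@'} unless $best;    # default decision
    }
    return '' unless defined $target;
    return lookup($target, $features);
}

my $tag_inan = lookup($encode_map, { 'gender' => 'masc', 'animateness' => 'inan' });
my $tag_both = lookup($encode_map, { 'gender' => 'fem|masc' });
my $tag_part = lookup($encode_map, { 'gender' => 'fem|neut' });
my $tag_dflt = lookup($encode_map, { 'gender' => 'com' });
```

With the map above, the four calls yield I, T, F and N respectively: the third value has no exact match but overlaps with the key fem, and the fourth falls through to the default.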

The other feature, if queried by the map, receives special treatment. First, the tagset attribute must be filled in and its value is checked against the tagset feature. The value is only processed if the tagset ids match (otherwise an empty value is assumed). String values and array values (given as vertical-bar-separated strings) are processed similarly to normal features. In addition, it is possible to have a hash of subfeatures stored in other, and to query them as 'other/subfeature'.

Example:

  { 'other/subfeature1' => { 'x' => 'X',
                             'y' => 'Y',
                             '@' => { 'other/subfeature2' => { '1' => 'S',
                                                               '@' => '' }}}}

The corresponding decode_map would be in this case:

  {
      'X' => ['other' => {'subfeature1' => 'x'}],
      'Y' => ['other' => {'subfeature1' => 'y'}],
      'S' => ['other' => {'subfeature2' => '1'}]
  }
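
The special treatment of other can be sketched as a helper that resolves a queried feature name against a plain hash of features. The function name get_queried_value(), the hash-based feature structure and the tagset id xx::toy are hypothetical; the sketch only mirrors the rules stated above (tagset ids must match, otherwise the value is empty).

```perl
use strict;
use warnings;

# Return the value that an encoding map would see for the queried feature.
# $atom_tagset plays the role of the atom's tagset attribute.
sub get_queried_value
{
    my ($features, $query, $atom_tagset) = @_;
    # Ordinary features are returned directly.
    return $features->{$query} // '' unless $query =~ m{^other(?:/(.+))?$};
    my $subfeature = $1;
    # Values of 'other' only make sense within the tagset that produced them.
    return '' unless $atom_tagset ne '' && ($features->{tagset} // '') eq $atom_tagset;
    my $other = $features->{other};
    return '' unless defined $other;
    if (defined $subfeature)
    {
        return ref($other) eq 'HASH' ? ($other->{$subfeature} // '') : '';
    }
    return join('|', @{$other}) if ref($other) eq 'ARRAY';
    return ref($other) ? '' : $other;
}

my $features = { 'tagset' => 'xx::toy', 'other' => { 'subfeature1' => 'x' } };
my $same_tagset  = get_queried_value($features, 'other/subfeature1', 'xx::toy');
my $other_tagset = get_queried_value($features, 'other/subfeature1', 'xx::else');
```

The first call returns x; the second returns an empty string because the tagset ids differ.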

Note that in general it is not possible to automatically derive the encode_map from the decode_map or vice versa. However, there are simple instances of atoms where this is possible.
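
One such simple instance is a map in which every surface string sets exactly one value of a single feature. The sketch below inverts a map of that kind; the function name invert_simple_map(), the empty-string default under @, and the rule that the alphabetically first tag wins when two tags decode to the same value are assumptions of this illustration, not documented behavior of the module.

```perl
use strict;
use warnings;

# Build a one-feature encoding map from a map of surface string => value.
sub invert_simple_map
{
    my ($feature, $simple_decode_map) = @_;
    my %values;
    for my $tag (sort keys %{$simple_decode_map})
    {
        my $value = $simple_decode_map->{$tag};
        $values{$value} //= $tag;    # keep the first tag seen for a value
    }
    $values{'@'} //= '';             # assumed default: empty surface string
    return { $feature => \%values };
}

my $number_encode_map = invert_simple_map('number', { 'sg' => 'sing', 'pl' => 'plur' });
```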

tagset

Optional identifier of the tagset that this atom is part of. It is required when the encoding map queries values of the other feature (to check against the tagset feature that the values come from the same tagset). Default is empty string.

METHODS

decode()

  my $fs  = $driver->decode ($tag);

Takes a tag (string) and returns a Lingua::Interset::FeatureStructure object with corresponding feature values set.

decode_and_merge_hard()

  my $fs  = $driver1->decode ($tag1);
  $driver2->decode_and_merge_hard ($tag2, $fs);

Takes a tag (string) and a Lingua::Interset::FeatureStructure object. Adds the feature values corresponding to the tag to the existing feature structure. Replaces previous values in case of conflict.

decode_and_merge_soft()

  my $fs  = $driver1->decode ($tag1);
  $driver2->decode_and_merge_soft ($tag2, $fs);

Takes a tag (string) and a Lingua::Interset::FeatureStructure object. Adds the feature values corresponding to the tag to the existing feature structure. Merges lists of values if a feature already had a value set.
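
The difference between the hard and the soft variant can be illustrated on plain hashes whose values are vertical-bar-separated strings. The functions merge_hard() and merge_soft() below are hypothetical stand-ins, not the library's methods, and the order of values in a merged list (old values first) is an assumption.

```perl
use strict;
use warnings;

# Hard merge: a newly decoded value replaces the previous one.
sub merge_hard
{
    my ($fs, $new) = @_;
    $fs->{$_} = $new->{$_} for keys %{$new};
    return $fs;
}

# Soft merge: old and new values are combined into one multi-value,
# without duplicates.
sub merge_soft
{
    my ($fs, $new) = @_;
    for my $feature (keys %{$new})
    {
        my %seen;
        my @values = grep { !$seen{$_}++ }
            (split(/\|/, $fs->{$feature} // ''), split(/\|/, $new->{$feature}));
        $fs->{$feature} = join('|', @values);
    }
    return $fs;
}

my $hard = merge_hard({ 'gender' => 'masc', 'number' => 'sing' }, { 'gender' => 'fem' });
my $soft = merge_soft({ 'gender' => 'masc', 'number' => 'sing' }, { 'gender' => 'fem' });
```

After the hard merge the gender is fem; after the soft merge it is masc|fem. The number is untouched in both cases.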

encode()

  my $tag = $driver->encode ($fs);

Takes a Lingua::Interset::FeatureStructure object and returns the tag (string) in the given tagset that corresponds to the feature values. Note that some features may be ignored because they cannot be represented in the given tagset.

list()

  my $list_of_tags = $driver->list();

Returns a reference to the list of all known tags in this particular tagset. This is not directly needed to decode, encode or convert tags but it is very useful for testing and advanced operations over the tagset. Note however that many tagset drivers contain only an approximate list, created by collecting tag occurrences in some corpus.

merge_atoms()

  $atom0->merge_atoms($atom1, $atom2, ..., $atomN);

Takes references to one or more other atoms and merges (adds) their decoding maps to our decoding map. Ordering of the atoms matters: if several atoms define decoding of the same feature, the first definition will be used and the others will be ignored. The atom $self comes first.

Note that the encoding map will not change. This method is useful for tagsets where feature values appear without naming the feature. For example, instead of

  gender=masc|number=sing|case=nom

the tag only contains

  masc|sing|nom

Such tagsets require asymmetric processing. There is one big atom that decodes any feature value regardless of which feature it belongs to. But it does not encode anything. Then there are many small atoms for individual features. We cannot use them for decoding because we do not know which atom to pick until we have decoded the value. But we will use them for encoding because we know which features and in what order we want to encode for a particular part of speech.

We could define both the big decoding atom and the small encoding atoms manually. There is a drawback to it: we would be describing each feature twice at two different places in the source code. The merge_atoms() method gives us a better way: we will define the small atoms (both for decoding and encoding) and then create the big decoding atom by merging the small ones:

  # This code goes in a tagset driver, e.g. Lingua::Interset::Tagset::CS::Mytagset,
  # in a function that builds all necessary atoms, e.g. sub _create_atoms.
  my %atoms;
  $atoms{genderanim} = $self->create_atom
  (
      'surfeature' => 'genderanim',
      'decode_map' =>
      {
          'ma' => ['gender' => 'masc', 'animateness' => 'anim'],
          'mi' => ['gender' => 'masc', 'animateness' => 'inan'],
          'f'  => ['gender' => 'fem'],
          'n'  => ['gender' => 'neut']
      },
      'encode_map' =>
      {
          'gender' => { 'masc' => { 'animateness' => { 'inan' => 'mi',
                                                       '@'    => 'ma' }},
                        'fem'  => 'f',
                        '@'    => 'n' }
      }
  );
  $atoms{number} = $self->create_simple_atom
  (
      'intfeature' => 'number',
      'simple_decode_map' =>
      {
          'sg' => 'sing',
          'pl' => 'plur'
      }
  );
  $atoms{feature} = $self->create_atom
  (
      'surfeature' => 'feature',
      'decode_map' => {},
      'encode_map' => { 'pos' => {} } # The encoding map cannot be empty even if we are not going to use it.
  );
  $atoms{feature}->merge_atoms($atoms{genderanim}, $atoms{number});
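
The ordering rule (the first definition wins, and the receiving atom comes first) can be sketched on bare decoding maps. The function merge_decode_maps() is hypothetical, and the sketch assumes that a conflict means two maps define the same surface string; it does not touch any encoding map.

```perl
use strict;
use warnings;

# Add entries of the other maps to the target map.
# An entry that is already defined is never overwritten.
sub merge_decode_maps
{
    my ($target, @others) = @_;
    for my $map (@others)
    {
        for my $tag (keys %{$map})
        {
            $target->{$tag} = $map->{$tag} unless exists $target->{$tag};
        }
    }
    return $target;
}

my %big = ();
merge_decode_maps
(
    \%big,
    { 'f'  => ['gender' => 'fem'], 'n' => ['gender' => 'neut'] },
    { 'sg' => ['number' => 'sing'], 'n' => ['case' => 'nom'] }    # 'n' is ignored here
);
```

After the call, %big has the keys f, n and sg, and n still decodes to the neuter gender because the first map defined it.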

SEE ALSO

Lingua::Interset::Tagset, Lingua::Interset::FeatureStructure

AUTHOR

Dan Zeman <zeman@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

This software is copyright (c) 2019 by Univerzita Karlova (Charles University).

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.