NAME

Lingua::Interset::FeatureStructure - Definition of morphosyntactic features and their values.

VERSION

version 3.015

SYNOPSIS

  use Lingua::Interset::FeatureStructure;

  print(join(', ', Lingua::Interset::FeatureStructure->known_features()), "\n");

DESCRIPTION

DZ Interset is a universal framework for reading, writing, converting and interpreting part-of-speech and morphosyntactic tags from multiple tagsets of many different natural languages.

The FeatureStructure class defines all morphosyntactic features and their values used in DZ Interset. An object of this class represents a morphosyntactic tag for a natural language word.

More information is given at the DZ Interset project page, https://wiki.ufal.ms.mff.cuni.cz/user:zeman:interset:features.

METHODS

set()

A generic setter for any feature. These two statements do the same thing:

  $fs->set ('pos', 'noun');
  $fs->set_pos ('noun');

If you want to set multiple values of a feature, there are several ways to do it:

  $fs->set ('tense', ['pres', 'fut']);
  $fs->set ('tense', 'pres', 'fut');
  $fs->set ('tense', 'pres|fut');

All of the above mean that the word is either in present or in future tense.

Note that the 'other' feature behaves differently. Its value can be structured; set() will keep the structure and will not try to interpret it.

Using the generic set() is more flexible than using specialized setters such as set_pos(). Even if the flexibility is not needed, it is recommended to avoid the specialized setters and use the generic set() method. If multiple values are set using an array reference, the specialized setters do not create a deep copy of the array; they only copy the reference. The generic set() creates a deep copy, so the array is not shared among several feature structures. This matters because someone who later retrieves the array reference via get() and modifies the array will probably expect to change only that particular feature structure, not others that happen to use the same array.
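
The difference between copying a reference and copying the array can be illustrated without the module. The following is a minimal sketch on plain hashes; store_reference() and store_copy() are hypothetical helpers that only imitate the two behaviors described above and are not part of Lingua::Interset.

```perl
use strict;
use warnings;

# Hypothetical helpers operating on plain hashes, not on FeatureStructure objects.

# Stores the caller's array reference as is (the behavior described for the specialized setters).
sub store_reference {
    my ($fs, $feature, $values) = @_;
    $fs->{$feature} = $values;
    return;
}

# Stores a copy of the array (the behavior described for the generic set()).
sub store_copy {
    my ($fs, $feature, $values) = @_;
    $fs->{$feature} = [ @{$values} ];
    return;
}

my @tenses = ('pres', 'fut');
my (%shared_a, %shared_b, %copied_a, %copied_b);
store_reference(\%shared_a, 'tense', \@tenses);
store_reference(\%shared_b, 'tense', \@tenses);
store_copy(\%copied_a, 'tense', \@tenses);
store_copy(\%copied_b, 'tense', \@tenses);

push @{ $shared_a{tense} }, 'past';   # modifies the one array that both structures share
push @{ $copied_a{tense} }, 'past';   # modifies only the array owned by %copied_a

print(join('|', @{ $shared_b{tense} }), "\n");   # pres|fut|past
print(join('|', @{ $copied_b{tense} }), "\n");   # pres|fut
```

For a flat array of strings, a one-level copy as above is equivalent to a deep copy.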

add()

  $fs->add ('pos' => 'conj', 'conjtype' => 'coor');

Sets several features at once. Takes a list of value assignments, i.e. an array with an even number of elements (feature1, value1, feature2, value2, ...). This is useful when defining decoders from physical tagsets. Typically, one wants to define a table of assignments for each part of speech or input feature:

  'CC' => ['pos' => 'conj', 'conjtype' => 'coor']
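
Such a table could be used roughly as follows. This is only a sketch on plain hashes: %assignments_for and sketch_decode() are hypothetical names, and the 'NNS' entry is an invented example, not taken from any real tagset driver.

```perl
use strict;
use warnings;

# Table of assignments per physical tag. Only the 'CC' entry comes from the text above.
my %assignments_for = (
    'CC'  => ['pos' => 'conj', 'conjtype' => 'coor'],
    'NNS' => ['pos' => 'noun', 'number' => 'plur'],   # hypothetical entry
);

# Looks up the tag and applies its (feature, value) pairs one by one.
sub sketch_decode {
    my ($tag) = @_;
    my %features;
    my @assignments = @{ $assignments_for{$tag} || [] };
    while (my ($feature, $value) = splice(@assignments, 0, 2)) {
        $features{$feature} = $value;
    }
    return \%features;
}

my $decoded = sketch_decode('CC');
print("pos=$decoded->{pos}, conjtype=$decoded->{conjtype}\n");   # pos=conj, conjtype=coor
```

With a real feature structure, the loop would presumably reduce to passing the dereferenced list to add(), since add() accepts exactly this kind of flat list.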

set_hash()

  my %hash = ('pos' => 'noun', 'number' => 'plur');
  $fs->set_hash (\%hash);

Takes a reference to a hash of features and their values. Sets the values of the features in this FeatureStructure. Unknown features are ignored. Known features that are not set in the hash will be (re-)set to empty values.

merge_hash_hard()

  my %hash = ('pos' => 'noun', 'number' => 'plur');
  $fs->merge_hash_hard (\%hash);

Takes a reference to a hash of features and their values. Sets the values of the features in this FeatureStructure. Unknown features are ignored. Known features that are not set in the hash will be left untouched; this is the difference from set_hash(). However, if the current value of a feature is non-empty and the hash contains a different non-empty value, the current value will be replaced by the one from the hash.

merge_hash_soft()

  my %hash = ('pos' => 'noun', 'number' => 'plur');
  $fs->merge_hash_soft (\%hash);

Takes a reference to a hash of features and their values. Sets the values of the features in this FeatureStructure. Unknown features are ignored. Known features that are not set in the hash will be left untouched; this is the difference from set_hash(). Known features that are set both in the hash and in this feature structure will be merged into a list of values (any single value will occur at most once). This is the difference from merge_hash_hard().
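
The three update strategies (set_hash(), merge_hash_hard(), merge_hash_soft()) can be compared side by side. The sketch below imitates their documented behavior on plain hashes with an assumed three-feature inventory; the sketch_* functions are hypothetical and are not the module's implementation.

```perl
use strict;
use warnings;

my @KNOWN_FEATURES = qw(pos gender number);   # assumed, tiny feature inventory

sub is_empty {
    my ($value) = @_;
    return (!defined($value) || $value eq '') ? 1 : 0;
}

sub value_list {
    my ($value) = @_;
    return () if is_empty($value);
    return ref($value) eq 'ARRAY' ? @{$value} : ($value);
}

# Like set_hash(): known features missing from the update are reset to empty.
sub sketch_set_hash {
    my ($fs, $new) = @_;
    foreach my $f (@KNOWN_FEATURES) {
        $fs->{$f} = is_empty($new->{$f}) ? '' : $new->{$f};
    }
    return $fs;
}

# Like merge_hash_hard(): missing features are kept, conflicting values are replaced.
sub sketch_merge_hard {
    my ($fs, $new) = @_;
    foreach my $f (@KNOWN_FEATURES) {
        next if is_empty($new->{$f});
        $fs->{$f} = $new->{$f};
    }
    return $fs;
}

# Like merge_hash_soft(): missing features are kept, conflicting values are combined.
sub sketch_merge_soft {
    my ($fs, $new) = @_;
    foreach my $f (@KNOWN_FEATURES) {
        next if is_empty($new->{$f});
        my %seen;
        my @merged = grep { !$seen{$_}++ } (value_list($fs->{$f}), value_list($new->{$f}));
        $fs->{$f} = @merged == 1 ? $merged[0] : [@merged];
    }
    return $fs;
}

my %update = (number => 'plur', foo => 'bar');   # 'foo' is unknown and is ignored
my $after_set  = sketch_set_hash  ({ pos => 'noun', gender => 'masc', number => 'sing' }, \%update);
my $after_hard = sketch_merge_hard({ pos => 'noun', gender => 'masc', number => 'sing' }, \%update);
my $after_soft = sketch_merge_soft({ pos => 'noun', gender => 'masc', number => 'sing' }, \%update);

print("set_hash: pos='$after_set->{pos}' number='$after_set->{number}'\n");     # pos='' number='plur'
print("hard:     pos='$after_hard->{pos}' number='$after_hard->{number}'\n");   # pos='noun' number='plur'
print("soft:     number=", join('|', @{ $after_soft->{number} }), "\n");        # number=sing|plur
```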

clear()

A generic setter that clears the value of a feature, i.e. removes the feature from the feature structure. All the following statements do the same thing:

  $fs->clear ('pos');
  $fs->clear_pos();
  $fs->set ('pos', '');
  $fs->set ('pos', undef);
  $fs->set_pos ('');
  $fs->set_pos (undef);

We can also clear several features at once:

  $fs->clear ('pos', 'prontype', 'gender');

get_nonempty_features()

Returns a list of names of features whose values are not empty.

  my @features = $fs->get_nonempty_features();
  my @values   = map { $fs->get_joined ($_) } @features;

The features are returned in a pre-defined (but not alphabetical) order.

get()

A generic getter for any feature. These two statements do the same thing:

  $pos = $fs->get ('pos');
  $pos = $fs->pos();

Be warned that you get an array reference if the feature has multiple values. It is usually better to use one of the alternative getters, such as get_joined() or get_list(), whose return type is well defined.

get_joined()

Similar to get() but always returns a scalar. If there is an array of disjoint values, it sorts them alphabetically and joins them using the vertical bar. Example: 'fem|masc'. The sorting makes comparisons easier; it is assumed that the actual ordering is not significant and that 'fem|masc' is identical to 'masc|fem'.

get_list()

Similar to get() but always returns a list of values. If there is an array of disjoint values, this is the list. If there is a single value (empty or not), this value will be the only member of the list.

Unlike get_joined(), this method does not sort the list before returning it.
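
The contrast between the scalar and the list view of a value can be sketched as follows. The functions sketch_joined() and sketch_list() are hypothetical stand-ins that take the raw value (a scalar or an array reference) directly; they only imitate the behavior described for get_joined() and get_list().

```perl
use strict;
use warnings;

# Always returns a scalar: multiple values are sorted and joined by '|'.
sub sketch_joined {
    my ($value) = @_;
    return ref($value) eq 'ARRAY' ? join('|', sort @{$value}) : $value;
}

# Always returns a list: multiple values keep their stored order.
sub sketch_list {
    my ($value) = @_;
    return ref($value) eq 'ARRAY' ? @{$value} : ($value);
}

my $gender = ['masc', 'fem'];
print(sketch_joined($gender), "\n");            # fem|masc
print(join(' ', sketch_list($gender)), "\n");   # masc fem
print(sketch_joined('neut'), "\n");             # neut
```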

get_hash()

  my $hashref = $fs->get_hash();

Creates a hash of all non-empty features and their values. The values are identical to what the get($feature) method would return; in particular, the value may be a reference to an array.

Returns a reference to the hash.

get_other_for_tagset()

  my $other = $fs->get_other_for_tagset ('cs::pdt');

Takes a tagset id. If it matches the value of the tagset feature, returns the value of the other feature (it returns a deep copy, which the caller may freely modify). If the tagset id does not match, the method returns an empty string.

set_other_subfeature()

  $fs->set_other_subfeature ('my_weird_feature', 'my_weird_value');

Takes a non-Interset feature and its value and stores it as a subfeature of the feature other. If other is currently undefined, empty or anything other than a hash reference, the method will first create a new hash and store its reference in other, overwriting its previous value (if any).

If other is a reference to a hash of subfeatures, the method will add the new subfeature and its value to the hash. If there has been a subfeature of the same name, its value will be overwritten.

Only simple scalar values of subfeatures are assumed. This is not verified, but no deep copy will be made if the value is a reference. Both the feature name and the value must be defined and non-empty; otherwise the method will do nothing.

Note that the function does not check the current value of the tagset feature. It is silently assumed that if you put anything in other, you know that this is “your” feature structure.

get_other_subfeature()

  my $value = $fs->get_other_subfeature ('cs::pdt', 'my_weird_feature');

Takes a tagset id and name of a non-Interset feature, stored as a subfeature of other. The other feature may have arbitrary values ranging from plain scalars to references to multi-level nested structures of hashes and arrays. This method focuses on the case that other contains a single-level hash. The hash keys can be seen as names of additional features that are otherwise not available in Interset. These additional features are subfeatures of other and their values are strings. This is one of the most useful ways of deploying the other feature.

If the given tagset id matches the value of the tagset feature, and the value of other is a hash reference, the method uses the subfeature name as a key to the hash. If there is a value stored under that key, it returns the value. (In case the value is not a string but a reference, a deep copy is created.) Otherwise it returns the empty string.

is_other()

  if ($fs->is_other ('cs::pdt', 'my_weird_tag') ||
      $fs->is_other ('cs::pdt', 'my_weird_feature', 'much_weirder_value'))
  {
      ...
  }

Takes a tagset id. If it does not match the value of the tagset feature, returns an empty string. If the tagset ids do match, the method queries the value of the other feature. Unlike get_other_for_tagset(), it does not create a deep copy of the possibly structured value. Instead, it only checks whether the feature has or contains one particular scalar value.

There are no a priori restrictions on the values of the other feature. The value can be a multi-level nested structure of hashes and arrays if necessary. However, most of the time it will be either a scalar value, or a flat (one-level) hash of feature-value pairs that cannot be stored using standard Interset features.

Besides the tagset id, this method takes one or two additional arguments. If the current value of other is a scalar, the method checks whether it equals Argument 1. If the current value is an array reference, the method checks whether the array contains Argument 1. If the current value is a hash reference, the method interprets Argument 1 as a hash key and checks whether the value stored under that key equals Argument 2.

It returns 1 when a match is found and 0 otherwise.
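
The three cases (scalar, array, hash) can be summarized in a short sketch. The function sketch_is_other() is hypothetical: it works on a plain hash with the keys tagset and other and only mirrors the decision logic described above.

```perl
use strict;
use warnings;

sub sketch_is_other {
    my ($fs, $tagset, $arg1, $arg2) = @_;
    # A different tagset means the other feature is not ours to interpret.
    return '' unless defined($fs->{tagset}) && $fs->{tagset} eq $tagset;
    my $other = $fs->{other};
    return 0 unless defined($other) && defined($arg1);
    if (ref($other) eq 'ARRAY') {
        # Array: does it contain Argument 1?
        my $found = grep { $_ eq $arg1 } @{$other};
        return $found ? 1 : 0;
    }
    if (ref($other) eq 'HASH') {
        # Hash: is Argument 2 stored under the key Argument 1?
        my $stored = $other->{$arg1};
        return (defined($arg2) && defined($stored) && $stored eq $arg2) ? 1 : 0;
    }
    # Scalar: does it equal Argument 1?
    return $other eq $arg1 ? 1 : 0;
}

my %with_hash   = (tagset => 'cs::pdt', other => { my_weird_feature => 'much_weirder_value' });
my %with_scalar = (tagset => 'cs::pdt', other => 'my_weird_tag');

print(sketch_is_other(\%with_hash, 'cs::pdt', 'my_weird_feature', 'much_weirder_value'), "\n");   # 1
print(sketch_is_other(\%with_scalar, 'cs::pdt', 'my_weird_tag'), "\n");                           # 1
print(sketch_is_other(\%with_scalar, 'cs::pdt', 'another_tag'), "\n");                            # 0
```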

contains()

  $fs->set ('prontype', 'int|rel');
  if($fs->contains ('prontype', 'int'))
  {
      print("One of the possible pronominal classes for this word is 'interrogative'.\n");
  }

Takes a feature and a value. Tests whether the given value is one of the current values of the feature. This function can be used instead of a simple if($fs->prontype() eq 'int') whenever we believe that arrays of values could occur.

set_upos()

Sets feature values according to a universal part-of-speech tag as defined in 2014 for the Universal Dependencies (http://universaldependencies.github.io/docs/).

get_upos(), upos()

Returns the universal part-of-speech tag as defined in 2014 for the Universal Dependencies (http://universaldependencies.github.io/docs/).

add_ufeatures()

  $fs->add_ufeatures ('Case=Nom', 'Gender=Masc,Neut');

Takes a list of feature-value pairs in the format prescribed by the Universal Dependencies (http://universaldependencies.org/), i.e. all features and values are capitalized, some features are renamed and all feature-value pairs are ordered alphabetically. Sets our feature values accordingly. Values of our features that are not mentioned in the input list will be left untouched.

This method does not complain about unknown features or values. They will be stored as subfeatures of the other feature. Hence it is possible to read the input even if it contains language-specific extensions that are not yet known to Interset.
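
To make the input format concrete, the following sketch splits such pairs into feature names and values. It is a simplification under stated assumptions: sketch_parse_ufeatures() is a hypothetical function that only lowercases names and values and splits multiple values on commas. It does not rename features and does not route unknown features to other, both of which the real method is described as doing.

```perl
use strict;
use warnings;

sub sketch_parse_ufeatures {
    my @pairs = @_;
    my %features;
    foreach my $pair (@pairs) {
        my ($name, $value) = split(/=/, $pair, 2);
        # Skip malformed pairs that lack a name or a value.
        next unless defined($name) && $name ne '' && defined($value) && $value ne '';
        my @values = map { lc($_) } split(/,/, $value);
        # A single value is stored as a scalar, multiple values as an array reference.
        $features{ lc($name) } = @values > 1 ? \@values : $values[0];
    }
    return \%features;
}

my $parsed = sketch_parse_ufeatures('Case=Nom', 'Gender=Masc,Neut');
print("case=$parsed->{case}\n");                             # case=nom
print("gender=", join('|', @{ $parsed->{gender} }), "\n");   # gender=masc|neut
```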

get_ufeatures()

  my @ufpairs = $fs->get_ufeatures();
  print (join ('|', @ufpairs));

Returns the list of feature-value pairs in the format prescribed by the Universal Dependencies (http://universaldependencies.github.io/docs/), i.e. all features and values are capitalized, some features are renamed and all feature-value pairs are ordered alphabetically.

matches()

  if ($fs->matches ('pos' => 'noun', 'gender' => '!masc', 'number' => '~(dual|plur)'))
  {
      ...
  }

Tests multiple features simultaneously. The input is a list of feature-value pairs; the return value is 1 if the structure matches all these values. This function is an abbreviation for a series of get_joined() calls in an if statement.

If the expected value is preceded by "!", the actual value must not be equal to the expected value. If the expected value is preceded by "~", then it is a regular expression which the actual value must match. If the expected value is preceded by "!~", then it is a regular expression which the actual value must not match.
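
The prefix rules can be sketched as a small function over a plain hash of joined values. The name sketch_matches() is hypothetical. Whether the real method anchors the regular expression is not stated here, so this sketch assumes an unanchored match.

```perl
use strict;
use warnings;

sub sketch_matches {
    my ($actual, @conditions) = @_;   # $actual: hash reference, feature => joined value
    while (my ($feature, $expected) = splice(@conditions, 0, 2)) {
        my $value = defined($actual->{$feature}) ? $actual->{$feature} : '';
        if ($expected =~ s/^!~//) {
            # "!~": the value must not match the regular expression.
            return 0 if $value =~ m/(?:$expected)/;
        }
        elsif ($expected =~ s/^!//) {
            # "!": the value must differ from the expected string.
            return 0 if $value eq $expected;
        }
        elsif ($expected =~ s/^~//) {
            # "~": the value must match the regular expression.
            return 0 unless $value =~ m/(?:$expected)/;
        }
        else {
            # No prefix: the value must equal the expected string.
            return 0 unless $value eq $expected;
        }
    }
    return 1;
}

my %word = (pos => 'noun', gender => 'fem', number => 'plur');
print(sketch_matches(\%word, 'pos' => 'noun', 'gender' => '!masc', 'number' => '~(dual|plur)'), "\n");   # 1
print(sketch_matches(\%word, 'pos' => 'noun', 'number' => '!~(dual|plur)'), "\n");                       # 0
```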

as_string()

Generates a textual representation of the feature structure so it can be printed. Features are in a predefined (but not alphabetical) order. Complex values of the other feature are serialized in depth. If a feature has multiple values, they are sorted alphabetically and delimited by the vertical bar character. What follows is a sample output for the cs::pdt tags NNMS1-----A----, Ck-P1---------- and VpQW---XR-AA---:

  [pos="noun", polarity="pos", gender="masc", animacy="anim", number="sing", case="nom", tagset="cs::pdt"]
  [pos="adj", numtype="ord", number="plur", case="nom", tagset="cs::pdt", other={"numtype" => "suffix"}]
  [pos="verb", polarity="pos", gender="fem|neut", number="plur|sing", verbform="part", tense="past", voice="act", tagset="cs::pdt"]

as_string_conllx()

Generates a textual representation of the feature structure in the form used in the FEATS column of the CoNLL-X file format. The tagset and other features are omitted. Features are in a predefined (but not alphabetical) order. If a feature has multiple values, they are sorted alphabetically and delimited by a comma (because the vertical bar is used to separate features). What follows is a sample output for the cs::pdt tags NNMS1-----A----, Ck-P1---------- and VpQW---XR-AA---:

  pos=noun|polarity=pos|gender=masc|animacy=anim|number=sing|case=nom
  pos=adj|numtype=ord|number=plur|case=nom
  pos=verb|polarity=pos|gender=fem,neut|number=plur,sing|verbform=part|tense=past|voice=act

If the values of all features (including pos) are empty, the method returns the underscore character. Thus the result is never undefined or empty.
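
The formatting rules above can be condensed into a short sketch. The function sketch_conllx() is hypothetical; it takes a plain hash and an explicit feature order, whereas the real method uses the order built into the class.

```perl
use strict;
use warnings;

sub sketch_conllx {
    my ($fs, @order) = @_;
    my @pairs;
    foreach my $feature (@order) {
        # The tagset and other features are never printed.
        next if $feature eq 'tagset' || $feature eq 'other';
        my $value = $fs->{$feature};
        next unless defined($value) && $value ne '';
        # Multiple values are sorted and joined by commas.
        $value = join(',', sort @{$value}) if ref($value) eq 'ARRAY';
        push(@pairs, "$feature=$value");
    }
    # An empty structure is rendered as a single underscore.
    return @pairs ? join('|', @pairs) : '_';
}

my @order = qw(pos polarity gender number verbform tense voice tagset);
my %participle = (
    pos => 'verb', polarity => 'pos', gender => ['neut', 'fem'], number => ['sing', 'plur'],
    verbform => 'part', tense => 'past', voice => 'act', tagset => 'cs::pdt',
);
print(sketch_conllx(\%participle, @order), "\n");
# pos=verb|polarity=pos|gender=fem,neut|number=plur,sing|verbform=part|tense=past|voice=act
print(sketch_conllx({}, @order), "\n");   # _
```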

is_noun()

Returns 1 if the value of the pos feature is noun. Also returns 1 if the pos feature has multiple values and one of them is noun, e.g. if get_joined('pos') eq 'adj|noun'. Note that pronouns also have pos=noun. If you want to exclude pronouns, test is_noun() && !is_pronominal().

is_abbreviation()

is_abessive()

is_ablative()

is_absolute_superlative()

is_absolutive()

is_accusative()

is_active()

is_additive()

is_adessive()

is_adjective()

is_admirative()

is_adposition()

is_adverb()

is_affirmative()

is_allative()

is_animate()

is_antipassive()

is_aorist()

is_archaic()

is_article()

is_associative()

is_augmentative()

is_auxiliary()

is_benefactive()

is_cardinal()

is_colloquial()

is_comitative()

is_common_gender()

is_comparative()

is_conditional()

is_conjunction()

is_conjunctive()

is_construct()

is_converb()

is_coordinator()

is_count_plural()

is_dative()

is_definite()

is_delative()

is_demonstrative()

is_desiderative()

is_destinative()

is_determiner()

is_diminutive()

is_direct_voice()

is_distributive()

is_dual()

is_elative()

is_elevating()

is_equative()

is_ergative()

is_essive()

is_exclamative()

is_exclusive()

is_factive()

is_feminine()

is_finite_verb()

is_first_hand()

is_first_person()

is_foreign()

is_formal()

is_fourth_person()

is_future()

is_genitive()

is_gerund()

is_gerundive()

is_greater_paucal()

is_greater_plural()

is_habitual()

is_human()

is_humbling()

is_hyph()

is_illative()

is_imperative()

is_imperfect()

is_impersonal()

is_inanimate()

is_inclusive()

is_indefinite()

is_indicative()

is_inessive()

is_infinitive()

is_informal()

is_instructive()

is_instrumental()

is_interjection()

is_interrogative()

is_intransitive()

is_inverse_number()

is_inverse_voice()

is_iterative()

is_jussive()

is_lative()

is_locative()

is_masculine()

is_mediopassive()

is_middle_voice()

is_modal()

is_motivative()

is_multiplicative()

is_narrative()

is_necessitative()

is_negative()

is_nominative()

is_non_first_hand()

is_nonhuman()

is_neuter()

is_numeral()

is_optative()

is_ordinal()

is_participle()

is_particle()

is_partitive()

is_passive()

is_past()

is_paucal()

is_perfect()

is_personal()

is_personal_pronoun()

is_pluperfect()

is_plural()

is_polite()

is_positive()

is_possessive()

is_potential()

is_present()

is_prolative()

is_pronominal()

is_pronoun()

is_proper_noun()

is_progressive()

is_prospective()

is_punctuation()

is_purposive()

is_quotative()

is_rare()

is_reciprocal()

is_reflexive()

is_relative()

is_second_person()

is_singular()

is_specific()

is_subjunctive()

is_sublative()

is_subordinator()

is_superessive()

is_superlative()

is_supine()

is_symbol()

is_temporal()

is_terminative()

is_third_person()

is_total()

is_transgressive()

is_transitive()

is_translative()

is_trial()

is_typo()

is_verb()

is_verbal_noun()

is_vocative()

is_wh()

is_zero_person()

enforce_permitted_values()

  $fs->enforce_permitted_values ($permitted_trie);

Makes sure that a feature structure complies with the permitted combinations recorded in a trie. Takes a Lingua::Interset::Trie object as a parameter. Replaces feature values if needed. (Note that even the empty value may or may not be permitted.)

duplicate()

Returns a new Lingua::Interset::FeatureStructure object that is a duplicate of the current structure. Makes sure that a deep copy is constructed if there are any complex feature values.

FUNCTIONS

known_features()

Returns the list of known feature names in print order.

priority_features()

Returns the list of known features ordered according to their default priority. The priority is used in Lingua::Interset::Trie when one looks for the closest matching permitted structure.

known_values()

Returns the list of known values of a feature, in print order. Dies if asked about an unknown feature.

feature_valid()

Takes a string and returns a nonzero value if the string is a name of a known feature.

value_valid()

Takes two scalars, $feature and $value. Tells whether they are a valid (known) pair of feature name and value. A reference to a list of valid values is also a valid value. This function does not die when the feature is not valid.

structure_to_string()

Recursively converts a structure to a string. The string uses Perl syntax for constant structures, so it can be used in eval.

get_replacements()

  my $replacements = Lingua::Interset::FeatureStructure->get_replacements();
  my $rep_adverb = $replacements->{pos}{adv};
  foreach my $r (@{$rep_adverb})
  {
      if(...)
      {
          # This replacement matches our constraints, let's use it.
          return $r;
      }
  }

Returns the set of replacement values for the case that a feature value is not permitted in a given context. It is a hash{feature}{value0}, leading to a list of values that can be used to replace value0, ordered by priority.

iseq()

  if (Lingua::Interset::FeatureStructure->iseq ($a, $b)) { ... }

Tests whether two values are equal. Takes two parameters. Each of them can be a scalar or an array reference.

array_to_scalar_value()

Converts array values to scalars. Sorts the array and combines all elements in one string, using the vertical bar as delimiter. Does not care about occurrences of vertical bars inside the elements (there should be none anyway).

Takes an array reference as parameter. If the parameter turns out to be a plain scalar, the function just returns it.

SEE ALSO

Lingua::Interset, Lingua::Interset::Tagset

AUTHOR

Dan Zeman <zeman@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

This software is copyright (c) 2019 by Univerzita Karlova (Charles University).

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.