NAME

Lingua::TreeTagger::Filter - handle a Lingua::TreeTagger output

VERSION

Version 0.01

SYNOPSIS

  use Lingua::TreeTagger;
  use Lingua::TreeTagger::Filter;

  # Create a tree tagger object.
  my $tagger = Lingua::TreeTagger->new(
      'language' => 'english',
      'options'  => [qw( -token -lemma -no-unknown )],
  );
  
  my $text = "This is a test";
  
  # Tag some text and get a new TaggedText object.
  my $tagged_text = $tagger->tag_text( \$text );
  
  # Create a filter object.
  my $filter = Lingua::TreeTagger::Filter->new();
  
  # This filter will extract all sequences starting with tag DT...
  $filter->add_element( tag => 'DT', );
  
  # ... followed by any sequence up to 4 elements...
  $filter->add_element( 'quantifier' => '{0,4}', );
  
  # ... and followed by a token whose lemma is not "house".
  $filter->add_element( lemma => '!house', );
  
  # Alternatively use the init_with_string method (see documentation below)
  $filter->init_with_string("tag=DT#quantifier={0,4}#lemma=!house");
  
  # Apply the filter to the taggedtext.
  my $result = $filter->apply($tagged_text);
  
  # Display the matching sequences.
  print( $result->as_text() );
  
  # Extract bi-grams.
  $result = $filter->extract_ngrams( $tagged_text, 2 );
  
  # Display the bi-grams.
  print( $result->as_text() );
  
  # Make a substitution.
  # Initialize the filter...
  $filter->init_with_string("tag=NN sub_tag=Test");
  
  # ...use the substitute() method
  $result = $filter->substitute($tagged_text);
  
  # ... display the result.
  print( $result->as_text() );
  
  

DESCRIPTION

This module allows you to search or modify sequences of tokens (tag, original, lemma) in a Lingua::TreeTagger::TaggedText object (the output of Lingua::TreeTagger).

METHODS

new()

You can use this constructor in two different ways:

empty filter

Initialize a new, empty filter. It doesn't need any attributes.

init_with_string filter

By using the constructor with a string argument, you can directly define the filter sequence. For more information about the syntax, see the documentation of the init_with_string() method. Example:

  my $string = "tag=NN#quantifier={0,4}#lemma=!house";
  my $filter = Lingua::TreeTagger::Filter->new( $string );

is equivalent to:

  my $filter = Lingua::TreeTagger::Filter->new();
  $filter->init_with_string( "tag=NN#quantifier={0,4}#lemma=!house" );

apply()

Takes one required argument, a Lingua::TreeTagger::TaggedText object. Optionally you can give a second argument with the value of 0.

This method is the core of this distribution. It applies the filter on a taggedtext and returns the matching sequences.

Each attribute of a filter element is inserted in a regular expression. The comparison is made as follows:

taggedtext_element =~ /^element_filter$/

To take a less abstract example:

    current token original attribute = house

    current filter element original attribute = house

    gives:

    house =~ /^house$/ -> match

So you can use all the possibilities offered by Perl regular expressions. By default the attribute is inserted between anchors to ensure a strict match. If you don't want to use anchors, you can set the corresponding anchor_attribute parameter (anchor_original, anchor_tag or anchor_lemma) to 0. For example, for the original value "house":

  original        => 'house',
  anchor_original => 0

gives the comparison

  something =~ /house/

So it will match "house", but also any word containing "house", such as "housewife" or "housekeeper".
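The effect of the anchors can be sketched in plain Perl, without the module. The helper attribute_matches() below is hypothetical and only mimics the comparison described above; it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::TreeTagger::Filter): compare a
# token attribute with a filter attribute, with or without anchors.
sub attribute_matches {
    my ( $token_value, $filter_value, $anchored ) = @_;

    # The (?:...) group keeps the anchors around the whole expression,
    # even if the filter value contains an alternation.
    my $regex = $anchored ? qr/^(?:$filter_value)$/ : qr/$filter_value/;

    return $token_value =~ $regex ? 1 : 0;
}

for my $word (qw( house housewife housekeeper )) {
    printf "%-12s anchored: %-3s unanchored: %s\n",
        $word,
        ( attribute_matches( $word, 'house', 1 ) ? 'yes' : 'no' ),
        ( attribute_matches( $word, 'house', 0 ) ? 'yes' : 'no' );
}
```

Only "house" passes the anchored comparison, while all three words pass the unanchored one.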

This comparison is made for each attribute (original, tag, lemma) of the current filter element and the current token. If every attribute matches, the current filter element matches the current token (the default attribute value "." matches everything).

If you want to negate an attribute, you can do so by adding a "!" in front of its value. This modifies the comparison as follows:

taggedtext_element !~ /element_filter/

You can modify the negation symbol if it's necessary. For further details see the documentation of the add_element() method.

For details on the attributes of a filter element, please refer to the documentation of the add_element() method.
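As a rough illustration of how these rules combine, here is a minimal sketch that compares one filter element with one token, both represented as plain hash references. The function element_matches_token() is hypothetical, and it assumes that negation simply inverts the result of the comparison and that an omitted attribute matches everything; the module's internals may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): does one filter element
# match one token? Both are plain hash references in this sketch.
sub element_matches_token {
    my ( $element, $token ) = @_;
    my $neg_symbol
        = defined $element->{neg_symbol} ? $element->{neg_symbol} : '!';

    for my $attribute (qw( original tag lemma )) {
        my $pattern = $element->{$attribute};
        next if !defined $pattern;    # omitted attribute: matches everything

        # A leading negation symbol inverts the comparison (assumption).
        my $negated = index( $pattern, $neg_symbol ) == 0;
        $pattern = substr( $pattern, length $neg_symbol ) if $negated;

        # Anchors are used unless anchor_<attribute> is set to 0.
        my $anchor   = $element->{"anchor_$attribute"};
        my $anchored = !( defined $anchor && $anchor eq '0' );

        my $regex = $anchored ? qr/^(?:$pattern)$/ : qr/$pattern/;
        my $hit   = $token->{$attribute} =~ $regex ? 1 : 0;
        $hit = !$hit if $negated;

        return 0 if !$hit;            # every attribute has to agree
    }
    return 1;
}

my $token = { original => 'houses', tag => 'NNS', lemma => 'house' };

print element_matches_token( { lemma => 'house' }, $token ), "\n";    # 1
print element_matches_token( { tag   => '!DT' },   $token ), "\n";    # 1
print element_matches_token( { tag   => 'NN' },    $token ), "\n";    # 0
print element_matches_token( { tag => 'NN', anchor_tag => 0 }, $token ), "\n";    # 1
```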

The return value is a Lingua::TreeTagger::Filter::Result object.

samples:

      $filter->add_element (
        tag => 'DT',
      );
      $filter->add_element (
        original => 'house',
      );

    This filter will match all the sequences in the tagged text which are made up of a determiner preceding the word "house" (for example the sequence "a house" will be matched). To match something larger, modify the filter as follows:

      $filter->add_element (
        tag => '!DT',
      );
      $filter->add_element (
        original        => 'house',
        anchor_original => 0,
      );

    This filter will match all the sequences of words in which any word but a determiner precedes a word containing "house", such as "nice housekeeper".

    Let's focus on the role of the quantifier:

      $filter->add_element (
        quantifier => '+',
        tag        => 'DT',
      );
      $filter->add_element (
        original => 'house',
      );

    This filter will match all the sequences which begin with one or more determiners immediately followed by the word "house". You can also use the other Perl quantifier symbols ("?", "*").

    You can define intervals:

      $filter->add_element (
        quantifier => '{1,3}',
        tag        => 'DT',
      );
      $filter->add_element (
        original => 'house',
      );

    This filter will match all the sequences beginning with one to three determiners followed by the word "house". You can also use the syntax {,3} (0 to 3) and {2,} (2 to infinity).
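To make the quantifier notation concrete, the following sketch translates a quantifier string into a minimum and a maximum number of repetitions. The function quantifier_bounds() is hypothetical and is not how the module necessarily stores quantifiers; it ignores the ungreedy forms.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a quantifier string into ( min, max ),
# with max left undefined when there is no upper bound.
sub quantifier_bounds {
    my ($quantifier) = @_;

    return ( 1, 1 )     if !defined $quantifier;    # default: exactly once
    return ( 1, undef ) if $quantifier eq '+';
    return ( 0, undef ) if $quantifier eq '*';
    return ( 0, 1 )     if $quantifier eq '?';
    return ( $1, $1 )   if $quantifier =~ /^\{(\d+)\}$/;    # {n}

    # {n,m}, {,m} and {n,}
    if ( $quantifier =~ /^\{(\d*),(\d*)\}$/ ) {
        my ( $min, $max ) = ( $1, $2 );
        return ( $min eq '' ? 0 : $min, $max eq '' ? undef : $max );
    }

    die "Unsupported quantifier: $quantifier\n";
}

for my $quantifier ( '+', '{1,3}', '{,3}', '{2,}' ) {
    my ( $min, $max ) = quantifier_bounds($quantifier);
    printf "%-6s min=%d max=%s\n",
        $quantifier, $min, defined $max ? $max : 'unbounded';
}
```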

apply_no_overlap()

This method is an extension of the basic apply() method. In the classic method, after a match, the filter continues its scan from the second element of the last matched sequence. With apply_no_overlap(), the scan continues from the first element after the last matched sequence. With this second method, an element cannot be part of two or more matching sequences.

Like the apply() method, it takes one argument, a Lingua::TreeTagger::TaggedText object.
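The difference between the two scanning strategies can be sketched on a plain list of tags. The scan() function below is hypothetical: it handles only a fixed-length sequence of tag patterns, without quantifiers, and returns the start index of every match.

```perl
use strict;
use warnings;

# Hypothetical sketch: scan a list of tags for a fixed sequence of tag
# patterns and return the start index of every match.
sub scan {
    my ( $tags, $patterns, $no_overlap ) = @_;
    my @starts;
    my $i = 0;

    while ( $i + @$patterns <= @$tags ) {
        my $matched = 1;
        for my $j ( 0 .. $#$patterns ) {
            my $pattern = $patterns->[$j];
            if ( $tags->[ $i + $j ] !~ /^(?:$pattern)$/ ) {
                $matched = 0;
                last;
            }
        }

        if ($matched) {
            push @starts, $i;

            # Classic scan: restart at the second element of the match.
            # No-overlap scan: restart right after the whole match.
            $i += $no_overlap ? scalar @$patterns : 1;
        }
        else {
            $i++;
        }
    }
    return @starts;
}

my @tags    = qw( NN NN NN NN DT );
my @pattern = qw( NN NN );

print join( ',', scan( \@tags, \@pattern, 0 ) ), "\n";    # 0,1,2
print join( ',', scan( \@tags, \@pattern, 1 ) ), "\n";    # 0,2
```

With overlap allowed, the second and third "NN" each belong to two matches; without overlap, each token is used at most once.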

substitute()

It takes one required argument, a Lingua::TreeTagger::TaggedText object.

This method is an extension of the apply() method. For each matching sequence the filter makes a second pass, in which each defined sub_attribute replaces the corresponding attribute of the original token. Example:

Original token:

  original = is
  tag      = VBZ
  lemma    = be

Filter Element:

  original = "."
  tag      = VBZ
  lemma    = "."
  sub_tag  = TEST

New token after substitution

  original = is
  tag      = TEST
  lemma    = be
  

Example:

The text "this is a trial" gives this tagged sequence:

  this    DT      this
  is      VBZ     be
  a       DT      a
  trial   NN      trial

If you use the substitute() method with a filter containing this single element:

  $filter->add_element(
    tag     => 'NN',
    sub_tag => 'Test',
  );

you get this new tagged sequence:

  this    DT      this
  is      VBZ     be
  a       DT      a
  trial   Test    trial

This method creates a copy of the tagged text object, so the original sequence is preserved in the original object. The new sequence is stored in the returned object.

If you don't define any sub_attribute, the method still runs and you will obtain a new object with the same sequence.
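The copy-then-replace behaviour described above can be sketched with plain hashes standing in for tokens. The function substitute_tokens() is hypothetical and only supports a single-element filter that selects on the tag; it is meant to show that the original tokens stay untouched.

```perl
use strict;
use warnings;

# Plain hashes stand in for the tokens of a tagged text (assumption made
# for the sake of illustration).
my @tokens = (
    { original => 'this',  tag => 'DT',  lemma => 'this' },
    { original => 'is',    tag => 'VBZ', lemma => 'be' },
    { original => 'a',     tag => 'DT',  lemma => 'a' },
    { original => 'trial', tag => 'NN',  lemma => 'trial' },
);

# Hypothetical sketch of a substitution with a one-element filter.
sub substitute_tokens {
    my ( $tokens, $element ) = @_;

    # Work on a copy so that the caller's tokens are left as they were.
    my @copy        = map { +{ %$_ } } @$tokens;
    my $tag_pattern = $element->{tag};

    for my $token (@copy) {
        next if $token->{tag} !~ /^(?:$tag_pattern)$/;
        for my $attribute (qw( original tag lemma )) {
            my $replacement = $element->{"sub_$attribute"};
            $token->{$attribute} = $replacement if defined $replacement;
        }
    }
    return \@copy;
}

my $new_tokens
    = substitute_tokens( \@tokens, { tag => 'NN', sub_tag => 'Test' } );

print join( "\t", @{$_}{qw( original tag lemma )} ), "\n" for @$new_tokens;
print "original tag is still: $tokens[3]{tag}\n";
```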

add_element()

Adds an element to the sequence. It can be an existing Lingua::TreeTagger::Filter::Element object or you can create a new one using this method.

This method takes named parameters.

position

an optional integer specifying where the element should be added in the sequence. If not defined, the element will be added at the end of the filter sequence (like a push).

existing element
element_object

optional, an existing Lingua::TreeTagger::Filter::Element

new element object

    All the parameters in this section must be omitted if the parameter element_object is defined. If element_object is not defined and a parameter (tag, original, lemma) is omitted, the value of the corresponding attribute in the corresponding filter element will be initialized to ".", which matches everything.

original

optional, a string containing the expression to be compared with the original attribute from the tokens of a Lingua::TreeTagger output.

tag

optional, a string containing the expression to be compared with the tag attribute from the tokens of a Lingua::TreeTagger output.

lemma

optional, a string containing the expression to be compared with the lemma attribute from the tokens of a Lingua::TreeTagger output.

quantifier

optional, a string, must respect the syntax of Perl quantifiers.

samples: +/*/?/{n}/{n,m}

The quantifier defines the number of repetitions of the current element in the filter sequence.

If element_object is not defined and quantifier is omitted, the quantifier attribute of the corresponding filter element will be initialized to 1 (the corresponding element must appear exactly once).

By default the quantifiers are greedy, but the definition is a little different from the one used in Perl regular expressions. Here greedy works strictly with the next element and not with the whole expression: the element will match as many tokens as possible while ensuring the match of the next element, but it doesn't ensure the match of the whole sequence. For example this filter:

  tag=DT#quantifier=*#lemma=house#quantifier=*#lemma=house

won't match this text: "This is a house, a nice house", because the first three elements already match from "a" to the second "house".

The quantifiers "+" and "*" can be used in an ungreedy way. Adding a "?" directly after the quantifier makes it ungreedy, which means that it will try to match the current element only as long as the next element doesn't match. For example this filter:

  tag=DT#quantifier=*?#lemma=house
  

applied to this text: "This is a great house, a nice house", will return [ "a great house", "a nice house" ] (the greedy version returns [ "a great house, a nice house", "a nice house" ]).

By default the attributes (tag, original, lemma) are inserted between anchors in the regular expression (see the documentation of apply() method for further details). If you want to avoid anchors you have to define the corresponding "anchor_attribute" parameter with the value of 0 (Caution: 0 is the only accepted value!).

anchor_original

optional, 0, an int

anchor_tag

optional, 0, an int

anchor_lemma

optional, 0, an int

For these attributes (tag, original, lemma), you can define a corresponding "sub_attribute" which will be used by the substitute() method. See the documentation of that method for further information.

sub_original

optional, a string

sub_lemma

optional, a string

sub_tag

optional, a string

You can negate an attribute by adding "!" in front of its value. For example, "tag=!ADJ" will match any non-adjective token. You can change this symbol by defining the "neg_symbol" parameter.

neg_symbol

optional, a string containing a single symbol which is used to negate an assertion. This:

  neg_symbol => '?'

implies that in the current element the default negation symbol ("!") is replaced by "?".

sample:

    add_element() creating a new element, with the parameters given explicitly (except "anchor_original", "anchor_lemma" and "neg_symbol"):

      $filter -> add_element (
          lemma         => 'be',
          original      => 'is',
          tag           => 'VBZ',
          anchor_tag    => 0,
          sub_tag       => 'Test',
          sub_original  => 'Test',
          sub_lemma     => 'Test',
          quantifier    => '1',
          position      => 1,
      );

remove_element()

This method allows you to remove an element from the filter. It requires one argument:

an int, the index of the element to remove.

init_with_string()

This method allows you to define a filter in a single instruction. It takes a string as argument.

Be careful, if you have any elements in your filter at this time, they will be deleted! Use add_element() if you want to complete a filter.

The syntax is:

"#" separates the elements (Lingua::TreeTagger::Filter::Element objects). Beginning the sequence with a "#" means that your filter begins with an element matching any token (wildcard). "something# #something" means that the method will insert a wildcard in place of the space. Inside an element, the syntax is as follows:

to define an attribute:

  attributes_name=value (no space must be inserted)

Example: "tag=NN" sets the value of tag for the corresponding element to "NN".

to separate the attributes:

  attributes_name1=value1 attributes_name2=value2

The space is used to separate attributes (which explains why spaces are forbidden inside an attribute definition). Example: "tag=NN original=house" sets the value of tag for the corresponding element to "NN" and the value of original for the same element to "house".

As with the add_element() method, you can negate an attribute by adding "!" in front of its value. Example: "tag=!JJ" will match any non-adjective token.

You can also change this symbol by defining the parameter "neg_symbol": "tag=?JJ neg_symbol=?" is equivalent to "tag=!JJ".
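The syntax rules above can be illustrated with a small stand-alone parser. The function parse_filter_string() is hypothetical and only splits the string into one hash of attributes per element; it does not build Lingua::TreeTagger::Filter::Element objects and may differ from the module's own parsing.

```perl
use strict;
use warnings;

# Hypothetical parser, for illustration only: returns one hash reference
# of attributes per filter element.
sub parse_filter_string {
    my ($string) = @_;
    my @elements;

    # "#" separates the elements. Perl's split keeps a leading empty field,
    # so a leading "#" yields an empty (wildcard) element.
    for my $chunk ( split /#/, $string ) {
        my %attributes;

        # Inside an element, whitespace separates the attributes.
        for my $pair ( split ' ', $chunk ) {
            my ( $name, $value ) = split /=/, $pair, 2;
            $attributes{$name} = $value;
        }
        push @elements, \%attributes;
    }
    return @elements;
}

for my $element ( parse_filter_string('lemma=be# #tag=!DT') ) {
    my @pairs = map {"$_ => $element->{$_}"} sort keys %$element;
    print @pairs ? join( ', ', @pairs ) : '(wildcard)', "\n";
}
```

An element with no attributes corresponds to a wildcard, as in the "something# #something" case.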

Samples:

    sample 1:

      $filter->init_with_string("tag=NN original=house#tag=!JJ");

    is equivalent to:

      $filter->add_element(
        tag      => 'NN',
        original => 'house',
      );
      
      $filter->add_element(
        tag => '!JJ',
      );

    sample 2:

      $filter->init_with_string("lemma=be# #tag=!DT");

    is equivalent to:

      $filter->add_element(
        lemma => 'be',
      );
      
      $filter->add_element();
      
      $filter->add_element(
        tag => '!DT',
      );

    sample 3:

      $filter->init_with_string("#tag=!DT");
      

    is equivalent to:

      $filter->add_element();
      
      $filter->add_element(
        tag => '!DT',
      );

extract_ngrams()

This method allows you to extract n-grams. It requires two arguments.

The first one is a Lingua::TreeTagger::TaggedText object (the tagged text from which you want to extract the n-grams).

The second one is the length of the sequence: 2 to extract 2-grams, 3 to extract 3-grams, and so on.

This method returns a Lingua::TreeTagger::Filter::Result object.
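For readers unfamiliar with n-grams, here is a minimal sketch on a plain list of words, assuming that every run of n consecutive tokens counts as one n-gram. The function ngrams() is hypothetical and returns array references rather than a Result object.

```perl
use strict;
use warnings;

# Hypothetical sketch: return every run of $n consecutive tokens.
sub ngrams {
    my ( $tokens, $n ) = @_;
    return () if $n < 1 || $n > @$tokens;

    # One n-gram starts at each index from 0 to (number of tokens - n).
    return map { [ @{$tokens}[ $_ .. $_ + $n - 1 ] ] } 0 .. @$tokens - $n;
}

my @words = qw( This is a test );
print join( ' ', @$_ ), "\n" for ngrams( \@words, 2 );
```

For the four words above, this prints the three bi-grams "This is", "is a" and "a test".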

ACCESSORS

get_sequence()

Read-only accessor for the 'sequence' attribute of a Filter object.

DIAGNOSTICS

apply()
Attempt to call apply() without any arguments

This exception is raised by the apply() method when the user doesn't give any argument.

Attempt to call apply() without a tagged text object

This exception is raised by the apply() method when the user gives an argument which is not a Lingua::TreeTagger::TaggedText object.

Attempt to call apply() with an empty filter

This exception is raised by the apply() method when the user tries to call the method with an empty filter (a filter without any filter element).

Attempt to call apply() with an empty tagged text

This exception is raised by the apply() method when the user gives an empty tagged text as argument.

substitute()
Attempt to call substitute() without any arguments

This exception is raised by the substitute() method when the user doesn't give any argument.

Attempt to call substitute() without a taggedtext_object

This exception is raised by the substitute() method when the user gives an argument which is not a Lingua::TreeTagger::TaggedText object.

Attempt to call substitute() with an empty filter

This exception is raised by the substitute() method when the user tries to call the method with an empty filter (a filter without any filter element).

Attempt to call substitute() with an empty tagged text

This exception is raised by the substitute() method when the user gives an empty tagged text as argument.

add_element()
Attempt to call add_element() with incorrect element object

This exception is raised by the add_element() method when the user gives an incorrect value for the element_object parameter (it should be a Lingua::TreeTagger::Filter::Element object).

Attempt to call add_element() with a non-numeric argument

This exception is raised by the add_element() method when the user gives a non-numerical value for the position parameter.

remove_element()
remove_element(), out of index

This exception is raised by the remove_element() method when the user gives an index value which is neither part of the sequence nor directly after it.

Attempt to call remove_element() without argument

This exception is raised by the remove_element() method when the user doesn't give any argument.

Attempt to call remove_element() with a non-numeric argument

This exception is raised by the remove_element() method when the user gives a non-numerical value as argument.

the asked element is not part of the sequence

This exception is raised by the remove_element() method when the user gives an index value which is not part of the sequence.

init_with_string()
Attempt to call init_with_string() without argument

This exception is raised by the init_with_string() method when the user doesn't give any argument.

extract_ngrams()
Attempt to call extract_ngrams() without argument

This exception is raised by the extract_ngrams() method when the user doesn't give any argument.

Attempt to call extract_ngrams() without a tagged text object

This exception is raised by the extract_ngrams() method when the user gives a first argument which is not a Lingua::TreeTagger::TaggedText object.

Attempt to call extract_ngrams() with an empty tagged text

This exception is raised by the extract_ngrams() method when the user gives an empty tagged text as first argument.

Attempt to call extract_ngrams() with a non-numerical argument

This exception is raised by the extract_ngrams() method when the user gives a non-numerical value as argument.

CONFIGURATION AND ENVIRONMENT

For the configuration and environment, please refer to the documentation of the required module Lingua::TreeTagger by Aris Xanthos. You will find there further information on how to install and run TreeTagger.

DEPENDENCIES

This is the base module of the Lingua::TreeTagger::Filter distribution. It uses modules Lingua::TreeTagger::Filter::Element, Lingua::TreeTagger::Filter::Result, and Lingua::TreeTagger::Filter::Result::Hit.

This module requires the Lingua::TreeTagger module. The two are designed to work together: several features and parts of this documentation are taken directly from that distribution.

This module requires module Moose and was developed using version 1.09. Please report incompatibilities with earlier versions to the author.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Benjamin Gay (Benjamin.Gay@unil.ch)

Please note that this distribution is still a beta version. I think that the basic cases are pretty well tested. I tested the content as best I could, but I fear that there are still some bugs, as the matter was really hard for me to test. So please report any bugs you find.

Patches are welcome.

ACKNOWLEDGEMENTS

The author is grateful to Aris Xanthos for his guidance in the realization of this project.

Thanks to Leonard Gay for his useful and quick feedback.

AUTHOR

Benjamin Gay, <Benjamin.Gay at unil.ch>

LICENSE AND COPYRIGHT

Copyright 2011 Benjamin Gay.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.