
NAME

Lingua::EN::BioLemmatizer - Perl interface to the University of Colorado's BioLemmatizer

SYNOPSIS

Procedural summary:

    use Lingua::EN::BioLemmatizer qw(biolemma);
    print biolemma("phyla"), "\n";
    print biolemma("phyla", "NNS"), "\n";

    use Lingua::EN::BioLemmatizer qw(biolemma parse_response);
    my @triples = parse_response(biolemma("phyla"));

Object-Oriented summary:

    use Lingua::EN::BioLemmatizer;
    my $server = new Lingua::EN::BioLemmatizer;
    my $answer = $server->get_biolemma("phyla");
    my $tagged = $server->get_biolemma("phyla", "NNS");

    use Lingua::EN::BioLemmatizer qw(parse_response);
    my $server = new Lingua::EN::BioLemmatizer;
    my @triples = parse_response( $server->get_biolemma("phyla") );

DESCRIPTION

Perl module to interface with the University of Colorado's BioLemmatizer code. Both a procedural and an OO interface are supported. Requires Perl v5.10 or later: tested with v5.10, v5.12, and v5.14, and expected to work with later releases.

To use this module, you must first download the BioLemmatizer jarfile from http://biolemmatizer.sourceforge.net, and then set the environment variable BIOLEMMATIZER to the path of that jarfile. You also need a working Java installation. See the SourceForge documentation for details about the BioLemmatizer itself.

Procedural Interface

The procedural interface is an easy front-end to the underlying object interface. Its advantage is simplicity. Its disadvantage is that the resources associated with the server, including filehandles and a lemma cache, are held until the program exits. Use the OO interface if you want normal destructor behavior to release them for you.

$lemma = biolemma(STRING)

Returns the raw (unparsed) response from the BioLemmatizer server for the given string. Use parse_response to parse this.

@triples = parse_response(STRING)
$aref = parse_response(STRING)

Parses a raw response string into triples, each one an array ref of three fields. In list context, returns the list of triples; in scalar context, returns a reference to an array holding them.

For example, given an input of:

    "name vvz NUPOS||name VBZ PennPOS||name NNS PennPOS||name n2 NUPOS"

the list-context return works like this:

    @list_of_triples = (
      ["name", "vvz", "NUPOS"],
      ["name", "VBZ", "PennPOS"],
      ["name", "NNS", "PennPOS"],
      ["name", "n2", "NUPOS"],
    );

and the scalar-context return works like this:

    $ref_to_triples = [
      ["name", "vvz", "NUPOS"],
      ["name", "VBZ", "PennPOS"],
      ["name", "NNS", "PennPOS"],
      ["name", "n2", "NUPOS"],
    ];
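
The following is a minimal sketch of how such a response string could be split by hand. It is not the module's own implementation: parse_triples is a hypothetical stand-in, and the record separator ("||") and whitespace field separator are inferred only from the example input above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parse_response: records are assumed to be
# separated by "||" and the fields of each record by whitespace.
sub parse_triples {
    my ($response) = @_;
    my @triples = map { [ split ' ', $_ ] } split /\|\|/, $response;
    # Return the list in list context, an array ref in scalar context.
    return wantarray ? @triples : \@triples;
}

my $raw = "name vvz NUPOS||name VBZ PennPOS||name NNS PennPOS||name n2 NUPOS";

my @triples = parse_triples($raw);    # list context
my $aref    = parse_triples($raw);    # scalar context

# Print one tab-separated line per triple.
for my $triple (@triples) {
    my ($lemma, $tag, $tagset) = @$triple;
    print "$lemma\t$tag\t$tagset\n";
}
```

Choosing the return shape with wantarray lets the same call serve callers that want a flat list to loop over and callers that want a single reference to pass around.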

Object Interface

new()

Class constructor; must be called as a class method. Takes no arguments. To configure the object with non-default strings, first make class method calls to java_path, java_args, jar_path, or jar_args with the new strings as arguments.

get_biolemma(STRING)

Returns response from BioLemmatizer server when given a request of STRING.

command_args()

Returns all args used to start the server, either as a list in list context or else as one string in scalar context. Used as an object method, returns whatever value was extant when the object was constructed. Used as a class method, returns the current defaults.

java_path

Returns the current path to Java, which is "java" by default; can be reset by calling as a class method with a new path before a constructor is called. Used as an object method, returns whatever value was extant when object was constructed.

jar_path

Returns the current path to the BioLemmatizer jar file, which is "BioLemmatizer_interactive.jar" by default; can be reset by calling as a class method with a new path before a constructor is called. Used as an object method, returns whatever value was extant when object was constructed.

java_args

Returns any extra args passed to the Java program, either as a list in list context or as an array ref in scalar context. Default is ("-Xmx1G", "-Dfile.encoding=utf8") but this can be reset by calling as a class method with new arguments before a constructor is called. Used as an object method, returns whatever value was extant when object was constructed.

jar_args

Returns any final args passed after the jar file, either as a list in list context or as an array ref in scalar context. Default is ("-t") but this can be reset by calling as a class method with new arguments before a constructor is called. Used as an object method, returns whatever value was extant when object was constructed.
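
The accessors above share one pattern: called on the class they read or reset a default, and called on an object they report the value captured at construction. The toy class below is a sketch of that pattern only. Demo::Server and the path /opt/jdk/bin/java are hypothetical and are not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical toy class: it only mimics the documented
# "class default, snapshotted at construction" behavior.
package Demo::Server;

my $Default_Java_Path = "java";

sub new {
    my ($class) = @_;
    # Copy the current default into the new object.
    return bless { java_path => $Default_Java_Path }, $class;
}

sub java_path {
    my ($invocant, @new) = @_;
    return $invocant->{java_path} if ref $invocant;   # object: frozen value
    $Default_Java_Path = $new[0] if @new;             # class: may reset default
    return $Default_Java_Path;
}

package main;

my $first = Demo::Server->new;
Demo::Server->java_path("/opt/jdk/bin/java");   # hypothetical path
my $second = Demo::Server->new;

print $first->java_path,  "\n";   # java
print $second->java_path, "\n";   # /opt/jdk/bin/java
```

With the real module, the documented equivalent is to call the accessor on Lingua::EN::BioLemmatizer itself before calling new; objects that already exist keep their earlier settings.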

child_pid()

Returns the pid of the BioLemmatizer server. Could be used to inspect the process status.
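
One way to inspect the process status, sketched here under assumptions, is Perl's built-in kill with signal 0, which tests whether a process can be signalled without actually sending anything. The helper process_is_alive is hypothetical, and the script's own pid ($$) stands in for the value that child_pid would return.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a process with this pid exists and we
# are allowed to signal it. Signal 0 performs the check without
# delivering a signal.
sub process_is_alive {
    my ($pid) = @_;
    return kill(0, $pid) ? 1 : 0;
}

# Our own pid stands in here for $server->child_pid.
print process_is_alive($$) ? "alive\n" : "gone\n";
```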

into_biolemmer()

(INTERNAL API) Returns the filehandle for writing to the BioLemmatizer server.

from_biolemmer()

(INTERNAL API) Returns the filehandle for reading from the BioLemmatizer server.

lemma_cache()

(INTERNAL API) Returns the hash ref used to cache the mapping of strings to lemmas.

EXAMPLES

Procedural example:

    use v5.10;    # enables say
    use Lingua::EN::BioLemmatizer qw(biolemma);

    my @words = qw(these broken pieces are phyla grandchildren);
    my @pairs = ("lives NNS", "lives VBZ");

    for my $word (@words, @pairs) {
        say "$word => ", biolemma($word);
    } 

OO example:

    use v5.10;    # enables say
    use Lingua::EN::BioLemmatizer;

    my @words = qw(these broken pieces are phyla grandchildren);
    my @pairs = ("lives NNS", "lives VBZ");

    # scope for private variable
    {
        my $server = new Lingua::EN::BioLemmatizer;

        for my $word (@words, @pairs) {
            say "$word => ", $server->get_biolemma($word);
        }
    }
    # server goes out of scope, so gets destroyed

ENVIRONMENT

The following environment variables are used by this module:

BIOLEMMATIZER

If set, holds the path to the BioLemmatizer jarfile. If unset, the jarfile used defaults to the file ./biolemmatizer-core-1.0-jar-with-dependencies.jar in the process's current working directory.
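
The lookup described above can be checked before the module is loaded. The sketch below is an illustration of the documented rule, not the module's code: resolve_jar_path is a hypothetical helper, and "set" is assumed to mean that the variable has a defined value.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented lookup: use
# $ENV{BIOLEMMATIZER} if set, else the default in the current directory.
sub resolve_jar_path {
    my $default = "./biolemmatizer-core-1.0-jar-with-dependencies.jar";
    return defined $ENV{BIOLEMMATIZER} ? $ENV{BIOLEMMATIZER} : $default;
}

my $jar = resolve_jar_path();
warn "BioLemmatizer jarfile not found at $jar\n" unless -e $jar;
```

Running a check like this first turns a missing jarfile into a clear warning instead of a failure when the Java process is started.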

BUGS

None known.

RELEASE HISTORY

April 18, 2012

Initial public release.

AUTHOR

Tom Christiansen <tchrist@perl.com>

COPYRIGHT AND LICENCE

Copyright 2012 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.