NAME

Lingua::NL::FactoidExtractor - A tool for extracting factoids from Dutch texts

SYNOPSIS

    use strict;
    use lib "./lib";
    use Lingua::NL::FactoidExtractor;

    my $inputfile = "alpino.xml";   # Alpino XML output for the parsed text
    my $verbose   = 1;              # boolean: true for verbose output
    my $factoids  = extract($inputfile, $verbose);

    print "$factoids\n";

PREREQUISITES

This module requires the Dutch parser Alpino. Alpino is available under the conditions of the GNU Lesser General Public License. See the Alpino home page.

DESCRIPTION

Around 30% of the clauses in Wikipedia are passive, and in many cases a person is referred to by a pronoun. We want "A number of family members were painted by Rembrandt" to yield the same factoid as "Rembrandt painted a number of family members", and "Rembrandt, who painted Biblical scenes" to yield the same factoid as "Rembrandt painted Biblical scenes". To handle such cases, the factoid extractor applies a number of transformations to the input clauses. In the examples below, each factoid is printed as subject|verb|object|modifiers. The following transformations are implemented:

  • Passive-to-active: Passive clauses are transformed to active clauses, in which the subject from the passive clause takes the object position. If there is no actor in the sentence, the subject slot is filled with the empty actor 'MEN' (ONE).
    "De luchthaven werd op 8 juli 1964 geopend"
    The airport was opened on July 8th, 1964
    MEN|open|de luchthaven|op 8 juli 1964
  • Modifier-to-subject: If a passive clause contains a modifier starting with 'door' (by) then this modifier is moved to the subject slot, e.g.
    "De instrumenten werden opnieuw ingespeeld door de bandleden"
    The instruments were recorded again by the band members
    de bandleden|speel_in|de instrumenten|opnieuw
  • Copula-to-definition: If the verb of a clause is a copular verb (e.g. become), then the object of the clause is considered to be a description of the subject. These factoids are transformed to definitions with the verb IS.
    "Rome werd opnieuw de hoofdstad van Italië"
    Rome became the capital of Italy again
    Rome|IS|de hoofdstad van Italië|opnieuw
  • Double-object-to-definition: For clauses that have two objects, a factoid is generated that connects both objects, e.g.
    "De behandeling van Crohn wordt symptomatisch genoemd"
    The treatment of Crohn's disease is called symptomatic
    de behandeling van Crohn|IS|symptomatisch|
  • Pron-to-np: If the subject or object of a clause is a relative pronoun, then we replace it with the most recent noun phrase. This is a very local form of anaphora resolution.
    "De voornaamste vertegenwoordiger was Rembrandt, die veel Bijbelse taferelen schilderde."
    The main representative was Rembrandt, who painted many Biblical scenes.
    de voornaamste vertegenwoordiger|IS|Rembrandt
    Rembrandt|schilder|veel Bijbelse taferelen|
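
The passive-to-active and modifier-to-subject rules above can be illustrated with a minimal sketch. It is not the module's implementation: the module works on Alpino XML, whereas this example assumes a hypothetical, simplified clause record (a hash with verb, subject, object, modifiers and a passive flag). The helper names to_active and format_factoid are also assumptions made for illustration, as is the ' & ' separator between multiple modifiers.

```perl
use strict;
use warnings;

# Hypothetical clause record (not the module's internal format):
#   verb, subject, object, modifiers (array ref), passive (boolean).
sub to_active {
    my ($clause) = @_;
    return { %$clause } unless $clause->{passive};

    my $actor = 'MEN';    # empty actor, used when no 'door' phrase is found
    my @rest;
    for my $mod (@{ $clause->{modifiers} }) {
        if ($actor eq 'MEN' && $mod =~ /^door\s+(.+)$/) {
            $actor = $1;             # modifier-to-subject
        }
        else {
            push @rest, $mod;        # all other modifiers are kept
        }
    }

    return {
        verb      => $clause->{verb},
        subject   => $actor,
        object    => $clause->{subject},   # passive subject becomes the object
        modifiers => \@rest,
        passive   => 0,
    };
}

# Print a clause as subject|verb|object|modifiers.
sub format_factoid {
    my ($c) = @_;
    my $object = defined $c->{object} ? $c->{object} : '';
    return join '|', $c->{subject}, $c->{verb}, $object,
        join(' & ', @{ $c->{modifiers} });
}

my $passive = {
    verb      => 'speel_in',
    subject   => 'de instrumenten',
    object    => undef,
    modifiers => [ 'opnieuw', 'door de bandleden' ],
    passive   => 1,
};

print format_factoid(to_active($passive)), "\n";
# prints: de bandleden|speel_in|de instrumenten|opnieuw
```

Without a 'door' modifier, the same sketch falls back to the empty actor MEN, as in the airport example.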


For sentences that consist of multiple clauses, multiple factoids are generated, e.g.

"Voor de onafhankelijkheid was Bangalore een belangrijke industriestad; meer recent is het een belangrijk centrum van de informatietechnologie in India geworden en wordt het wel de Silicon Valley van India genoemd."
Before independence, Bangalore was an important industrial town; more recently it has become an important centre of information technology in India and it is called the Silicon Valley of India.

Bangalore|IS|een belangrijke industriestad|Voor de onafhankelijkheid
het|IS|een belangrijk centrum van de informatietechnologie in India|meer recent
MEN|noem|het & de Silicon Valley van India|meer recent & wel
het|IS|de Silicon Valley van India
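
For downstream processing, factoid lines like the ones above can be split into fields. The following is a sketch under stated assumptions: that the string returned by extract contains one factoid per line, that the fields are subject, verb, object and modifiers in that order, and that ' & ' separates multiple objects or modifiers (as the third line above suggests). The function name parse_factoids is hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Split factoid text into a list of hash refs, one per non-empty line.
sub parse_factoids {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;    # skip blank lines

        # Limit of 4 keeps an empty trailing modifier field intact.
        my ($subject, $verb, $object, $modifiers) = split /\|/, $line, 4;
        $object    = '' unless defined $object;
        $modifiers = '' unless defined $modifiers;

        push @records, {
            subject   => $subject,
            verb      => $verb,
            objects   => [ split /\s+&\s+/, $object ],
            modifiers => [ split /\s+&\s+/, $modifiers ],
        };
    }
    return @records;
}

my $factoids = join "\n",
    'Bangalore|IS|een belangrijke industriestad|Voor de onafhankelijkheid',
    'MEN|noem|het & de Silicon Valley van India|meer recent & wel',
    'het|IS|de Silicon Valley van India';

for my $f (parse_factoids($factoids)) {
    printf "%s -[%s]-> %s\n", $f->{subject}, $f->{verb},
        join(', ', @{ $f->{objects} });
}
```

Keeping objects and modifiers as lists makes the double-object case explicit instead of leaving it encoded in the string.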

KNOWN ISSUES

If punctuation such as a full stop or a comma is attached to a word in the Alpino output, then this punctuation also ends up in the factoids extracted from the sentence. A workaround is to run a tokenizer that separates punctuation from words with whitespace before parsing the sentence.
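
As a rough illustration of this workaround, the sketch below detaches sentence punctuation from the preceding word before the text is handed to the parser. It is an assumption-laden stand-in for a real tokenizer: the function name separate_punctuation is hypothetical, only punctuation followed by whitespace or the end of the string is handled, and abbreviations such as "e.g." would be split incorrectly.

```perl
use strict;
use warnings;

# Insert a space before punctuation that is attached to the end of a word.
# Punctuation inside a token (e.g. the dot in "8.5") is left untouched.
sub separate_punctuation {
    my ($sentence) = @_;
    $sentence =~ s/(?<=\S)([.,;:!?]+)(?=\s|$)/ $1/g;
    return $sentence;
}

print separate_punctuation(
    'De voornaamste vertegenwoordiger was Rembrandt, die veel Bijbelse taferelen schilderde.'
), "\n";
# prints: De voornaamste vertegenwoordiger was Rembrandt , die veel Bijbelse taferelen schilderde .
```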

AUTHOR

Suzan Verberne, http://sverberne.ruhosting.nl

COPYRIGHT AND LICENSE

Copyright (C) 2012 by Suzan Verberne

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.7 or, at your option, any later version of Perl 5 you may have available.

CREDITS

This work was funded by Google by means of a European Digital Humanities Award.