
NAME

nlquestion2sparqlquery - Perl script for converting natural language questions into SPARQL queries

SYNOPSIS

nlquestion2sparqlquery [option] --input <FILENAME>

OPTIONS AND ARGUMENTS

  • --input=filename, -i filename

    This option defines the input file to load. If the filename is - (or the option is not specified), the input data is read from STDIN.

  • --output=filename, -o filename

    This option defines the output file to write. If the filename is - (or the option is not specified), the output data is printed to STDOUT.

  • --rcfile=file, -c file

    Load the given configuration file.

  • --answer, -a

    This option specifies whether the answers are returned (otherwise, the SPARQL query is returned).

  • --format [XML|SPARQL], -f [XML|SPARQL]

    This option defines the format of the output:

    • XML: the output is in XML, as required by the QALD challenge

    • SPARQL: the output is the SPARQL query or the list of answers

  • --help

    Print help message for using nlquestion2sparqlquery

  • --man

    Print man page of nlquestion2sparqlquery

  • --verbose, -v

    Switch to verbose mode. The verbosity can be increased by repeating the option.

  • --debug, -D

    Switch to debug mode for the script nlquestion2sparqlquery (the switch has no influence on the object code).

DESCRIPTION

This script aims to query an RDF knowledge base with questions expressed in natural language. Natural language questions are converted into SPARQL queries. The method is based on rules and resources. Resources are provided for querying Drugbank (<http://www.drugbank.ca>), Diseasome (<http://diseasome.eu>) and Sider (<http://sideeffects.embl.de>).

The natural language question must already be annotated with linguistic and semantic information. The input file provides this information (see the section INPUT FORMAT for details on the format).

If you use this software, please cite:

Natural Language Question Analysis for Querying Biomedical Linked Data. Thierry Hamon, Natalia Grabar, and Fleur Mougin. Natural Language Interfaces for Web of Data (NLIWod 2014). 2014. To appear.

EXAMPLES OF USE

To run the script, a configuration file is needed (usually nlquestion.rc in /etc/nlquestion; see the section CONFIGURATION FILE FORMAT for more details). An example of the configuration file is available in etc/nlquestion/nlquestion.rc from the archive directory.

  • The most common command line to run nlquestion2sparqlquery is

    nlquestion2sparqlquery -i example1.qald

    It is assumed that the directory containing the program nlquestion2sparqlquery is in your PATH variable and that the configuration file is /etc/nlquestion/nlquestion.rc.

    The SPARQL query is printed to STDOUT in QALD XML format.

  • If you are not allowed to copy the configuration file nlquestion.rc into the directory /etc/nlquestion (or to create this directory), or if you want to use your own configuration file, you can specify the file and its path with the option --rcfile

    nlquestion2sparqlquery --rcfile nlquestion2.rc -i example1.qald

  • You can also change the output format and write the results to a file

    nlquestion2sparqlquery --rcfile nlquestion2.rc -i example1.qald -f SPARQL -a -o example1.out

INPUT FORMAT

The input file is composed of several parts providing linguistic and semantic information on the natural language question:

  • the identifier of the question is introduced by DOC: on one line. For instance:

     DOC: question1

    The end of the information associated with the document is marked by the keyword _END_DOC_ .

  • the language of the question is defined with language: on one line. For instance:

     language: EN

  • the list of sentences is introduced by the keyword sentence: and ends with the keyword _END_SENT_ (each on a line of its own). For instance:

     sentence:
     Which diseases is Cetuximab used for?
     _END_SENT_

  • the morpho-syntactic information associated with each word is introduced by the keyword word information: and ends with the keyword _END_POSTAG_ (each on a line of its own). Each line contains four tab-separated fields: the inflected form of the word, its part-of-speech tag, its lemma and its offset (in number of characters). For instance:

     word information:
     Which  WDT     which   10      
     diseases       NNS     disease 16      
     is     VBZ     be      25      
     Cetuximab      VBN     Cetuximab       28      
     used   VBN     use     38      
     for    IN      for     43      
     ?      SENT    ?       46      
     _END_POSTAG_

  • the semantic entities and their associated semantic information are introduced by the keyword semantic units: and end with the keyword _END_SEM_UNIT_ (each on a line of its own). Each line contains five tab-separated fields: the semantic entity, its canonical form, its semantic types (separated by colons), its start offset and its end offset (in number of characters). For instance:

     semantic units:
     # term form<tab>term canonical form<tab>semantic features<tab>offset start<tab>offset end (ended by _END_SEM_UNIT_)
     diseases       diseas  disease:disease 16      23
     Cetuximab      Cetuximab       drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002     28      36
     used for       used for        possibleDrug:possibleDrug       38      45
     Cetuximab      Cetuximab       drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002     28      36
     diseases       diseas  disease:disease 16      23
     used for       used for        possibleDrug:possibleDrug       38      45
     _END_SEM_UNIT_

    Semantic types can be decomposed into subtypes. They are coded in the same way as a Unix file path.
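
The encoding of semantic types can be illustrated with a short Python sketch that splits the third field of a semantic unit line into its types and subtypes. The function name split_semantic_types is hypothetical and is not part of nlquestion2sparqlquery; the sketch only assumes the separators shown in the example above (a colon between types, a slash between subtypes).

```python
def split_semantic_types(field):
    """Split a semantic-types field into a list of subtype paths.

    Assumption: types are separated by ':' and each type is a
    '/'-separated path, as in the 'semantic units' example.
    """
    return [sem_type.split("/") for sem_type in field.split(":") if sem_type]


# Third field of the Cetuximab line from the example above.
types = split_semantic_types("drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002")
for path in types:
    print(" > ".join(path))
```

Each type is returned as a list of path components, so the most general type is the first element and the most specific subtype is the last.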

NB: Comments are introduced by the character #. Empty lines are ignored.

Example files are provided with the archive.
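
To make the layout of the input file concrete, here is a minimal Python sketch of a reader for the format described in this section. It is an illustration, not the parser used by nlquestion2sparqlquery: the function name parse_question and the dictionary keys are hypothetical, and the sketch assumes a single question per input, with _END_DOC_ appearing after all the other parts.

```python
# Section keyword -> (closing keyword, key used in the result dictionary).
SECTIONS = {
    "sentence:": ("_END_SENT_", "sentences"),
    "word information:": ("_END_POSTAG_", "words"),
    "semantic units:": ("_END_SEM_UNIT_", "semantic_units"),
}


def parse_question(text):
    """Parse one annotated question into a dictionary (illustrative only)."""
    doc = {"id": None, "language": None,
           "sentences": [], "words": [], "semantic_units": []}
    current = None  # keyword of the section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments are ignored
        if current is not None:
            end_keyword, key = SECTIONS[current]
            if line == end_keyword:
                current = None
            elif key == "sentences":
                doc[key].append(line)
            else:
                doc[key].append(line.split("\t"))  # tab-separated fields
        elif line in SECTIONS:
            current = line
        elif line.startswith("DOC:"):
            doc["id"] = line[len("DOC:"):].strip()
        elif line.startswith("language:"):
            doc["language"] = line[len("language:"):].strip()
        elif line == "_END_DOC_":
            break
    return doc


SAMPLE = "\n".join([
    "DOC: question1",
    "language: EN",
    "sentence:",
    "Which diseases is Cetuximab used for?",
    "_END_SENT_",
    "word information:",
    "Which\tWDT\twhich\t10",
    "diseases\tNNS\tdisease\t16",
    "_END_POSTAG_",
    "semantic units:",
    "# a comment line inside a section",
    "diseases\tdiseas\tdisease:disease\t16\t23",
    "_END_SEM_UNIT_",
    "_END_DOC_",
])

question = parse_question(SAMPLE)
print(question["id"], question["language"], question["sentences"])
```

Offsets are kept as strings in this sketch; a real consumer would probably convert them to integers.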

CONFIGURATION FILE FORMAT

The configuration file format is similar to the Apache configuration format, and the file is read with the module Config::General. It contains one section named NLQUESTION per language (identified by the attribute language). Each section sets the following variables, which control the behaviour of the script:

  • VERBOSE: defines the verbosity level, like the option --verbose. The command-line option overrides this value.

  • REGEXFORM: this boolean variable indicates which form is used when a regex is used: the inflected form (value 1) or the canonical form (value 0).

  • UNION: this boolean variable indicates whether the union is used.

  • SEMANTICTYPECORRESPONDANCE: this variable defines the file containing the semantic information (rewriting rules, semantic correspondences, etc.) used to generate the SPARQL queries.

  • URL_PREFIX: specifies the beginning of the URL (before the SPARQL query) when the query is sent to a Virtuoso server.

  • URL_SUFFIX: specifies the end of the URL (after the SPARQL query) when the query is sent to a Virtuoso server.
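
Read together, URL_PREFIX and URL_SUFFIX suggest that the request URL is formed by placing the SPARQL query between the two values. The Python sketch below shows one way this could work. It is an assumption, not the script's actual code: the function name build_query_url, the endpoint address, the parameter names and the percent-encoding of the query are all hypothetical choices made for the illustration.

```python
from urllib.parse import quote


def build_query_url(url_prefix, sparql_query, url_suffix):
    """Concatenate prefix, percent-encoded query and suffix into one URL."""
    # Assumption: the query is percent-encoded before being embedded.
    return url_prefix + quote(sparql_query, safe="") + url_suffix


# Hypothetical values: the endpoint and its parameters are placeholders.
prefix = "http://localhost:8890/sparql?query="
suffix = "&format=application%2Fsparql-results%2Bjson"
url = build_query_url(prefix, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1", suffix)
print(url)
```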

SEE ALSO

QALD challenge web page: <http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=task2&q=4>

Natural Language Question Analysis for Querying Biomedical Linked Data. Thierry Hamon, Natalia Grabar, and Fleur Mougin. Natural Language Interfaces for Web of Data (NLIWod 2014). 2014. To appear.

AUTHOR

Thierry Hamon, <hamon@limsi.fr>

COPYRIGHT AND LICENSE

Copyright (C) 2014 Thierry Hamon

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.