Lingua::Thesaurus - Thesaurus management
my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);
$thesaurus->load($io_class => @files);
$thesaurus->load($io_class => {files  => \@files,
                                params => {termClass    => ..,
                                           relTypeClass => ..}});
my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);
my @terms = $thesaurus->search_terms('*foo*');
my $term  = $thesaurus->fetch_term('foobar');

my $scope_note = $term->SN;
my @synonyms   = $term->UF;

foreach my $pair ($term->related(qw/NT RT/)) {
  my ($rel_type, $item) = @$pair;
  printf "  %s(%s) = %s\n", $rel_type->description, $rel_type->rel_id, $item;
}

# transitive search
foreach my $quadruple ($term->transitively_related(qw/NT/)) {
  my ($rel_type, $related_term, $through_term, $level) = @$quadruple;
  printf "  %s(%s): %s (through %s)\n", $rel_type->rel_id,
                                        $level,
                                        $related_term->string,
                                        $through_term->string;
}
This distribution manages thesauri. A thesaurus is a list of terms with relations between them, such as "broader term" / "narrower term".
Thesauri are loaded from one or several IO formats; usually this will be the ISO 2788 format, or a derivative of it. See classes under the Lingua::Thesaurus::IO namespace for various implementations.
Once loaded, thesauri are stored via a storage class; this is meant to be an efficient internal structure for supporting searches. Currently, only Lingua::Thesaurus::Storage::SQLite is implemented; but the architecture allows for other storage classes to be defined, as long as they comply with the Lingua::Thesaurus::Storage role.
Terms are retrieved through the "search_terms" and "fetch_term" methods. The results are instances of Lingua::Thesaurus::Term; these objects have navigation methods for retrieving related terms.
This distribution was originally targeted for dealing with the Swiss thesaurus for justice "Jurivoc" (see Lingua::Thesaurus::IO::Jurivoc). However, the framework should be easily extensible to other needs. Other Perl modules for thesauri are briefly discussed below in the "SEE ALSO" section.
Side note: another motivation for writing this distribution was to experiment with Moose meta-programming possibilities. Subclasses of Lingua::Thesaurus::Term are created dynamically to implement the relation methods NT, BT, etc.; see the Lingua::Thesaurus::Storage source code.
Caveat: at the moment, IO classes only implement loading and searching; methods for editing and dumping a thesaurus will be added in a future version.
my $thesaurus = Lingua::Thesaurus->new($storage_class => @storage_args);
Instantiates a thesaurus on a given storage. The $storage_class will be automatically prefixed by Lingua::Thesaurus::Storage::, unless the classname begins with '+'. The remaining arguments are passed to the storage class. Since Lingua::Thesaurus::Storage::SQLite is the default storage class supplied with this distribution, thesauri are usually opened as:
my $dbname    = '/path/to/some/file.sqlite';
my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);
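The prefixing rule described above can be sketched in a few lines of Perl. This is an illustration only, not the distribution's actual code: the helper name resolve_class and the class My::Storage are hypothetical, and the sketch assumes that a leading '+' is simply stripped to obtain a fully qualified class name.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: maps a short class name to a
# full one, following the rule described in the text.
sub resolve_class {
    my ($prefix, $name) = @_;

    # A leading '+' marks a fully qualified name: strip it, add no prefix.
    return $name if $name =~ s/^\+//;

    # Otherwise prepend the namespace prefix.
    return $prefix . $name;
}

print resolve_class('Lingua::Thesaurus::Storage::', 'SQLite'), "\n";
print resolve_class('Lingua::Thesaurus::Storage::', '+My::Storage'), "\n";
```

Under these assumptions the first call yields Lingua::Thesaurus::Storage::SQLite and the second yields My::Storage.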
$thesaurus->load($io_class => @files);
$thesaurus->load($io_class => {files  => \@files,
                                params => {termClass    => ..,
                                           relTypeClass => ..}});
Populates a thesaurus database with data from thesaurus dumpfiles. The job of parsing these files is delegated to an IO subclass, given as first argument. The $io_class will be automatically prefixed by Lingua::Thesaurus::IO::, unless the classname begins with '+'. The remaining arguments are passed to the IO class; the simplest form is just a list of dumpfiles. See IO subclasses in the Lingua::Thesaurus::IO namespace for more details.
my @terms = $thesaurus->search_terms($pattern);
Searches the term database according to $pattern, where the pattern may contain '*' to mean word completion.
The interpretation of patterns depends on the storage engine; by default, this is implemented using SQLite's "LIKE" function (see http://www.sqlite.org/lang_expr.html#like). Characters '*' in the pattern are translated into '%' for the LIKE function to work as expected.
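As a rough sketch of the translation just described, the following helper replaces every '*' with '%'. The name pattern_to_like is hypothetical, and the storage class may well do more than this; in particular, the sketch does nothing about literal '%' or '_' characters in the pattern, which LIKE also treats as wildcards.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a search pattern into a LIKE pattern by
# replacing every '*' with '%'.
sub pattern_to_like {
    my ($pattern) = @_;
    (my $like = $pattern) =~ tr/*/%/;   # copy, then transliterate the copy
    return $like;
}

print pattern_to_like('*foo*'), "\n";   # prints %foo%
print pattern_to_like('sci*'),  "\n";   # prints sci%
```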
It is also possible to configure the storage to use fulltext searches, so that a pattern such as 'sci*' would also match 'computer science'; see "use_fulltext" in Lingua::Thesaurus::Storage::SQLite.
If $pattern is empty, the method returns the list of all terms in the thesaurus.
Results are instances of Lingua::Thesaurus::Term.
my $term = $thesaurus->fetch_term($term_string);
Retrieves a specific term and returns an instance of Lingua::Thesaurus::Term (or undef if the term is unknown).
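Because an unknown term yields undef, callers will usually want to check the result before calling methods on it. The helper below is a minimal sketch of that check; the name describe_term is hypothetical, and it relies only on the fetch_term, string and UF methods shown in the synopsis. It tests with defined rather than plain truthiness, on the assumption that a term object could stringify to a false value.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a one-line summary of a term, or a
# "no such term" message when fetch_term() returns undef.
sub describe_term {
    my ($thesaurus, $term_string) = @_;

    my $term = $thesaurus->fetch_term($term_string);
    return "no such term: $term_string" unless defined $term;

    my @synonyms = $term->UF;   # synonyms, as in the synopsis
    return sprintf "%s (%d synonym(s))", $term->string, scalar @synonyms;
}
```

It would be called as describe_term($thesaurus, 'foobar') on a thesaurus opened as shown earlier.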
Returns the list of ids of relation types stored in this thesaurus (i.e. 'NT', 'RT', etc.).
my $rel_type = $thesaurus->fetch_rel_type($rel_type_id);
Returns the Lingua::Thesaurus::RelType object corresponding to $rel_type_id.
Returns the internal object playing role Lingua::Thesaurus::Storage.
More details can be found in the various implementation classes:
Lingua::Thesaurus::IO: Role for input/output operations on a thesaurus
Lingua::Thesaurus::IO::ISO2788: IO class for ISO thesauri (not implemented yet)
Lingua::Thesaurus::IO::Jurivoc: IO class for "Jurivoc", the Swiss thesaurus for justice
Lingua::Thesaurus::IO::LivelinkCollectionServer: IO class for Livelink Collection Server thesaurus files
Lingua::Thesaurus::RelType: Relation type in a thesaurus
Lingua::Thesaurus::Storage: Role for thesaurus storage
Lingua::Thesaurus::Storage::SQLite: Thesaurus storage in an SQLite database
Lingua::Thesaurus::Term: Parent class for thesaurus terms
Here is a brief review of some other thesaurus modules on CPAN:
Thesaurus has several backend implementations (CSV, BerkeleyDB, DBI), but it just handles synonyms (a single relation between terms).
Text::Thesaurus::ISO is quite old (1998), uses obsolete technology (dbmopen), and has a fixed number of relations, some of which are apparently targeted to the specific needs of UK electronic libraries.
Biblio::Thesaurus has a rich set of features, not only for reading and searching, but also for editing and exporting a thesaurus. Storage is directly in hashes in memory; those can be saved into files in Storable format. The set of relations is flexible; it is read from the ISO dumpfiles. If it fits your needs directly, it's probably a good choice; but adapting or extending it is not straightforward, because all features are mingled into one monolithic module.
Biblio::Thesaurus::SQLite has an unclear status: it sits in the same namespace as Biblio::Thesaurus, and refers to it in the source code, but neither inherits from it nor calls its methods. A separate API is provided for storing some thesaurus data into an SQLite database; but the full features of Biblio::Thesaurus are absent.
Laurent Dami, <dami at cpan.org>
Please report any bugs or feature requests to bug-lingua-thesaurus at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Thesaurus. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Lingua::Thesaurus
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Thesaurus
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Lingua-Thesaurus
CPAN Ratings
http://cpanratings.perl.org/d/Lingua-Thesaurus
Search MetaCPAN
https://metacpan.org/module/Lingua::Thesaurus
Copyright 2013 Laurent Dami.
This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License (2.0). You may obtain a copy of the full license at:
http://www.perlfoundation.org/artistic_license_2_0
The test suite contains a short excerpt from the Swiss Jurivoc thesaurus, copyright 1999-2012 Tribunal fédéral Suisse (see http://www.bger.ch/fr/index/juridiction/jurisdiction-inherit-template/jurisdiction-jurivoc-home.htm).
- support for multiple thesauri files (a term belongs to one-to-many thesaurus files; a relation belongs to exactly one thesaurus file)
- use_unaccent without fulltext ==> use collation sequence or redefine LIKE
- store thesaurus name for each term => adapt search_terms($pattern, $thes_name);