NAME

Dezi::Tutorial - getting started with the Dezi search platform

Installation

Install the Dezi server from CPAN:

 % cpan -i Dezi
 

Install the Dezi client from CPAN:

 % cpan -i Dezi::Client

Beginner - Hello World

Start the Dezi server:

 % dezi

In a separate terminal, add a small test document to the index:

 % echo '<doc><title>bar</title>hello world</doc>' > test.xml
 % dezi-client test.xml

Search the index to confirm your test document was indexed:

 % dezi-client -q bar
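
To try the same round trip with more than one document, a loop along these lines might help. It is only a sketch: the hello-docs directory, the file names and the three sample words are made up for illustration, and the indexing step is skipped when dezi-client is not on your PATH. It assumes a Dezi server is already running, as in the steps above.

```shell
#!/bin/sh
# Write three tiny XML documents, one per word, into a scratch directory.
# The directory name and file names are hypothetical.
mkdir -p hello-docs
for word in alpha beta gamma; do
    printf '<doc><title>%s</title>hello %s</doc>\n' "$word" "$word" \
        > "hello-docs/$word.xml"
done

# Send each document to the server, but only if the client is installed.
if command -v dezi-client >/dev/null 2>&1; then
    for file in hello-docs/*.xml; do
        dezi-client "$file"
    done
fi
```

A query such as dezi-client -q beta is then a quick way to check that the new documents can be found.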
 

Intermediate - The Dezi Demo

This section walks through the setup behind the Dezi demo available at http://dezi.org/demo.

Download the Reuters corpus

The Reuters News Corpus for Text Classification (Reuters-21578) is a common document corpus used for information retrieval projects. Other document collections (e.g. the Wikipedia database) have become more popular since the Reuters corpus first appeared, but the Reuters corpus is a nice, medium-sized collection for demonstrating Dezi.

You can find the corpus in many places on the internet. The version used for the demo came from http://svn.peknet.com/search_bench/. The 2xml.pl script at that URL will convert the original SGML documents to valid XML and split them into about 21k individual documents.

Unpack the tar.gz file somewhere and run the 2xml.pl script as described in the script's comments.
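
Once the script has finished, a quick count can confirm that the split produced roughly the 21k files mentioned above. The following is a minimal sketch, assuming the converted documents ended up in a directory called reuters (the name used later in this tutorial); the count_docs helper is made up for this example.

```shell
#!/bin/sh
# count_docs DIR: print the number of regular files found under DIR.
count_docs() {
    find "$1" -type f | wc -l | tr -d ' '
}

# "reuters" is an assumed location; change it to wherever 2xml.pl wrote
# its output.
docs_dir=reuters
if [ -d "$docs_dir" ]; then
    echo "$docs_dir contains $(count_docs "$docs_dir") files"
else
    echo "$docs_dir not found; run 2xml.pl first" >&2
fi
```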

Create a Swish3 configuration file

As described in Dezi::Architecture, Dezi is based on Swish3 (http://swish3.dezi.org/). You can index the Reuters corpus with the deziapp command that comes with Dezi::App (one of the Dezi dependencies).

First, you'll need a configuration file. Here's the one used for the Dezi demo:

 DefaultContents XML*
 StoreDescription XML* <text> 10000
 PropertyNameAlias swishtitle title
 MetaNames dates topics people places orgs author swishdocpath
 PropertyNames dates topics people places orgs author dateline
 FuzzyIndexingMode Stemming_en1

Save the file as dezi.conf.

More details on Swish3 configuration can be found at http://swish-e.org/docs/swish-config.html.

Index the XML

If your Reuters docs are in a directory called reuters, you can create an index with a command like:

 % deziapp -c dezi.conf -i reuters
 

You can index all kinds of document types, not just XML, but for the purposes of this tutorial, we'll keep it simple.

Create a Dezi configuration file

Here are the contents of the demo config file, named dezi.config.pl:

 {
    engine_config => {
        facets => { 
            names => [qw( topics people places orgs author )] 
        },
    },
    ui_class    => 'Dezi::UI',
    base_uri    => 'http://dezi.org/demo',
    username    => 'deziuser',
    password    => 'a-secret',
 }

NOTE that the username and password are there to prevent unwanted modification of the index. Since Dezi supports POST, PUT and DELETE HTTP actions on an index, it's a good idea to protect an index, particularly if it is on the open internet.

NOTE too that the Dezi::UI class is enabled, which requires a separate installation from CPAN:

 % cpan -i Dezi::UI

Start the Dezi server

 % dezi --dezi-config dezi.config.pl

From a separate terminal, you can search the index containing text from the Reuters corpus:

 % dezi-client -q 'some words'

Thanks to the Dezi::UI module, you can also search via a web browser. Assuming you are running the demo on a local machine, you can point your browser at http://localhost:5000/ui and explore the index contents graphically.
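
If you would rather look at the raw server response than use dezi-client or the browser UI, a plain HTTP request may be enough. The sketch below rests on assumptions not stated in this tutorial: that the server answers GET requests at /search with the query in a q parameter (Dezi's default setup, as far as the author of this example knows), and that curl is installed. The dezi_search function and the DEZI_SERVER variable are hypothetical names.

```shell
#!/bin/sh
# dezi_search QUERY: send QUERY to the server's search endpoint and print
# the raw response. The /search path and the q parameter are assumptions;
# adjust them if your server is configured differently.
dezi_search() {
    curl -s -G "${DEZI_SERVER:-http://localhost:5000}/search" \
        --data-urlencode "q=$1"
}

# Run a query only when one is given on the command line.
if [ "$#" -gt 0 ]; then
    dezi_search "$1"
fi
```

Letting curl encode the query with --data-urlencode avoids hand-escaping spaces in multi-word searches such as 'some words'.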

Advanced - Roll Your Own

Write your own client application

 % cat indexer.pl

 #!/usr/bin/env perl
 use strict;
 use warnings;
 
 use Dezi::Client;
 use File::Find;
 
 my $client = Dezi::Client->new( 
    server => 'http://localhost:5000' 
 );
 
 find({ 
    wanted      => \&add_to_index, 
    follow      => 1, 
    no_chdir    => 1,
 }, @ARGV);
 
 my $resp = $client->commit();

 print $resp->content;

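 # File::Find calls this once for every file and directory it visits
 sub add_to_index {
    my $file = $File::Find::name;
    
    # we only want regular files ending in .xml
    return unless -f $file && $file =~ m/\.xml$/;
    
    # stop at the first document the server rejects
    my $resp = $client->index($file);
    if (!$resp->is_success) {
        die "Failed to index $file: " . $resp->status_line;
    }
 }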

Start your Dezi server

 % dezi 

Run your indexer

In a separate terminal:

 % perl indexer.pl path/to/xml/docs
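
Before running the indexer against a large tree, it may help to preview which files it would pick up. The sketch below approximates the script's selection rule (follow symlinks, keep names ending in .xml) with find. The list_xml_docs function is a made-up name, and the find expression is an approximation rather than an exact equivalent of File::Find's traversal.

```shell
#!/bin/sh
# list_xml_docs DIR...: print every regular file ending in .xml under the
# given directories, following symlinks as indexer.pl does (follow => 1).
list_xml_docs() {
    find -L "$@" -type f -name '*.xml'
}

# Preview the selection when directories are passed on the command line.
if [ "$#" -gt 0 ]; then
    list_xml_docs "$@" | sort
fi
```

Piping the output through wc -l gives a rough idea of how many index requests the script will make.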

Search with dezi-client

After you're done indexing, look for something:

 % dezi-client -q foo

AUTHOR

Peter Karman, <karman at cpan.org>

BUGS

Please report any bugs or feature requests to bug-dezi at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find this documentation with the perldoc command.

    perldoc Dezi::Tutorial

COPYRIGHT & LICENSE

Copyright 2011 Peter Karman.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.

SEE ALSO

Dezi::Client, Search::OpenSearch, SWISH::3, Dezi::App, Plack, Lucy