
NAME

text_compare.pl - Measure the similarity between files or strings

SYNOPSIS

 text_compare.pl --type Text::Similarity::Overlaps --normalize --string '.......this is one' '????this is two' 

 text_compare.pl --type Text::Similarity::Overlaps --no-normalize --string '.......this is one' '????this is two' 

 text_compare.pl --type Text::Similarity::Overlaps --string 'sir winston churchill' 'Churchill, Winston Sir' 

 text_compare.pl --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt

 text_compare.pl --verbose --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt 

 text_compare.pl --verbose --stoplist stoplist.txt --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt 

 text_compare.pl [[--verbose] [--stoplist=FILE] [--no-normalize] [--string] --type=TYPE | --help | --version] FILE1 FILE2

DESCRIPTION

This script is a simple command-line interface to the Text::Similarity Perl modules. At present, only one method of computing similarity is provided, Text::Similarity::Overlaps, although additional methods can be added. The output described below comes from Text::Similarity::Overlaps and could vary in the future as other similarity measures are added.

By default Text::Similarity::Overlaps returns a normalized F-measure between 0 and 1. Normalization can be turned off by specifying --no-normalize.

In verbose mode (specify --verbose on the command line), it can also return the cosine, E-measure, precision, and recall.

 precision = raw_score / length_file_2
 recall = raw_score / length_file_1
 F-measure = 2 * precision * recall / (precision + recall)
 E-measure = 1 - F-measure
 cosine = raw_score / sqrt (precision + recall) 
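
As a worked illustration of the precision, recall, F-measure and E-measure formulas above, here is a minimal Python sketch. It is not the module's own code: the function name overlap_measures, the zero-overlap guard and the input numbers are assumptions made for this example, and the cosine is left out.

```python
def overlap_measures(raw_score, length_file_1, length_file_2):
    """Derive precision, recall, F-measure and E-measure from a raw overlap score."""
    if raw_score == 0:
        # Assumption: with no overlap, report zero similarity instead of dividing by zero.
        return {"precision": 0.0, "recall": 0.0, "f_measure": 0.0, "e_measure": 1.0}
    precision = raw_score / length_file_2
    recall = raw_score / length_file_1
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "e_measure": 1 - f_measure,
    }


# Hypothetical inputs: 3 overlapping words, 4 words in file 1, 6 words in file 2.
scores = overlap_measures(3, 4, 6)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these inputs, precision is 3/6 = 0.5, recall is 3/4 = 0.75, the F-measure is 0.6 and the E-measure is 0.4.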

Files are treated as one long line of text.

Some cleaning of the text is performed automatically: most punctuation is removed, except embedded apostrophes and underscores, and all text is converted to lower case. This applies to both file and string input.
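
The cleaning step can be approximated as follows. This Python sketch only illustrates the rules described above; the exact rules used by Text::Similarity may differ. The function name clean_text is hypothetical, and the sketch assumes that every underscore is kept and that an apostrophe counts as embedded only when it has a word character on both sides.

```python
import re


def clean_text(text):
    """Lower-case the text, strip most punctuation, and return the list of words."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace or an apostrophe.
    # Underscores count as word characters, so they survive this step.
    text = re.sub(r"[^\w\s']", " ", text)
    # Assumption: drop apostrophes that are not embedded between two word characters.
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)
    return text.split()


print(clean_text("Churchill, Winston Sir"))
print(clean_text(".......this is one"))
print(clean_text("Don't 'quote' snake_case"))
```

Under these assumptions, 'Churchill, Winston Sir' and 'sir winston churchill' reduce to the same three words in a different order.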

OPTIONS

--type=TYPE

The type of text similarity measure. Valid values include:

    Text::Similarity::Overlaps

--stoplist=FILE

The name of a file containing stop words (one word per line).
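
To illustrate the one-word-per-line format, the Python sketch below reads a stop list and filters a word list with it. The helper names load_stoplist and remove_stop_words and the file contents are invented for this example. The sketch assumes that blank lines are ignored and that matching is done on lower-cased words; it does not show at which stage the module itself applies the stop list.

```python
import io


def load_stoplist(lines):
    """Collect stop words from an iterable of lines, one word per line."""
    # Assumption: blank lines are skipped and words are lower-cased.
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop list, preserving order."""
    return [token for token in tokens if token not in stop_words]


# Stand-in for open("stoplist.txt"); the contents are invented for illustration.
stoplist_file = io.StringIO("this\nis\n\nthe\n")
stop_words = load_stoplist(stoplist_file)
print(remove_stop_words(["this", "is", "one"], stop_words))
```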

--no-normalize

Do not normalize scores. Normally, scores are normalized so that they range from 0 to 1. Using this option will give you a raw score instead.

--string

Input will be provided on the command line as strings, not files.

--verbose

Show all the matches that are found between the files, their length and frequency, as well as precision, recall, F-measure, E-measure, and cosine.

--help

Show a detailed help message.

--version

Show version information.

AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Jason Michelizzi

Last modified by: $Id: text_compare.pl,v 1.15 2008/04/04 18:30:17 tpederse Exp $

BUGS

--compfile is not working; it seems to cause a hang (tdp 3/21/08)

COPYRIGHT AND LICENSE

Copyright (C) 2004-2008, Jason Michelizzi and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA