
NAME - driver for running wsd experiments

SYNOPSIS {--type=MEASURE | --sense1 | --random} --basename=outputfile {--semcor DIR | --file FILE [FILE ...]} [--config=FILE] [--window=INT] [--stoplist=FILE] [--contextScore NUM] [--pairScore NUM] [--forcepos] [--nocompoundify] [--usemono] [--score] [--backoff] | {--help | --version}

EXAMPLE --type='WordNet::Similarity::lesk' --basename='test-output' --file=br-a01 --window=2


This script runs WSD experiments with different parameters. Given the similarity measure and the input file or directory, the key file is created and then sorted on columns.

SemCor sense-tagged files are first reformatted into the input format, and the text is then disambiguated. The disambiguated text is reformatted in turn, and the answer file is created by sorting this text on columns.

Finally, the answer file is scored against the key file using a script modeled after the scorer2 C program.

Note that the scoring script does not need the key and answer files to be sorted. The scorer2 C program, however, needs sorted input, so the files are sorted in case you want to use scorer2 to compare the results.


N.B., the = sign between the option name and the option parameter is optional.


--type=MEASURE

The relatedness measure to be used. The default is WordNet::Similarity::lesk.


--sense1

WordNet sense 1 disambiguation guesses that the correct sense of each word is its first sense in WordNet, because the senses of a word in WordNet are ranked by frequency: the first sense is more likely than the second, the second more likely than the third, and so on. For example:

    --sense1 --basename='test-output' --file=br-a01

Do not combine this option with --type or --random.


--random

Random disambiguation selects one of the possible senses of the target word at random. For example:

    --random --basename='test-output' --file=br-a01

Do not combine this option with --type or --sense1.
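The two baselines can be illustrated with a minimal Python sketch. It is not the script's own code: the function names are hypothetical, and the `senses` list is assumed to hold the candidate senses of one target word in WordNet order (most frequent first).

```python
import random

def sense1_baseline(senses):
    """Return the first listed sense, assumed to be the most frequent one."""
    return senses[0] if senses else None

def random_baseline(senses, rng=random):
    """Return one of the candidate senses, chosen uniformly at random."""
    return rng.choice(senses) if senses else None

# Hypothetical sense inventory for one target word, in WordNet order.
senses = ["bank#n#1", "bank#n#2", "bank#n#3"]

print(sense1_baseline(senses))                    # always the first sense
print(random_baseline(senses, random.Random(0)))  # any one of the three
```

Passing a seeded `random.Random` instance makes the random baseline repeatable across runs, which can help when comparing experiments.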


--basename=outputfile

The basename for the output files. The script creates a number of output files: the key file, the answer file, the result file, etc.

For example, given the following command:

    --type='WordNet::Similarity::lesk' --basename='test-output' --file=br-a01

the basename is test-output, so the script creates test-output.key (the key file), test-output.out (the answer file), and a trace file.

The final output is also displayed on standard output.


--semcor DIR

The location of the SemCor directory. This directory contains several sub-directories, including 'brown1' and 'brown2'. Specify only the directory that contains them, not the sub-directories themselves. For example, if /home/user/semcor3.0 contains the brown1 and brown2 directories, specify /home/user/semcor3.0 as the value of this option. Do not use this option together with --file.

    --type='WordNet::Similarity::lesk' --basename='test-output' --semcor=/home/user/semcor3.0


--file FILE [FILE ...]

One or more SemCor-formatted files to process. Use this option instead of --semcor to process only a few SemCor files or to process Senseval files. Multiple files can be given on the command line. For example:

    --type='WordNet::Similarity::lesk' --basename='test-output' --file br-a01 br-a02 br-k18 br-m02 br-r05

Do not use this option together with --semcor.


--config=FILE

The name of a configuration file for the specified relatedness measure.


--stoplist=FILE

A file containing regular expressions (as understood by Perl), each surrounded by slashes (e.g. /\d+/ removes any word containing a digit [0-9]). Any word in the text to be disambiguated that matches one of the regular expressions in the file is removed. Each regular expression must be on its own line, and any trailing whitespace is ignored.

Care must be taken when crafting a stoplist. For example, it is tempting to use /a/ to remove the word 'a', but that expression would remove every word containing the lowercase letter a. A better alternative is /\ba\b/.
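The difference between /a/ and /\ba\b/ can be checked with a small sketch. It uses Python's `re` module rather than Perl, on the assumption that `\b` and `\d` behave the same way for these simple patterns; `load_stoplist` and `apply_stoplist` are hypothetical helpers, not part of the script.

```python
import re

def load_stoplist(lines):
    """Compile one /regex/ per line, ignoring trailing whitespace."""
    patterns = []
    for line in lines:
        line = line.rstrip()
        # Keep only non-empty expressions wrapped in slashes.
        if len(line) > 2 and line.startswith("/") and line.endswith("/"):
            patterns.append(re.compile(line[1:-1]))
    return patterns

def apply_stoplist(words, patterns):
    """Drop every word that matches at least one stoplist pattern."""
    return [w for w in words if not any(p.search(w) for p in patterns)]

words = ["a", "cat", "sat", "on", "the", "mat", "in", "1999"]

too_greedy = load_stoplist(["/a/", "/\\d+/"])
careful = load_stoplist(["/\\ba\\b/  ", "/\\d+/"])

print(apply_stoplist(words, too_greedy))  # ['on', 'the', 'in']
print(apply_stoplist(words, careful))     # ['cat', 'sat', 'on', 'the', 'mat', 'in']
```

The greedy pattern removes 'cat', 'sat' and 'mat' along with 'a', while the word-boundary version removes only the standalone 'a'.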


--window=INT

Defines the size of the window of context. The default is 4. A window size of N means that there will be a total of N words in the context window, including the target word. If N is a (positive) even number, then there will be one more word on the left side of the target word than on the right.

For example, if the window size is 4, then there will be two words on the left side of the target word and one on the right. If the window is 5, then there will be two words on each side of the target word.

The minimum window size is 2. A smaller window would mean that there were no context words in the window.
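The left/right split described above can be written out as a short sketch. The function name `context_window` is hypothetical, and truncating the window at the edges of the text is an assumption; the script itself may handle text boundaries differently.

```python
def context_window(words, target_index, window=4):
    """Return the window of `window` words around words[target_index]."""
    if window < 2:
        raise ValueError("the minimum window size is 2")
    context = window - 1          # words other than the target
    left = (context + 1) // 2     # the extra word goes to the left
    right = context // 2
    # Assumption: the window is simply cut off at the text boundaries.
    start = max(0, target_index - left)
    end = min(len(words), target_index + right + 1)
    return words[start:end]

words = ["the", "quick", "brown", "fox", "jumps", "over"]

print(context_window(words, 3, 4))  # ['quick', 'brown', 'fox', 'jumps']
print(context_window(words, 3, 5))  # ['quick', 'brown', 'fox', 'jumps', 'over']
```

With a window of 4 the target 'fox' gets two words on the left and one on the right; with a window of 5 it gets two on each side.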


--contextScore NUM

If no sense of the target word achieves this minimum score, then no winner will be projected (i.e., it is assumed that there is no best sense or that none of the senses are sufficiently related to the surrounding context). The default is zero.


--pairScore NUM

The minimum pairwise score between a sense of the target word and the best sense of a context word that will be used in computing the overall score for that sense of the target word. Setting this to be greater than zero (but not too large) will reduce noise. The default is zero.
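How the two thresholds might interact is sketched below. This is an illustration under stated assumptions, not the actual algorithm: `pick_sense` is a hypothetical function, the overall score of a sense is assumed to be the sum of its retained pairwise scores, and both thresholds are assumed to be inclusive (a score equal to the threshold passes).

```python
def pick_sense(pair_scores, context_score=0.0, pair_score=0.0):
    """Pick the best-scoring sense, or None if no sense scores high enough.

    pair_scores maps each sense of the target word to a list holding, for
    each context word, the best pairwise score against that word's senses.
    """
    best_sense, best_total = None, None
    for sense, scores in pair_scores.items():
        # Pairwise scores below the pairScore threshold are ignored.
        total = sum(s for s in scores if s >= pair_score)
        if best_total is None or total > best_total:
            best_sense, best_total = sense, total
    # No winner unless the best total reaches the contextScore threshold.
    if best_total is None or best_total < context_score:
        return None
    return best_sense

# Hypothetical scores for two senses against three context words.
scores = {"s1": [0.1, 0.05, 0.5], "s2": [0.3, 0.3, 0.0]}

print(pick_sense(scores))                     # s1
print(pick_sense(scores, pair_score=0.2))     # s2
print(pick_sense(scores, context_score=1.0))  # None
```

In the second call the small scores 0.1 and 0.05 are treated as noise and dropped, which changes the winner from s1 to s2.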


--forcepos

Turn part of speech coercion on. POS coercion attempts to force other words in the context window to be of the same part of speech as the target word. If the text is POS tagged, the POS tags will be ignored. POS coercion may be useful when using a measure of semantic similarity that only works with noun-noun and verb-verb pairs.


--nocompoundify

Disable compoundifying, which is enabled by default.


--usemono

If this flag is on, monosemous words are assigned their only available sense. By default this flag is off.


--backoff

Use the most frequent sense if the measure cannot assign a sense because no relatedness is found with the surrounding words. This happens for path-based measures and information-content-based measures.
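The effect of --usemono and --backoff on a single word can be sketched as follows. `assign_sense` is a hypothetical helper, `measured_sense` stands for whatever the relatedness measure chose (or None when it found no relatedness), and the order of the checks is an assumption.

```python
def assign_sense(senses, measured_sense, usemono=False, backoff=False):
    """Decide the final sense for one word.

    senses is the word's sense list, assumed to be most frequent first.
    """
    # --usemono: a word with a single sense simply gets that sense.
    if usemono and len(senses) == 1:
        return senses[0]
    # Otherwise prefer whatever the relatedness measure picked.
    if measured_sense is not None:
        return measured_sense
    # --backoff: fall back to the most frequent sense.
    if backoff and senses:
        return senses[0]
    return None

print(assign_sense(["dog#n#1"], None, usemono=True))             # dog#n#1
print(assign_sense(["run#v#1", "run#v#2"], None))                # None
print(assign_sense(["run#v#1", "run#v#2"], None, backoff=True))  # run#v#1
```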


--score

Score only specific instances. Valid options are:

    --score poly    score only polysemous instances
    --score s1nc    score only the instances where the most frequent sense is not correct
    --score n       score only the instances having n senses
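The three filters can be expressed as a small selection function. The instance layout used here (a dict with a `senses` list in frequency order and a `correct` sense) and the name `select_instances` are hypothetical and only serve the illustration.

```python
def select_instances(instances, mode):
    """Return the instances that the given --score mode would score."""
    if mode == "poly":
        # Only words that have more than one sense.
        return [i for i in instances if len(i["senses"]) > 1]
    if mode == "s1nc":
        # Only words whose most frequent (first) sense is not the answer.
        return [i for i in instances if i["senses"][0] != i["correct"]]
    # Otherwise the mode is a number n: words with exactly n senses.
    n = int(mode)
    return [i for i in instances if len(i["senses"]) == n]

# Hypothetical instances; senses are listed most frequent first.
instances = [
    {"word": "dog", "senses": ["dog#1"], "correct": "dog#1"},
    {"word": "bank", "senses": ["bank#1", "bank#2"], "correct": "bank#2"},
    {"word": "run", "senses": ["run#1", "run#2", "run#3"], "correct": "run#1"},
]

print([i["word"] for i in select_instances(instances, "poly")])  # ['bank', 'run']
print([i["word"] for i in select_instances(instances, "s1nc")])  # ['bank']
print([i["word"] for i in select_instances(instances, "3")])     # ['run']
```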


 Jason Michelizzi

 Varada Kolhatkar, University of Minnesota, Duluth
 kolha002 at

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at

This document last modified by : $Id:,v 1.17 2009/05/19 21:59:24 kvarada Exp $




Copyright (c) 2009, Varada Kolhatkar, Ted Pedersen, Jason Michelizzi

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web and is included in this distribution as FDL.txt.