windower.pl - Limit window of context around a target word specified in a Senseval-2 input file
Suppose we have a very small Senseval-2 file (small.xml) with just 2 instances. We would like to limit the surrounding context to 5 words to the left and 5 words to the right of the target word:
windower.pl small.xml 5
Output:
 <?xml version="1.0" encoding="iso-8859-1" ?>
 <corpus lang='english' tagged="NO">
 <lexelt item="begin.v">
 <instance id="begin.555">
 <answer instance="begin.555" senseid="begin%2:30:01::"/>
 <context>
 greats hardly knowns and unknowns <head>begin</head> a game three month season
 </context>
 </instance>
 <instance id="begin.557">
 <answer instance="begin.557" senseid="begin%2:30:01::"/>
 <context>
 late november it expects to <head>begin</head> construction by year end and
 </context>
 </instance>
 </lexelt>
 </corpus>
This is from the first two lines of the file begin.v-test.xml. You can see the full contexts at /samples/Data.
Type windower.pl --help for a quick summary of options.
Limits the contexts of given instances to W tokens around the target word.
windower.pl [OPTIONS] SVAL2 W
SVAL2 must be a tokenized and preprocessed instance file in the Senseval-2 format.
W should be a positive integer specifying the window size. windower will display only the tokens that appear in the window of [-W, +W] centered around the target word, i.e. at most W tokens to its left and at most W tokens to its right.
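The window rule can be illustrated with a minimal Python sketch. This is not windower.pl's own code: the function name limit_window is hypothetical, and the sketch assumes the context has already been split into tokens and that the position of the target word is known.

```python
def limit_window(tokens, target_index, w):
    """Keep at most w tokens on each side of tokens[target_index]."""
    if w <= 0:
        raise ValueError("W should be a positive integer")
    start = max(0, target_index - w)  # clip at the start of the context
    end = target_index + w + 1        # slicing clips at the end of the list
    return tokens[start:end]


# Context taken from the first instance of the example output above.
tokens = ("greats hardly knowns and unknowns <head>begin</head> "
          "a game three month season").split()
target = tokens.index("<head>begin</head>")

print(" ".join(limit_window(tokens, target, 2)))
# prints: and unknowns <head>begin</head> a game
```

When fewer than W tokens exist on one side, the sketch simply returns what is available, which matches the "at most W words" wording used for the output.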
--plain
Output will be displayed in plain text format, showing the context of each instance on a separate line, i.e. the i'th line on stdout will show the context of the i'th instance in the given SVAL2 file. By default, output is created in Senseval-2 format.
--token TOKENREGEX
TOKENREGEX should be a file containing Perl regular expressions that define the tokenization scheme in SVAL2. windower recognizes only those character sequences from SVAL2 that match the specified token regex/s; everything else is ignored. If --token is not specified, windower searches for the default token.regex file in the current directory.
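As a rough illustration of regex-driven tokenization, the following Python sketch keeps only the character sequences that match one of the token patterns and drops everything else. It is an assumption-laden approximation, not windower.pl's implementation: the function name tokenize, the two patterns, and the rule of trying patterns in order and skipping one character when none matches are all illustrative choices, and Python's re module stands in for Perl regexes.

```python
import re


def tokenize(text, token_patterns):
    """Return the substrings of text that match a token regex, in order.

    At each position the patterns are tried in the order given and the
    first one that matches there yields a token. Characters that no
    pattern matches are skipped.
    """
    compiled = [re.compile(p) for p in token_patterns]
    tokens, pos = [], 0
    while pos < len(text):
        for rx in compiled:
            m = rx.match(text, pos)
            if m and m.end() > pos:   # ignore empty matches
                tokens.append(m.group())
                pos = m.end()
                break
        else:
            pos += 1                  # no token starts here: skip one character
    return tokens


# Hypothetical scheme: a <head>-tagged word is one token, as is any plain word.
TOKEN_PATTERNS = [r"<head>\w+</head>", r"\w+"]

# The comma is added here only to show that unmatched characters are dropped.
text = "expects to <head>begin</head> construction, by year end"
print(tokenize(text, TOKEN_PATTERNS))
```

Listing the <head> pattern first matters in this sketch: otherwise the plain-word pattern would split the tagged target word into head, begin and head.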
--target TARGET
Specify a file containing Perl regular expressions that define the target word/s. Target words must be valid tokens recognizable by the specified tokenization scheme (via --token or token.regex).
The following are some examples of TARGET word regex files -
A single regex can specify that the target word is one of
 line, Line, lines or Lines
delimited in <head> and </head> tags.
The same targets can also be specified as multiple regexes in TARGET, with a single regex per line -
 /<head>line<\/head>/
 /<head>lines<\/head>/
 /<head>Line<\/head>/
 /<head>Lines<\/head>/
Other possibilities include a more general regex for target words marked in <head> tags, a regex for matching target words in the original Senseval-2 data, and a regex specifying that any occurrence of the words Line, line, Lines or lines is a target word (not delimited by any special tags).
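To make the relationship between one combined regex and several single-word regexes concrete, here is a small Python check. The four separate patterns are the ones listed above, rewritten without Perl's /.../ delimiters; the combined pattern and the helper name is_target are illustrative assumptions rather than text taken from windower.pl.

```python
import re

# The four single-word target regexes from the example, in Python syntax.
SEPARATE = [
    r"<head>line</head>",
    r"<head>lines</head>",
    r"<head>Line</head>",
    r"<head>Lines</head>",
]

# One combined pattern covering the same four forms (an assumed equivalent).
COMBINED = [r"<head>[Ll]ines?</head>"]


def is_target(token, patterns):
    """True if the whole token matches at least one target pattern."""
    return any(re.fullmatch(p, token) for p in patterns)


for tok in ["<head>line</head>", "<head>Lines</head>", "<head>liner</head>", "line"]:
    print(tok, is_target(tok, SEPARATE), is_target(tok, COMBINED))
```

Both pattern sets accept the same tagged forms and reject an untagged line, which is why the target word has to be a token that the tokenization scheme can actually produce.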
Other Options :
--help
Displays this message.
--version
Displays the version information.
When --plain is not selected, OUTPUT is in Senseval-2 format and looks the same as the input SVAL2 file, except that the context of each instance shows at most W words on either side of the target word.
When --plain is ON, OUTPUT shows each context on a single line, i.e. the context of the i'th instance in the given SVAL2 file is shown on the i'th line on stdout.
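The one-context-per-line layout can be sketched in Python as follows. This only illustrates the layout and is not how windower.pl produces it: the function name plain_contexts is hypothetical, contexts are pulled out with a simple regex rather than an XML parser, and the sketch keeps the <head> tags because this documentation does not say whether plain output retains them.

```python
import re


def plain_contexts(sval2_text):
    """Return one whitespace-normalized context string per instance."""
    contexts = re.findall(r"<context>(.*?)</context>", sval2_text, flags=re.DOTALL)
    return [" ".join(c.split()) for c in contexts]


# The two windowed instances from the example at the top of this document.
sample = """<instance id="begin.555">
<context>
greats hardly knowns and unknowns <head>begin</head> a game three month season
</context>
</instance>
<instance id="begin.557">
<context>
late november it expects to <head>begin</head> construction by year end and
</context>
</instance>"""

for line in plain_contexts(sample):
    print(line)  # line i holds the context of instance i
```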
Amruta Purandare, University of Pittsburgh
Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu
Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.