TODO - List of things TODO for SenseClusters
Plans for future versions of SenseClusters
Add support for ngram cluster labeling to web interface?
Resolve Testing error in wordvec.pl, and also deprecated use of defined in keyconvert.pl. These are known issues in 1.03.
Consider the use of SVDLIBC rather than SVDPACKC for SVD, however, have been having problems compiling SVDLIBC on 64 bit platforms.
Testing of SVDPACK remains problematic, since results can vary from platform to platform. At present we are simply testing to see if output is created.
Introduction of CPAN style 'make test' option. Can be coded even for command line interfaces, but can be a little messy, especially when a program is not producing a single value but rather tables of values or formatted text (which is often what we do). Must also do this in such a way that it handles different file system (via File::Spec probably).
Improve order 1 efficiency. Rather than matching each context against every feature (regular expression), match each possible feature in context with the features. The simple approach for invoking count.pl for each context by creating a temporary file for each context suffers from file and process creation overhead. Need to investigate NSP APIs and use them so that features can be identified from contexts in memory instead of having to create temporary files.
Provide simple tools that allow a user to visualize more complicated data structures like the 2nd order context vectors. For example, right now a user can see the word vectors associated that will be averaged together (in the .wordvec file), but they are purely numeric. It would be nice to see what words are associated with these values.
Make discriminate.pl more modular in its organization, possibly through the use of subroutines. Reduce reliance on system calls which can lead to portability problems.
Fix --showargs option on discriminate.pl. Has not been working for many versions now.
Check to see if checking return values from vcluster and scluster in discriminate.pl is really accomplishing anything. Do they return error codes and success codes reliably? Any chance for false positive or false negative?
Replace default stoplist with a program that automatically generates stoplists from a given corpus.
Add version information to SenseClusters web interface, including version of SenseClusters and modules it is using.
There are various small utility programs whose return codes are checked by discriminate. These programs seem to always return via exit, suggesting that their return codes are always for success. May want to return a different value for failures so the discriminate.pl checks are meaningful.
Reduce the number of regular expressions used in the regex file provided to order2vec.pl for feature identification during LSA style context clustering. This is required if we adhere to the nsp2regex.pl approach for feature identification. Right now regexes are generated from training data based on all the features found in the training data. These are given as input to order1vec.pl to identify features from test data. The number of features identified form test data can be less than the number of regexes created from training data (i.e. some regexes may not match anything in the test data). Currently this same regex file is given as input to order2vec.pl in LSA context clustering mode. So we also need to additionally provide a feature file to order2vec.pl specifying what features were actually found in test data ($PREFIX.features_in_testdata file created by discriminate.pl). If we create a regex file corresponding to just the regexes that matched at least once in the test data, then just this new regex file can be provided as input to order2vec.pl --featregex option. This change needs to be done in order1vec.pl (just as currently it prints clabels selectively for only those features that were found in the test data, it can create a new regex file containing just the regexes that matched at least once in the test data). An additional FEATURE file will then no more be required by order2vec.pl in LSA context clustering mode (the FEATURE file will still be required in SC native word or context clustering).
Wherever possible and appropriate, add the error checks from discriminate.pl to the actual programs that require that error check. For example: Check for 0 zero features in order1vec.pl, wordvec.pl and order2vec.pl
Check why the criterion function values as different across platform (Linux vs. Solaris). Currently the test cases for clusterstopping.pl used platform dependent checking - can it be made platform independent?
The idea of global training data is to have one large file of plain text that is used as a source of training data for multiple target word discrimination problems. maketarget.pl used to produce regexes of the form /(line)/, presumably to be used to identify target words in plain corpora (where no head tags have been inserted). Does SenseClusters support the identification of target word features under these circumstances (where there is no head tag)? If so, we should adjust maketarget.pl so that it continues to produce target regexes of this form. As of 0.95 it only produces them with the head tag. Right now discriminate.pl insists that training data have a head tag in it to find tco. It seems like you should still be able to try and find tco features if you have specified a plain text regex such as we have above.
The gap statistic generates expected values for randomly created matrices that have the same marginal totals as the observed data. In some cases the expected values (for a criterion function) are actually greater than the observed, which suggests that the random data is in fact benefiting more from the clustering than the observed data. It isn't clear why the expected values aren't always less than the observed, since random data when clustered should not really get better and better criterion function scores as the number of clusters increases, or at least these should not be greater or better than the observed values.
- Reorganize package, move to CPAN, make installation easier via a Bundle, general clean up releases. (good start made)
Web Interface (done)
- 1. Time outs (done)
Currently if the given data or combinations of options leads to longer than usual processing time then the web-interface just hangs and does not give back any links to the results even if the process has finished. Investigate if the problem is with request time-out and find the solution for this problem.
- 2. Scope Options (done)
The current version of web-interface does not support the scope_train and scope_test options. Implement the same.
- 3. Format option (done)
Also the format option is not available. Implement the same.
- 4. Taint flag (done)
Understand and implement the taint mode.
Labeling Clusters (done)
Discovered clusters will be labeled with their most discriminating features or with the actual dictionary glosses. This will indicate which clusters represent which word senses.
svdpackout (continuing issue)
There remain problems with svdpackout test cases A1g, A1h, and A2 on Solaris. These are due to precision issues with 64 bit CPUs, and related issues, I think.
SVD on Order1 Type vectors (done)
Omit null columns from the order1 context vectors as created by order1vec. This is causing a problem for mat2harbo.pl. Hence, SVD fails on order1 type vectors at present.
Order1 Efficiency (continuing issue)
Order1 in its current form is way too slow even for few thousand contexts and features. It needs to be improved speed wise.
Warnings in Demo (done)
Demo script makedata.sh shows Malformed UTF-8 warnings. Senseval-2 data needs to be cleaned or programs preprocess.pl and prepare_sval2.pl need to be modified to avoid these encoding warnings.
Multi-Lexelt (on hold)
preprocess.pl (and setup.pl that uses preprocess.pl) currently splits given DATA on lexelts. Allow multiple lexelt functionality by modifying preprocess.pl or using some other techniques ...
Amruta has some mixml scripts that can be made distributable to handle this multi-lexelt issue. These scripts concatenate two xml files and create a single xml file. These are useful to combine the split lexelt results from setup/preprocess into a single multi-lexelt file and also for supporting experiments on multiple-lexelts as done in CONLL 2004.
Installation (on hold)
Update Makefile to check if SVDPACK, CLUTO are installed. If not, warn users that options like --svd or cluto can't be used. Auto-download and install CPAN modules like PDL, Bit::Vector.
One idea would be to bundle and redistribute all the required modules and programs from other packages into SenseClusters' tarball and install them like regular SenseClusters' code.
Include questions that would be interesting to general user community.
Fuzzy Feature matching (still a good idea)
Perl supports fuzzy pattern matching ... might be useful in our feature regex matching !
Technical report for SenseClusters - once the sparse matrix support is available. (great idea)
Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu Amruta Purandare University of Pittsburgh Anagha Kulkarni Carnegie-Mellon University
Copyright (C) 2005-2008 Ted Pedersen, Amruta Purandare, Anagha Kulkarni
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.