Measures.pod
A list and description of the measures of association included in NSP
This is among the more widely used tests for finding strongly associated bigrams. The following provides a theoretical argument in its favor:
 @article{Dunning93,
         author = {Dunning, T.},
         title = {Accurate Methods for the Statistics of Surprise and Coincidence},
         journal = {Computational Linguistics},
         volume = {19},
         number = {1},
         year = {1993},
         pages = {61--74}}
This is our only test for trigrams. It extends directly from ll.pm.
Widely used, though perhaps not the best choice. The Manning and Schutze textbook, "Foundations of Statistical Natural Language Processing" (MIT Press), presents several arguments against its use.
Very closely related to PMI, and it has some of the same drawbacks. A nice general discussion of this measure can be found here:
 @article{SmadjaMH96,
         author = {Smadja, F. and McKeown, K. and Hatzivassiloglou, V.},
         title = {Translating Collocations for Bilingual Lexicons: A Statistical Approach},
         journal = {Computational Linguistics},
         volume = {22},
         number = {1},
         year = {1996},
         pages = {1--38}}
Docs/FAQ.txt discusses Fisher's Exact Test in some depth. You can see a more detailed treatment of it at:
 @inproceedings{Pedersen96,
         author = {Pedersen, T.},
         title = {Fishing For Exactness},
         booktitle = {Proceedings of the South Central SAS User's Group (SCSUG-96) Conference},
         year = {1996},
         pages = {188--200},
         month = {October},
         address = {Austin, TX}}
Available from http://www.d.umn.edu/~tpederse/pubs.html
Essentially a mirror image of the left-sided test. The left-sided test is recommended over the right-sided one for identifying significant bigrams. See Docs/FAQ.txt for a discussion of this point.
There are very good reasons to fear using tests of association with collocation data, since the counts are both large and skewed. One sanity check you can make is to compare the scores found by ll.pm and x2.pm. If they are not too different from one another, then you are probably not violating any (too many?) asymptotic assumptions. If they do diverge quite a bit, then you may want to consider an exact test. How can you tell if they diverge? Use rank.pl!
rank-script.sh ll x2 input-file
will produce a correlation score. If that value is high, then you can be fairly confident that things are OK and that your tests (either ll or x2) are valid.
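As a rough illustration of what a rank correlation between two sets of scores looks like, here is a minimal Python sketch of Spearman's rank correlation with tied scores sharing their average rank. It is not the NSP implementation: the function names and the dictionary inputs (bigram mapped to score) are assumptions made for this example, and NSP's own tie handling may differ.

```python
import math

def average_ranks(scores):
    """Rank items by descending score; tied scores share the mean of their ranks."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of items tied with item i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(scores_a, scores_b):
    """Pearson correlation of the ranks of the items present in both score sets."""
    common = sorted(set(scores_a) & set(scores_b))
    ranks_a = average_ranks({k: scores_a[k] for k in common})
    ranks_b = average_ranks({k: scores_b[k] for k in common})
    mean = (len(common) + 1) / 2  # mean of the ranks 1..n, also with averaged ties
    cov = sum((ranks_a[k] - mean) * (ranks_b[k] - mean) for k in common)
    var_a = sum((ranks_a[k] - mean) ** 2 for k in common)
    var_b = sum((ranks_b[k] - mean) ** 2 for k in common)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all ranks are tied")
    return cov / math.sqrt(var_a * var_b)
```

A value near 1 means the two measures order the bigrams almost identically; a value near 0 or below means they diverge.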
This is very closely related to ll.pm, and essentially only differs by a scaling factor. Note that the values produced by tmi.pm are very small (.0000...) so you'll need to use more than the default level of precision (which is 4 digits). Consider --precision 8, for example.
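The scaling relationship can be checked numerically. The sketch below assumes the usual textbook definitions, which may not match the NSP modules in every detail: log-likelihood as 2 * sum(n_ij * ln(n_ij / m_ij)) and true mutual information as sum((n_ij / npp) * log2(n_ij / m_ij)), where m_ij is the expected count. Under those assumptions, ll = 2 * npp * ln(2) * tmi. The function names are hypothetical, and the table is assumed to be non-empty with no empty row or column.

```python
import math

def _cells(n11, n12, n21, n22):
    """Yield (observed, expected, npp) for each cell of a 2x2 table."""
    npp = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    for i in range(2):
        for j in range(2):
            yield observed[i][j], rows[i] * cols[j] / npp, npp

def log_likelihood(n11, n12, n21, n22):
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e, _ in _cells(n11, n12, n21, n22) if o > 0)

def true_mutual_information(n11, n12, n21, n22):
    # Same sum, but on probabilities and with base-2 logs (an assumption).
    return sum((o / npp) * math.log2(o / e)
               for o, e, npp in _cells(n11, n12, n21, n22) if o > 0)
```

Because tmi divides every term by npp, its values shrink as the corpus grows, which is why extra digits of precision are needed.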
This implementation is based on the description of Phi in:
 @inproceedings{GaleC91,
         author = {Gale, W. and Church, K.},
         title = {A Program for Aligning Sentences in Bilingual Corpora},
         booktitle = {Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics},
         address = {Berkeley, CA},
         year = {1991}}
If the table is:
      n11   n12 | n1p
      n21   n22 | n2p
      ---------
      np1   np2   npp
It is defined as:
      ((n11 * n22) - (n21 * n12))^2 / (n1p * np1 * n2p * np2)
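A minimal Python sketch of this score, for illustration only. It takes the numerator to be the squared difference of the cross products, (n11 * n22 - n21 * n12)^2, and the denominator to be the product of the four marginals. The function name is hypothetical and not part of NSP, and raising an error on a zero marginal is an assumption of this example.

```python
def phi_squared(n11, n12, n21, n22):
    """Score a 2x2 contingency table from its four cell counts."""
    # Marginal totals, named as in the table above.
    n1p = n11 + n12
    n2p = n21 + n22
    np1 = n11 + n21
    np2 = n12 + n22
    denominator = n1p * np1 * n2p * np2
    if denominator == 0:
        raise ValueError("undefined when any marginal total is zero")
    return ((n11 * n22) - (n21 * n12)) ** 2 / denominator
```

A table with all its mass on the diagonal, such as phi_squared(10, 0, 0, 10), scores 1.0, while a perfectly uniform table scores 0.0.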
This implementation is based on the description of the t-score in:
 @incollection{ChurchGHH91,
         author = {Church, K. and Gale, W. and Hanks, P. and Hindle, D.},
         title = {Using Statistics in Lexical Analysis},
         booktitle = {Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon},
         editor = {Zernik, U.},
         year = {1991},
         address = {Hillsdale, NJ},
         publisher = {Lawrence Erlbaum Associates}}
It is defined as:
      (n11 - m11) / sqrt(n11)
where m11 = n1p * np1/npp
In words, this means the observed frequency of the bigram minus the expected count of the bigram, divided by the square root of the observed value.
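The description in words translates into a few lines of Python. This is only a sketch: the function name is hypothetical, and rejecting an observed count of zero is an assumption made here to avoid dividing by zero.

```python
import math

def t_score(n11, n1p, np1, npp):
    """(observed - expected) / sqrt(observed) for a bigram."""
    if n11 <= 0 or npp <= 0:
        raise ValueError("requires a positive observed count and a positive total")
    m11 = n1p * np1 / npp  # expected count of the bigram under independence
    return (n11 - m11) / math.sqrt(n11)
```

For example, with n11 = 4, n1p = 10, np1 = 10 and npp = 100, the expected count is 1 and the score is (4 - 1) / 2 = 1.5.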
Widely used in many realms, not so much in finding collocations. Essentially takes the ratio of the cross products of the elements in a 2-d table. If the table is:
      n11   n12
      n21   n22
the odds ratio = (n11 * n22) / (n21 * n12)
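As a small illustration, the ratio of cross products can be written in Python as follows. The function name is hypothetical, and raising an error when n21 or n12 is zero is an assumption of this sketch; an actual implementation may smooth zero counts instead.

```python
def odds_ratio(n11, n12, n21, n22):
    """Ratio of the cross products of a 2x2 contingency table."""
    denominator = n21 * n12
    if denominator == 0:
        raise ValueError("undefined when n21 or n12 is zero")
    return (n11 * n22) / denominator
```

With the table 10, 5 / 2, 20 the cross products are 200 and 10, giving an odds ratio of 20.0.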
 Satanjeev Banerjee (dice, ll, pmi, leftFisher, x2)
 bane0025@d.umn.edu

 Amruta Purandare (ll3, tmi, tmi3)
 pura0010@d.umn.edu

 Ted Pedersen (odds, tscore, rightFisher, phi)
 tpederse@d.umn.edu

 Date of last update: July 25, 2003 by TDP
We welcome additional contributions - please check out Docs/NewStats.txt for information on how to implement measures.
 home page:    http://www.d.umn.edu/~tpederse/nsp.html
 mailing list: http://groups.yahoo.com/group/ngram/
Copyright (C) 2003 Ted Pedersen, Satanjeev Banerjee, and Amruta Purandare
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.