Ted Pedersen




A list and description of the measures of association included in NSP


1) ll.pm Log-Likelihood Ratio

This is among the more widely used tests for finding strongly associated bigrams. The following provides a theoretical argument in its favor:

        author = {Dunning, T.},
        title = {Accurate Methods for the Statistics of
                        Surprise and Coincidence},
        journal = {Computational Linguistics},
        volume = {19},
        number = {1},
        year = {1993},
        pages = {61-74}}
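
As an illustration only, the log-likelihood ratio for a 2x2 bigram table can be sketched in Python as below. This is the standard G^2 form (twice the sum of observed * ln(observed / expected)), not a copy of ll.pm. The function name and the choice to pass n11 plus the marginal totals n1p, np1 and npp are assumptions made for this sketch.

```python
import math

def log_likelihood(n11, n1p, np1, npp):
    """G^2 = 2 * sum(n_ij * ln(n_ij / m_ij)) over the four cells of a 2x2 table.

    Hypothetical helper for illustration; not part of NSP.
    """
    n2p = npp - n1p
    np2 = npp - np1
    # Observed cell counts recovered from n11 and the marginal totals.
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    # Expected counts under independence: (row total * column total) / npp.
    expected = [n1p * np1 / npp, n1p * np2 / npp,
                n2p * np1 / npp, n2p * np2 / npp]
    # A cell with an observed count of zero contributes nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# A bigram seen 10 times where independence predicts 4 scores well above 0;
# a table that matches independence exactly scores 0.
score = log_likelihood(10, 20, 20, 100)
```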

2) ll3.pm Log-Likelihood Ratio for Trigrams

This is our only test for trigrams. It extends directly from ll.pm.

3) pmi.pm Pointwise Mutual Information

Widely used, though perhaps not the best choice. The Manning and Schutze textbook, "Foundations of Statistical Natural Language Processing", MIT Press, presents several arguments against its use.
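
A minimal Python sketch of the measure follows, assuming the usual definition log(n11 / m11) with m11 = n1p * np1 / npp, and assuming a base-2 logarithm. The function name is hypothetical and the code is not taken from pmi.pm. The example also shows one commonly cited concern: a pair of words that each occur only once, together, outscores a far more frequent bigram.

```python
import math

def pmi(n11, n1p, np1, npp):
    """Pointwise mutual information: log2 of observed over expected count.

    Hypothetical helper for illustration; not part of NSP.
    """
    m11 = n1p * np1 / npp  # expected count of the bigram under independence
    return math.log2(n11 / m11)

rare = pmi(1, 1, 1, 100)        # both words occur once, and only together
frequent = pmi(10, 20, 20, 100)  # bigram seen 10 times, expected 4
# rare is larger than frequent even though it rests on a single observation.
```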

4) dice.pm Dice Coefficient

Very closely related to PMI, and has some of the same drawbacks. A nice general discussion of this measure can be found in:

@article{SmadjaMH96,
        author = {Smadja, F. and McKeown, K. and Hatzivassiloglou, V.},
        title = {Translating Collocations for Bilingual Lexicons:
                        A Statistical Approach},
        journal = {Computational Linguistics},
        volume = {22},
        number = {1},
        year = {1996},
        pages = {1-38}}
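
For orientation, the Dice coefficient is usually written as 2 * n11 / (n1p + np1). The Python sketch below assumes that definition; the function name is hypothetical and the code is not taken from dice.pm.

```python
def dice(n11, n1p, np1):
    """Dice coefficient: twice the joint count over the sum of the two word counts.

    Hypothetical helper for illustration; not part of NSP.
    """
    return 2 * n11 / (n1p + np1)

# Two words that only ever occur together reach the maximum value of 1.0.
always_together = dice(5, 5, 5)
half_the_time = dice(10, 20, 20)
```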

5) leftFisher.pm Fisher's Exact Test (left sided)

Docs/FAQ.txt discusses Fisher's Exact Test in some depth. You can see a more detailed treatment of it at:

        author = {Pedersen, T.},
        title = {Fishing For Exactness},
        booktitle = {Proceedings of the South Central SAS User's
                Group (SCSUG-96) Conference},
        year = {1996},
        pages = {188--200},
        month ={October},
        address = {Austin, TX}}

Available from http://www.d.umn.edu/~tpederse/pubs.html

6) rightFisher.pm Fisher's Exact Test (right sided)

Essentially a mirror image of the left sided test. The left sided test is recommended over the right for identifying significant bigrams. See Docs/FAQ.txt for a discussion of this point.
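
The two sided-ness options can be sketched in Python as sums over the hypergeometric distribution with the marginal totals held fixed. This assumes the left sided test is P(X <= n11) and the right sided test is P(X >= n11). All function names are hypothetical, and the code is an illustration rather than the method used in leftFisher.pm or rightFisher.pm. It needs Python 3.8 or later for math.comb.

```python
from math import comb

def hypergeom_pmf(k, n1p, np1, npp):
    """Probability of exactly k joint occurrences with all marginals fixed."""
    return comb(n1p, k) * comb(npp - n1p, np1 - k) / comb(npp, np1)

def left_fisher(n11, n1p, np1, npp):
    """P(X <= n11): tables with n11 or fewer joint occurrences."""
    lowest = max(0, n1p + np1 - npp)  # smallest n11 the marginals allow
    return sum(hypergeom_pmf(k, n1p, np1, npp)
               for k in range(lowest, n11 + 1))

def right_fisher(n11, n1p, np1, npp):
    """P(X >= n11): tables with n11 or more joint occurrences."""
    highest = min(n1p, np1)  # largest n11 the marginals allow
    return sum(hypergeom_pmf(k, n1p, np1, npp)
               for k in range(n11, highest + 1))

left = left_fisher(3, 5, 5, 20)
right = right_fisher(3, 5, 5, 20)
# The two tails overlap only at the observed table, so
# left + right - hypergeom_pmf(3, 5, 5, 20) sums to 1.
```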

7) x2.pm Pearson's Chi-Squared Test

There are very good reasons to fear using tests of association with collocation data, since the counts are both large and skewed. One sanity check you can make is to compare the scores found by ll.pm and x2.pm. If they are not too different from one another, then you are probably not violating any (too many?) asymptotic assumptions. If they do diverge quite a bit, then you may want to consider an exact test. How can you tell if they diverge? Use rank.pl!

rank-script.sh ll x2 input-file

will produce a correlation score. If that value is high, then you can be fairly confident that things are OK and your tests (either ll or x2) are valid.
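
For reference, Pearson's statistic on a 2x2 table is the sum of (observed - expected)^2 / expected over the four cells. The Python sketch below assumes that textbook form; the function name is hypothetical and the code is not taken from x2.pm.

```python
def pearson_x2(n11, n1p, np1, npp):
    """Pearson's chi-squared: sum of (observed - expected)^2 / expected.

    Hypothetical helper for illustration; not part of NSP.
    """
    n2p = npp - n1p
    np2 = npp - np1
    # Observed cell counts recovered from n11 and the marginal totals.
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    # Expected counts under independence: (row total * column total) / npp.
    expected = [n1p * np1 / npp, n1p * np2 / npp,
                n2p * np1 / npp, n2p * np2 / npp]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

x2 = pearson_x2(10, 20, 20, 100)
```

With these counts the expected cells are 4, 16, 16 and 64, each observed cell differs from its expectation by 6, and the statistic works out to 36/4 + 36/16 + 36/16 + 36/64 = 14.0625.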

8) tmi.pm "True" Mutual Information

This is very closely related to ll.pm, and essentially only differs by a scaling factor. Note that the values produced by tmi.pm are very small (.0000...) so you'll need to use more than the default level of precision (which is 4 digits). Consider --precision 8, for example.
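
The scaling relationship can be illustrated with a short Python sketch. It assumes mutual information is computed as the sum over the four cells of (n_ij / npp) * log2(n_ij / m_ij), in which case the log-likelihood ratio G^2 equals 2 * npp * ln(2) times that value. Both function names are hypothetical, and the base-2 logarithm is an assumption rather than something stated here about tmi.pm.

```python
import math

def tmi(n11, n1p, np1, npp):
    """Mutual information of a 2x2 table, in bits (log base 2 assumed).

    Hypothetical helper for illustration; not part of NSP.
    """
    n2p = npp - n1p
    np2 = npp - np1
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [n1p * np1 / npp, n1p * np2 / npp,
                n2p * np1 / npp, n2p * np2 / npp]
    # Cells with an observed count of zero contribute nothing.
    return sum((o / npp) * math.log2(o / e)
               for o, e in zip(observed, expected) if o > 0)

def ll_from_tmi(tmi_value, npp):
    """Rescale mutual information in bits to G^2 = 2 * npp * ln(2) * TMI."""
    return 2 * npp * math.log(2) * tmi_value

small = tmi(10, 20, 20, 100)         # a value below 0.1
rescaled = ll_from_tmi(small, 100)   # the corresponding G^2, above 12
```

Because the score is divided by the sample size, it shrinks as the corpus grows, which is why the extra digits of precision matter.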

9) phi.pm Phi Coefficient (2 variables - bigrams)

This implementation is based on the description of Phi in:

        author = {Gale, W. and Church, K.},
        title = {A Program for Aligning Sentences in Bilingual Corpora},
        booktitle = {Proceedings of the 29th Annual Meeting of the
                Association for Computational Linguistics},
        address = {Berkeley, CA},
        year = {1991}}

If the table is:

 n11 n12 | n1p
 n21 n22 | n2p
 np1 np2   npp

It is defined as:

 ((n11 * n22) - (n21 * n12))^2 / (n1p * np1 * n2p * np2)
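
A minimal Python sketch of this computation follows, using the cross-product form (n11 * n22 - n21 * n12)^2 divided by the product of the four marginal totals. The function name is hypothetical and the code is not taken from phi.pm.

```python
def phi_squared(n11, n12, n21, n22):
    """Squared cross-product difference over the product of the marginals.

    Hypothetical helper for illustration; not part of NSP.
    """
    n1p, n2p = n11 + n12, n21 + n22  # row totals
    np1, np2 = n11 + n21, n12 + n22  # column totals
    return (n11 * n22 - n21 * n12) ** 2 / (n1p * np1 * n2p * np2)

value = phi_squared(10, 10, 10, 70)
independent = phi_squared(4, 16, 16, 64)
```

For the first table the result is 600^2 / (20 * 20 * 80 * 80) = 0.140625; the second table matches independence exactly and gives 0.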

10) tscore.pm t-Score

This implementation is based on the description of the t-score in:

 @incollection {ChurchGHH91,
        author={Church, K. and Gale, W. and Hanks, P. and Hindle, D. },
        title={Using Statistics in Lexical Analysis},
        booktitle={Lexical Acquisition: Exploiting On-Line Resources
                        to Build a Lexicon},
        editor={Zernik, U.},
        address={Hillsdale, NJ},
        publisher={Lawrence Erlbaum Associates}}

If the table is:

 n11 n12 | n1p
 n21 n22 | n2p
 np1 np2   npp

It is defined as:

 (n11 - m11) / sqrt(n11)

where m11 = n1p * np1/npp

In words, this means the observed frequency of the bigram minus the expected count of the bigram, divided by the square root of the observed value.
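
The definition above can be sketched in Python as follows. The function name is hypothetical and the code is an illustration, not the contents of tscore.pm.

```python
import math

def tscore(n11, n1p, np1, npp):
    """(observed - expected) / sqrt(observed) for the bigram cell.

    Hypothetical helper for illustration; not part of NSP.
    """
    m11 = n1p * np1 / npp  # expected count of the bigram under independence
    return (n11 - m11) / math.sqrt(n11)

# Observed 16, expected 40 * 50 / 200 = 10, so the score is 6 / 4 = 1.5.
score = tscore(16, 40, 50, 200)
```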

11) odds.pm Odds Ratio (2 variables - bigrams)

Widely used in many realms, not so much in finding collocations. Essentially takes the ratio of the cross products of the elements in a 2-d table. If the table is:

 n11 n12
 n21 n22

the odds ratio = (n11 * n22) / (n21 * n12)
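
As a small Python sketch of the ratio of cross products: the function name is hypothetical, and returning infinity when n21 * n12 is zero is a choice made for this example only. It is not a statement about how odds.pm handles empty cells.

```python
import math

def odds_ratio(n11, n12, n21, n22):
    """Ratio of cross products (n11 * n22) / (n21 * n12).

    Hypothetical helper for illustration; not part of NSP.
    """
    denominator = n21 * n12
    if denominator == 0:
        # Assumption of this sketch: report an unbounded ratio as infinity.
        return math.inf
    return (n11 * n22) / denominator

ratio = odds_ratio(10, 10, 10, 70)        # 700 / 100
independent = odds_ratio(4, 16, 16, 64)   # 256 / 256
```

A value of 1 corresponds to independence, and values above 1 indicate that the two words occur together more often than independence would predict.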


 Satanjeev Banerjee (dice, ll, pmi, leftFisher, x2)  bane0025@d.umn.edu
 Amruta Purandare (ll3, tmi, tmi3)                   pura0010@d.umn.edu
 Ted Pedersen (odds, tscore, rightFisher, phi)       tpederse@d.umn.edu

 Date of last update,  July 25, 2003 by TDP

We welcome additional contributions - please check out Docs/NewStats.txt for information on how to implement measures.



 home page:    http://www.d.umn.edu/~tpederse/nsp.html

 mailing list: http://groups.yahoo.com/group/ngram/


Copyright (C) 2003 Ted Pedersen, Satanjeev Banerjee, and Amruta Purandare

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.