vocabulary -- extract vocabularies from Penn treebank files
vocabulary [-NT ntfile] [-POS posfile] [-word wordfile] [-count] [-binarized] [-verbose] file1 [file2...]
file1, file2, etc. are the names of Penn treebank files. If none are specified, input is read from STDIN.
-NT ntfile
    Write the non-terminal node vocabulary to ntfile.
-POS posfile
    Write the part-of-speech vocabulary to posfile.
-word wordfile
    Write the word vocabulary to wordfile.
-count
    Print the frequency counts for each of the categories.
-binarized
    The input files are in binarized format.
-verbose
    Print filenames as they are processed.
Given a list of Penn treebank files, this script extracts the words, parts of speech, and non-terminal node names and emits each in a separate file in order of frequency.
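The extraction described above can be illustrated with a short Python sketch. It is not the script's actual implementation (the script itself is Perl and its internals are not shown here). The function names extract_vocabularies and by_frequency are hypothetical, and the sketch assumes plain bracketed treebank text, that "in order of frequency" means most frequent first, and that empty elements such as (-NONE- *T*-1) are counted like any other tag and word.

```python
import re
from collections import Counter

# A token is an opening paren, a closing paren, or a run of other non-space characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def extract_vocabularies(text):
    """Count non-terminals, part-of-speech tags, and words in bracketed treebank text."""
    nonterminals, pos_tags, words = Counter(), Counter(), Counter()
    tokens = TOKEN.findall(text)
    for i, tok in enumerate(tokens):
        if tok != "(" or i + 1 >= len(tokens):
            continue
        label = tokens[i + 1]
        if label in ("(", ")"):
            continue  # unlabeled wrapper, e.g. the outer "( (S ...) )"
        following = tokens[i + 2] if i + 2 < len(tokens) else ")"
        if following == "(":
            # A label followed by another subtree is a non-terminal node.
            nonterminals[label] += 1
        elif following != ")":
            # A label followed by a bare token is a POS tag with its word.
            pos_tags[label] += 1
            words[following] += 1
    return nonterminals, pos_tags, words


def by_frequency(counter):
    """Return (item, count) pairs, most frequent first (assumed ordering)."""
    return counter.most_common()


if __name__ == "__main__":
    sample = "( (S (NP (DT the) (NN cat)) (VP (VBD saw) (NP (DT the) (NN dog)))) )"
    nts, tags, vocab = extract_vocabularies(sample)
    for name, counter in (("NT", nts), ("POS", tags), ("word", vocab)):
        print(name, by_frequency(counter))
```

Each of the three counters corresponds to one of the vocabularies that the script writes to its own file.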
Note that giving a "-" argument for any of ntfile, posfile, or wordfile causes the results to be written to STDOUT.
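The "-" convention can be sketched in Python as follows. This is an illustration only, not the script's code: open_output and write_vocabulary are hypothetical names, and the tab-separated "item, count" line layout is an assumption, since the exact output format is not specified here.

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_output(name):
    """Yield sys.stdout when name is "-", otherwise a file opened for writing."""
    if name == "-":
        yield sys.stdout  # left open: the caller does not own STDOUT
    else:
        with open(name, "w", encoding="utf-8") as handle:
            yield handle


def write_vocabulary(name, entries, with_counts=False):
    """Write one vocabulary item per line; append the count when requested."""
    with open_output(name) as out:
        for item, count in entries:
            # Assumed layout: "item<TAB>count" with counts, bare "item" without.
            out.write(f"{item}\t{count}\n" if with_counts else f"{item}\n")


if __name__ == "__main__":
    # Mirrors passing "-" as wordfile together with -count.
    write_vocabulary("-", [("the", 2), ("cat", 1)], with_counts=True)
```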
W.P. McNeill <billmcn@ssli.ee.washington.edu>