NAME

vocabulary -- extract vocabularies from Penn treebank files

SYNOPSIS

vocabulary [-NT ntfile] [-POS posfile] [-word wordfile] [-count] [-binarized] [-verbose] file1 [file2...]

file1, file2, etc. are the names of Penn treebank files. If none are specified, input is read from STDIN.

OPTIONS

NT

Write the non-terminal node vocabulary to ntfile.

POS

Write the part-of-speech vocabulary to posfile.

word

Write the word vocabulary to wordfile.

count

Print the frequency counts for each of the categories.

binarized

The input files are in binarized format.

verbose

Print filenames as they are processed.

DESCRIPTION

Given a list of Penn treebank files, this script extracts the words, parts of speech, and non-terminal node names, and writes each vocabulary to its own file, ordered by frequency.
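
The extraction step can be illustrated with a minimal Python sketch. It is not the script's own code: the function names count_vocabularies and by_frequency are hypothetical, the input is assumed to be standard bracketed Penn treebank trees, and the output is assumed to be sorted with the most frequent item first. Empty elements such as (-NONE- *T*-1) are counted like any other tag and word, which the real script may or may not do.

```python
import re
from collections import Counter

# Split a bracketed tree into "(", ")" and bare symbols.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def count_vocabularies(text):
    """Count non-terminal labels, POS tags, and words in bracketed trees."""
    nt, pos, word = Counter(), Counter(), Counter()
    tokens = TOKEN.findall(text)
    for i, tok in enumerate(tokens):
        if tok != "(" or i + 1 >= len(tokens):
            continue
        label = tokens[i + 1]
        if label in ("(", ")"):
            continue  # unlabeled wrapper, e.g. the outer "( (S ...))"
        nxt = tokens[i + 2] if i + 2 < len(tokens) else ")"
        if nxt == "(":
            nt[label] += 1      # node with child nodes: non-terminal
        elif nxt != ")":
            pos[label] += 1     # "(TAG word)": part-of-speech node
            word[nxt] += 1
    return nt, pos, word

def by_frequency(counter):
    """Return (item, count) pairs, most frequent first (assumed order)."""
    return counter.most_common()

if __name__ == "__main__":
    tree = "( (S (NP (DT the) (NN dog)) (VP (VBZ barks))) )"
    for name, vocab in zip(("NT", "POS", "word"), count_vocabularies(tree)):
        print(name, by_frequency(vocab))
```

Each vocabulary is kept in its own Counter, mirroring the three separate output files.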

Giving "-" as ntfile, posfile, or wordfile writes that vocabulary to STDOUT instead of a file.
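
The "-" convention is common among command-line tools. A minimal Python sketch of one way to handle it follows; the helper names open_output and write_vocabulary are hypothetical and do not come from the script itself.

```python
import sys
from contextlib import contextmanager

@contextmanager
def open_output(path):
    """Yield sys.stdout for "-", otherwise a newly opened file."""
    if path == "-":
        yield sys.stdout        # leave STDOUT open for the caller
    else:
        with open(path, "w", encoding="utf-8") as handle:
            yield handle        # closed automatically on exit

def write_vocabulary(path, items):
    """Write one vocabulary item per line to path, or to STDOUT for "-"."""
    with open_output(path) as out:
        for item in items:
            out.write(f"{item}\n")

if __name__ == "__main__":
    write_vocabulary("-", ["NP", "VP", "S"])
```

Treating "-" in one place keeps the writing code identical for files and STDOUT.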

AUTHOR

W.P. McNeill <billmcn@ssli.ee.washington.edu>