NAME
vocabulary -- extract vocabularies from Penn treebank files
SYNOPSIS
vocabulary [-NT ntfile] [-POS posfile] [-word wordfile] [-count] [-binarized] [-verbose] file1 [file2...]
file1, file2, etc. are the names of Penn treebank files. If none are specified, input is read from STDIN.
OPTIONS
-NT ntfile
    Write the non-terminal node vocabulary to ntfile.
-POS posfile
    Write the part-of-speech vocabulary to posfile.
-word wordfile
    Write the word vocabulary to wordfile.
-count
    Print the frequency counts for each of the categories.
-binarized
    The input is in binarized format.
-verbose
    Print filenames as they are processed.
DESCRIPTION
Given a list of Penn treebank files, this script extracts the words, parts of speech, and non-terminal node names and emits each in a separate file in order of frequency.
Giving "-" as the ntfile, posfile, or wordfile argument causes that vocabulary to be written to STDOUT.
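The extraction described above can be illustrated with a minimal Python sketch. This is not the script's own code, only one way the three vocabularies might be collected, under these assumptions: trees use the standard bracketed Penn treebank notation, a label followed directly by a word is a part of speech, any other label is a non-terminal, and "order of frequency" means most frequent first. The function names extract_vocabularies and format_vocabulary, and the tab-separated count layout, are hypothetical. The binarized format is not handled.

```python
import re
from collections import Counter

# A token is an opening bracket, a closing bracket, or a run of other
# non-whitespace characters (a label or a word).
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def extract_vocabularies(text):
    """Return (nt, pos, word) Counters for bracketed treebank text.

    Assumption: a label immediately followed by a word is a part of
    speech; a label followed by a bracket is a non-terminal.
    """
    nt, pos, word = Counter(), Counter(), Counter()
    tokens = TOKEN.findall(text)
    for i, tok in enumerate(tokens):
        if tok in ("(", ")"):
            continue
        prev = tokens[i - 1] if i > 0 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if prev == "(":
            # First symbol after "(" is a node label.
            if nxt is not None and nxt not in ("(", ")"):
                pos[tok] += 1
            else:
                nt[tok] += 1
        else:
            # Any other symbol is a word under a part-of-speech node.
            word[tok] += 1
    return nt, pos, word


def format_vocabulary(counter, with_counts=False):
    """List entries most frequent first, optionally with their counts."""
    lines = []
    for item, n in counter.most_common():
        lines.append(f"{item}\t{n}" if with_counts else item)
    return "\n".join(lines)


sample = (
    "( (S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .)) )\n"
    "( (S (NP (DT The) (NN dog)) (VP (VBD ran)) (. .)) )\n"
)
nt, pos, word = extract_vocabularies(sample)
print(format_vocabulary(word, with_counts=True))
```

In this sketch the with_counts argument plays the role that -count plays on the command line, and each Counter corresponds to one of the three output files.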
AUTHOR
W.P. McNeill <billmcn@ssli.ee.washington.edu>