NAME
analyze - Analyze a corpus with TF-IDF
SYNOPSIS
perl analyze --dir=/some/corpus [options]
perl analyze --dir=/Users/you/Documents/lit/inaugural --top=5 --proper
perl analyze --dir=/Users/you/Documents/lit/inaugural --phrase='public good'
perl analyze --dir=/Users/you/Documents/lit/inaugural --dir=/Users/you/Documents/lit/SOTU --top=5
perl analyze --dir=/Users/you/Documents/lit/Shakespeare --size=3 --top=5
perl analyze --dir=/Users/you/perl5/lib/site_perl/Music --size=1 --type=pm --punc=0
DESCRIPTION
This program computes the TF-IDF (term frequency-inverse document frequency) measure for the ngrams found in the given corpus.
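As a rough illustration of the measure, the sketch below computes a textbook TF-IDF score for word ngrams over a toy corpus. It is not taken from this program: the corpus, the `ngrams` and `tfidf` helpers, and the exact weighting (relative frequency times the natural log of the inverse document frequency) are all assumptions made for the example, and the real script may tokenize and weight differently.

```perl
use strict;
use warnings;

# Hypothetical toy corpus: document name => text.
my %corpus = (
    doc1 => 'the public good is the common good',
    doc2 => 'the common law',
    doc3 => 'a public good',
);

my $size = 2;    # ngram size, mirroring the documented --size default

# Split a text into overlapping word ngrams of the given size.
sub ngrams {
    my ($text, $n) = @_;
    my @words = split ' ', $text;
    return map { join ' ', @words[ $_ .. $_ + $n - 1 ] } 0 .. @words - $n;
}

# Per-document term frequencies and corpus-wide document frequencies.
my (%tf, %df);
for my $doc (keys %corpus) {
    my @grams = ngrams($corpus{$doc}, $size);
    next unless @grams;
    my %count;
    $count{$_}++ for @grams;
    for my $gram (keys %count) {
        $tf{$doc}{$gram} = $count{$gram} / @grams;    # relative frequency in this document
        $df{$gram}++;                                 # number of documents containing the ngram
    }
}

my $docs = keys %corpus;    # total number of documents

# Textbook TF-IDF: tf * ln(N / df). Returns 0 if the ngram is absent from the document.
sub tfidf {
    my ($doc, $gram) = @_;
    return 0 unless $tf{$doc}{$gram};
    return $tf{$doc}{$gram} * log($docs / $df{$gram});
}

printf "%s: %.4f\n", $_, tfidf($_, 'public good') for sort keys %corpus;
```

An ngram that is frequent in one document but rare across the corpus scores highest; one that appears in every document scores zero under this weighting.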
OPTIONS
- --help
-
Brief help message.
- --man
-
Full documentation.
- --dir
-
Directory of text documents that make up the corpus. May be given more than once to analyze several corpora together.
- --size
-
Ngram phrase size.
Default:
2
- --top
-
Show the top N ngrams seen.
Default:
0
- --stop
-
Constrain the ngrams by excluding stopwords.
Default:
1
- --phrase
-
Search the corpus for the phrase and report its TF-IDF values.
Default:
''
- --type
-
Read corpus files of this file extension.
Default:
txt
- --punc
-
A string defining a regular expression of characters to exclude from results. Giving 0 as the value for this will not exclude any characters.
Default:
(?!')[[:punct:]]
Note that the single quote is not excluded by default.
- --lc
-
Lower-case the results.
Default:
0
- --proper
-
Skip phrases with capitalized words (often proper nouns).
Default:
0
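To make the --punc default concrete, the sketch below applies the documented pattern to a sample string. The `strip_punc` helper is hypothetical, and removing matches by substitution is an assumption about how the script uses the pattern. The negative lookahead `(?!')` is what keeps the single quote out of the excluded set.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every character matched by the --punc pattern.
# A false value (as with --punc=0) leaves the text untouched.
sub strip_punc {
    my ($text, $punc) = @_;
    return $text unless $punc;
    $text =~ s/$punc//g;
    return $text;
}

my $default = q{(?!')[[:punct:]]};    # the documented --punc default
my $text    = q{"Don't stop," he said; it's good!};

print strip_punc($text, $default), "\n";    # prints: Don't stop he said it's good
print strip_punc($text, 0), "\n";           # prints the text unchanged
```

Keeping the apostrophe means contractions and possessives such as "Don't" and "it's" survive as single words instead of being mangled.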