
NAME

analyze - Analyze a corpus with TF-IDF

SYNOPSIS

  perl analyze --dir=/some/corpus [options]

  perl analyze --dir=/Users/you/Documents/lit/inaugural --top=5 --proper
  perl analyze --dir=/Users/you/Documents/lit/inaugural --phrase='public good'
  perl analyze --dir=/Users/you/Documents/lit/inaugural --dir=/Users/you/Documents/lit/SOTU --top=5
  perl analyze --dir=/Users/you/Documents/lit/Shakespeare --size=3 --top=5
  perl analyze --dir=/Users/you/perl5/lib/site_perl/Music --size=1 --type=pm --punc=0

DESCRIPTION

This program computes the TF-IDF (term frequency-inverse document frequency) measure for the ngrams found in the given corpus.
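
As a rough illustration of what a TF-IDF score over ngrams means, here is a minimal Python sketch. It is not this program's Perl implementation. It assumes one common formulation (term frequency as a share of the document's ngrams, inverse document frequency as log(N/df)), which may differ from what the program computes, and the names ngrams and tfidf are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, size=2):
    """Split text on whitespace and return its word ngrams of the given size."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def tfidf(corpus, size=2):
    """Score every ngram in every document.

    corpus maps a document name to its text. The result maps each
    document name to a dict of {ngram: score}.
    Assumed formula: tf = count / ngrams in the document,
    idf = log(number of documents / documents containing the ngram).
    """
    counts = {name: Counter(ngrams(text, size)) for name, text in corpus.items()}
    n_docs = len(counts)

    # Document frequency: in how many documents does each ngram occur?
    df = Counter()
    for doc_counts in counts.values():
        df.update(doc_counts.keys())

    scores = {}
    for name, doc_counts in counts.items():
        total = sum(doc_counts.values())
        scores[name] = {
            gram: (n / total) * math.log(n_docs / df[gram])
            for gram, n in doc_counts.items()
        }
    return scores
```

Under this formulation an ngram that occurs in every document scores 0, so the highest scores go to phrases that are frequent in one document and rare in the rest of the corpus.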

OPTIONS

--help

Brief help message.

--man

Full documentation.

--dir

Directory containing the text documents of the corpus. May be given more than once, as in the SYNOPSIS.

--size

Ngram phrase size.

Default: 2

--top

Show the top N ngrams seen.

Default: 0

--stop

Constrain the ngrams by excluding stopwords.

Default: 1

--phrase

Search the corpus for the phrase and show its TF-IDF values.

Default: ''

--type

Read only the corpus files with this file extension.

Default: txt

--punc

A regular expression, given as a string, that matches the characters to exclude from the results. A value of 0 excludes no characters.

Default: (?!')[[:punct:]]

Note that the single quote is not excluded by default.
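
To illustrate the effect of the default pattern, here is a small Python sketch. It is an approximation, not the program's Perl code: Python's re module has no [[:punct:]] class, so the sketch builds an equivalent ASCII character class from string.punctuation with the single quote left out. It also assumes that matched characters are simply deleted, and strip_punctuation is a hypothetical name.

```python
import re
import string

# ASCII stand-in for (?!')[[:punct:]]: every ASCII punctuation
# character except the single quote.
PUNCT_EXCEPT_QUOTE = "[" + re.escape(string.punctuation.replace("'", "")) + "]"


def strip_punctuation(text, pattern=PUNCT_EXCEPT_QUOTE):
    """Delete every character matching pattern; 0 disables the filter."""
    if pattern in (0, "0", ""):
        return text
    return re.sub(pattern, "", text)


print(strip_punctuation("Don't stop, believing!"))
```

With the default pattern, contractions such as "Don't" keep their apostrophe while commas and exclamation marks are dropped.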

--lc

Lower-case the results.

Default: 0

--proper

Skip phrases with capitalized words (often proper nouns).

Default: 0