NAME

data-freq - a text frequency analysis tool

SYNOPSIS

    data-freq [options] [--] [files..]

OPTIONS

Field Type

    -t | --text           -y | --year
    -u | --number         -m | --month
    -d | --date | --day   --hour --minute --second
    
    +%m-%d +%H +%H:%M etc. | --strftime=FMT

For multiple fields, each Field Type option begins a new field specification.

Field Selector
```
    -p NUM | --pos=NUM
```
NUM can be zero, positive, or negative. Multiple values can be separated by commas (,), and a range can be given with the .. operator.

Field Output

    -n NUM | --limit=NUM    -z | --zero
    -o NUM | --offset=NUM

NUM can be zero, positive, or negative.

Field Aggregation

    -U | --unique   -M | --max   -N | --min   -Y | --average

Field Sorting

    -V | --value   -F | --first   -A | --asc
    -S | --score   -L | --last    -D | --desc

Input Format
```
    -b STR | --split=STR
```

Output Format

    -I STR | --indent=STR      -R | --root
    -P STR | --prefix=STR      -T | --transpose
    -B STR | --separator=STR   -O | --nopadding

Help

    -v | --version   -h | --help   -a | --man   -c | --check

EXAMPLES

Monthly view counts

    Long:  data-freq --month < access_log
    Short: data-freq -m < access_log

Monthly + Daily

    Long:  data-freq --month --day < access_log
    Short: data-freq -md < access_log

Monthly + Top 3 users per month

    Long:  data-freq --month \
                     --text --pos=2 --limit=3 \
                     access_log
    Short: data-freq -m -tp2 -n3 access_log

Top 5 days in the number of distinct users

    Long:  data-freq --day --score --limit=5 \
                     --text --pos=2 --unique --zero \
                     access_log
    Short: data-freq -dS -n5 -tp2 -Uz access_log

Hourly aggregation

    Long:  data-freq --strftime %H
    Short: data-freq +%H

DESCRIPTION

Overview

data-freq is a command line tool that analyzes the frequency of particular types of text data. It is based on the corresponding Perl module Data::Freq.

For example, consider an input file:

    Abc Def
    Def Ghi
    Ghi Jkl
    Abc Def
    Def Ghi
    Abc Def

The command can be executed as below:

    data-freq filename
    (or)
    data-freq < filename

Then the output will be

    3: Abc Def
    2: Def Ghi
    1: Ghi Jkl

where the number on the left indicates how many times each line of text appears in the input.
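
The counting behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual implementation (which is the Perl module Data::Freq); the function name count_lines is hypothetical.

```python
from collections import Counter

# The six input lines from the example above.
lines = ["Abc Def", "Def Ghi", "Ghi Jkl", "Abc Def", "Def Ghi", "Abc Def"]

def count_lines(lines):
    """Count identical lines and list them, most frequent first."""
    return Counter(lines).most_common()

for text, score in count_lines(lines):
    print(f"{score}: {text}")
```

Running the sketch prints the same three lines as the data-freq output shown above.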

Log file analysis

This tool is designed with log file analysis in mind.

A typical log file for the Apache web server consists of lines like this:

    1.2.3.4 - user1 [01/Jan/2012:01:02:03 +0000] "GET / HTTP/1.1" 200 12

One of the simplest examples for such a log file is

    data-freq --month /var/log/httpd/access_log

which will yield something like this:

    12300: 2012-01
    23400: 2012-02
    34500: 2012-03

Note the date/time information is automatically extracted from the first chunk of text that is enclosed by a pair of brackets [...].

If the access log file is very large, it is advisable to experiment on part of the log until a satisfactory set of options is found. E.g.

    tail -1000 /var/log/httpd/access_log | \
        data-freq --[several different options]

In order to select a specific field from the log line, use the --pos option:

    # Count IP addresses
    data-freq --pos=0 < access_log
    (or)
    data-freq -p0 < access_log
    
    # Count remote usernames
    data-freq --pos=2 < access_log
    (or)
    data-freq -p2 < access_log

The value of the --pos option is a 0-based index into the array of words in each input line.

Multi-level analysis

data-freq is capable of aggregating frequency data at multiple levels.

E.g.

    data-freq --month --day < access_log
    (or)
    data-freq -md < access_log

where --month is for the first level, and --day is for the second level.

The output will look something like this:

    12300: 2012-01
          210: 2012-01-01
          321: 2012-01-02
          432: 2012-01-03
          ...
    23400: 2012-02
          321: 2012-02-01
          432: 2012-02-02
          543: 2012-02-03
          ...
    34500: 2012-03
          543: 2012-03-01
          654: 2012-03-02
          765: 2012-03-03
          ...

Below is another example to list top 3 users per month:

    data-freq --month --text --pos=2 --limit=3 < access_log
    (or)
    data-freq -m -tp2 -n3 < access_log

Output:

    12300: 2012-01
         1200: user1
          230: user2
          135: user3
    23400: 2012-02
         2400: user1
         1122: user4
          765: user3
    34500: 2012-03
         3600: user2
         2100: user3
         1350: user1

Note: the dates are sorted in chronological order, while the users are sorted by count.
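
The two-level grouping can be pictured with the Python sketch below. It assumes the (month, user) pairs have already been extracted from each log line; the sample records and the function name two_level_counts are hypothetical, and the sketch is not how Data::Freq itself is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical (month, user) pairs already extracted from log lines.
records = [
    ("2012-02", "user3"), ("2012-01", "user1"), ("2012-01", "user2"),
    ("2012-02", "user1"), ("2012-01", "user1"), ("2012-02", "user3"),
]

def two_level_counts(records, limit=None):
    """Group by month, then count users within each month."""
    per_month = defaultdict(Counter)
    for month, user in records:
        per_month[month][user] += 1
    result = []
    for month in sorted(per_month):          # first level: by date value
        users = per_month[month]
        total = sum(users.values())          # parent score: sum of children
        result.append((month, total, users.most_common(limit)))  # by count
    return result

for month, total, users in two_level_counts(records, limit=3):
    print(f"{total}: {month}")
    for user, count in users:
        print(f"    {count}: {user}")
```

The first level is ordered by its value and the second by its count, mirroring the note above; limit plays the role of --limit on the second field.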

Field types

There are three basic field types as below:

--text

Each line in the input is added as a text entry so that its frequency is counted.

If the --pos option is given, each line is split into chunks, and only the selected chunk(s) at the position are counted.

--number

The input is interpreted as numbers, which affects the sorting order in the output.

The --pos option should usually be given, but if it is omitted, the first chunk is used as the input number.

--date

The input is parsed as date/time and formatted based on the POSIX::strftime() format. (See POSIX.) The default format is %Y-%m-%d which looks like 2001-02-03.

Unless the --pos option is explicitly given, the first field enclosed by a pair of brackets [...] in the input line is automatically parsed.

The date/time format can be specified with the --strftime option, or a plus sign + followed by the format is interpreted as the --strftime option. E.g.
```
    --strftime=%m-%d
    (or)
    +%m-%d
```
The options below can be used as shortcuts for the date/time format:
```
    --year  : '%Y'
    --month : '%Y-%m'
    --day   : '%Y-%m-%d'
    --hour  : '%Y-%m-%d %H'
    --minute: '%Y-%m-%d %H:%M'
    --second: '%Y-%m-%d %H:%M:%S'
```
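
To see what each shortcut produces, the Python sketch below applies the same format strings to the timestamp from the Apache sample line. The input layout passed to strptime and the helper name format_timestamp are assumptions made for this illustration; data-freq's own date parsing is not shown here.

```python
from datetime import datetime

# Format strings copied from the shortcut table above.
SHORTCUTS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H",
    "minute": "%Y-%m-%d %H:%M",
    "second": "%Y-%m-%d %H:%M:%S",
}

def format_timestamp(raw, fmt):
    """Parse an Apache-style timestamp (assumed layout) and reformat it."""
    parsed = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
    return parsed.strftime(fmt)

stamp = "01/Jan/2012:01:02:03 +0000"
for name, fmt in SHORTCUTS.items():
    print(f"--{name}: {format_timestamp(stamp, fmt)}")
print(f"+%m-%d: {format_timestamp(stamp, '%m-%d')}")
```

Lines that format to the same string fall into the same bucket, which is why a coarser format such as %Y-%m yields monthly counts.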

To specify multiple fields, give one field type option per field: each field type option marks the beginning of the group of options that belong to that field.

The default type is --text and it can be omitted for the first field, but cannot be omitted from the second field on.

    data-freq --text --pos=2 # correct
    data-freq --pos=2        # ok
    
    data-freq --text --pos=2 --text --pos=0 # correct
    data-freq --pos=2 --text --pos=0        # ok
    data-freq --pos=2 --pos=0               # incorrect

Selecting fields

--pos

Selects a field at the given position in each input line. The position is a 0-based index (i.e. the first chunk is the position 0).

Multiple positions can be specified with comma-separated numbers or a range described by a .. operator.
```
    data-freq --pos=2
    data-freq --pos=1,2,5
    data-freq --pos=0..3
```

For a field with the --pos option, the input line is split into chunks by whitespace (unless the --split option is explicitly given). Any chunk enclosed by a pair of parentheses (...), brackets [...], braces {...}, double quotes "...", or single quotes '...' is grouped as one field, even if it contains whitespace.

Nested parentheses, brackets, and braces are not supported.

For a field of the --date type, even if the --pos option is not set, the first chunk enclosed by a pair of brackets [...] is automatically selected.

Some log formats do not enclose the date/time by brackets. In that case, the --pos option with a range operator is useful.

For example, if the log line looks like this:

    01 Jan 2012 01:02:03,456 INFO - test log

then the --pos option can be used as below:

    data-freq --pos=0..3
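
The chunking and selection rules above can be approximated with the Python sketch below. It is not the tool's real tokenizer: the regular expression, the choice to keep the enclosing delimiters in each chunk, joining selected chunks with a single space, and treating the range 0..3 as inclusive are all assumptions of this illustration.

```python
import re

# A chunk is a (...), [...], {...}, "..." or '...' group (no nesting),
# or otherwise a run of non-whitespace characters.
CHUNK = re.compile(r"""\([^)]*\)|\[[^\]]*\]|\{[^}]*\}|"[^"]*"|'[^']*'|\S+""")

def split_chunks(line):
    """Split a line into chunks, keeping enclosed groups together."""
    return CHUNK.findall(line)

def select(line, positions):
    """Pick chunks at the given 0-based positions and join them."""
    chunks = split_chunks(line)
    return " ".join(chunks[p] for p in positions)

apache = '1.2.3.4 - user1 [01/Jan/2012:01:02:03 +0000] "GET / HTTP/1.1" 200 12'
plain = "01 Jan 2012 01:02:03,456 INFO - test log"

print(select(apache, [0]))           # like --pos=0
print(select(apache, [2]))           # like --pos=2
print(select(plain, range(0, 4)))    # like --pos=0..3, assumed inclusive
```

With the second sample line, the four selected chunks together form the full timestamp, which is what makes the range form useful for logs without brackets.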

Limiting output

In the output, the number of records to display under each category can be limited by the options below:

--limit

Limits the records to the given number. If a negative number is specified, the number is counted from the end.

--offset

Skips as many records as the given number. If a negative number is specified, the number is counted from the end.

--zero

Short for --limit=0.

Sorting results

The output can be sorted on a per-field basis by the attributes below:

--score

Sorts by the score (left-hand side numbers).

--value

Sorts by the value (right-hand side texts).

--first

Sorts by the first occurrence in the input.

--last

Sorts by the last occurrence in the input.

The direction of the order can be controlled by these respective options:

--asc

Sorts in the ascending order.

--desc

Sorts in the descending order.

If the sorting and/or ordering options above are omitted, the default sorting method will be determined as follows:

1. If the field type is --text, the output will be sorted by --score by default (i.e. the most frequent text first). Otherwise (if the field type is either --number or any kind of --date), the output will be sorted by --value by default (i.e. the number-line or time-line order).

2. If the sorting type is either --score or --last, the output will be sorted in the descending order by default. Otherwise, the default is the ascending order.
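
The two default rules above can be written as a small decision function. The Python sketch below only restates those rules; the function name resolve_sort and the string labels are hypothetical and do not correspond to anything in the tool's interface.

```python
def resolve_sort(field_type, sort_by=None, order=None):
    """Return (sort_by, order) after applying the documented defaults.

    field_type: "text", "number", or "date"
    sort_by:    "score", "value", "first", "last", or None if omitted
    order:      "asc", "desc", or None if omitted
    """
    if sort_by is None:
        # Rule 1: text sorts by score, number and date sort by value.
        sort_by = "score" if field_type == "text" else "value"
    if order is None:
        # Rule 2: score and last default to descending, others to ascending.
        order = "desc" if sort_by in ("score", "last") else "asc"
    return sort_by, order

print(resolve_sort("text"))                    # ('score', 'desc')
print(resolve_sort("date"))                    # ('value', 'asc')
print(resolve_sort("date", sort_by="score"))   # ('score', 'desc')
```

The last call corresponds to a field such as --day --score, where only the sorting attribute is given and the direction is derived from it.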

Aggregating subcategory

If one of the aggregation options below is given to a field, it alters the meaning of what is displayed as the score of its parent field.

Without aggregation, the frequency of each field is counted independently, and the parent field count is usually equal to the sum of the child field counts. With an aggregation option, the parent score is computed by the selected method instead of by summing.

--unique

Scores the number of distinct values.

--max

Scores the maximum count.

--min

Scores the minimum count.

--average

Scores the average count.

Below is an example to show top 5 days in the number of distinct users:

    data-freq --day --score --limit=5 \
              --text --pos=2 --unique --zero \
              access_log
    (or)
    data-freq -dS -n5 -tp2 -Uz access_log

where --day is the daily aggregate for the first level, and --text --pos=2 is for the usernames per day.

The --score option is to sort the first field by the score (unique usernames) rather than by the date itself, and then the top 5 days will be printed out with --limit=5.

The --unique option makes the first field count the number of unique usernames instead of the total number of occurrences, while the --zero option for the second field hides all the individual usernames, since the only purpose here is to list the dates.

As a result, the output will look like

    1100: 2012-03-05
     860: 2012-02-20
     789: 2012-02-13
     641: 2012-03-12
     580: 2012-02-27

where each number on the left is the number of unique users on each day, and the listed dates are the top 5 among others.
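
How each aggregation option changes the parent score can be sketched in Python as follows. The per-user counts are made-up sample data, the name parent_score is hypothetical, and reading --average as the arithmetic mean of the child counts is an assumption of this sketch.

```python
from statistics import mean

# Maps an aggregation mode to a function over the child counts.
AGGREGATORS = {
    "sum": sum,        # default behavior without an aggregation option
    "unique": len,     # number of distinct child values
    "max": max,        # largest child count
    "min": min,        # smallest child count
    "average": mean,   # mean child count (assumed meaning of --average)
}

def parent_score(child_counts, mode="sum"):
    """Compute the parent field's score from its children's counts."""
    counts = list(child_counts.values())
    return AGGREGATORS[mode](counts)

# Hypothetical per-user request counts for a single day.
day = {"user1": 5, "user2": 3, "user3": 1}
for mode in AGGREGATORS:
    print(f"{mode}: {parent_score(day, mode)}")
```

In the example above, --unique on the username field corresponds to the "unique" mode here: the day's score becomes the number of distinct users rather than the total number of requests.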

Input format

--split

Specifies the field separator for each of the input lines.

For example, in order to analyze a CSV file,
```
    data-freq --split=, --pos=2 < input.csv
```
will count the third field in each line.

Output format

There are a number of ways to control the output format.

--indent

Changes the indent string (spaces by default, but any characters are allowed), which is repeated (depth - 1) times at each field level. E.g.
```
    data-freq --indent=++
```
will output something like this:
```
    21: AAA
    ++12: BBB
    ++++10: CCC
    ++++ 2: DDD
    ++ 9: EEE
    ++++ 6: FFF
    ++++ 3: GGG
```

--prefix

Inserts a prefix between the indent and the score value.

Example:

    data-freq --prefix='* '

Output:

    * 21: AAA
        * 12: BBB
            * 10: CCC
            *  2: DDD
        *  9: EEE
            *  6: FFF
            *  3: GGG

--separator

Sets the separator between the score and the counted text.

Example:

    data-freq --separator=' => '

Output:

    21 => AAA
        12 => BBB
            10 => CCC
             2 => DDD
         9 => EEE
             6 => FFF
             3 => GGG

--root

Also displays the grand total at the level 0. All the subsequent levels are shifted to the right.

    34: Total
        21: AAA
            12: BBB
                10: CCC
                 2: DDD
             9: EEE
                 6: FFF
                 3: GGG
        13: HHH
            13: III
                12: JJJ
                 1: KKK

--transpose

Swaps the position of the score and the counted text.

    AAA: 21
        BBB: 12
            CCC: 10
            DDD: 2
        EEE: 9
            FFF: 6
            GGG: 3

--nopadding

Suppresses the left-hand space padding that is applied by default to align the counted texts.
```
    21: AAA
        12: BBB
            10: CCC
            2: DDD
        9: EEE
            6: FFF
            3: GGG
```
Note: the indent above is fixed at a multiple of 4 spaces, so the texts at the same level may not be aligned.
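
The way the indent, prefix, separator, and padding combine into each output line can be sketched in Python. The tree below reuses the sample values from this section; the function names are hypothetical, and padding every score to the widest score in the whole tree is an assumption that happens to reproduce the sample outputs above, not a statement about the tool's internals.

```python
# Each node is (score, text, children), using the sample values above.
TREE = [
    (21, "AAA", [
        (12, "BBB", [(10, "CCC", []), (2, "DDD", [])]),
        (9, "EEE", [(6, "FFF", []), (3, "GGG", [])]),
    ]),
]

def max_width(nodes):
    """Width of the longest score anywhere in the tree."""
    return max(
        max(len(str(score)), max_width(children) if children else 0)
        for score, _text, children in nodes
    )

def render(nodes, indent="    ", prefix="", separator=": ", padding=True,
           _depth=0, _width=None):
    """Return the output lines for a tree of (score, text, children)."""
    if not nodes:
        return []
    if _width is None:
        # Assumed rule: right-align all scores to one common width.
        _width = max_width(nodes) if padding else 0
    lines = []
    for score, text, children in nodes:
        lines.append(f"{indent * _depth}{prefix}"
                     f"{str(score).rjust(_width)}{separator}{text}")
        lines.extend(render(children, indent, prefix, separator, padding,
                            _depth + 1, _width))
    return lines

print("\n".join(render(TREE, indent="++")))
print("\n".join(render(TREE, padding=False)))
```

The first call mirrors the --indent=++ example and the second mirrors --nopadding; passing prefix="* " or separator=" => " gives the corresponding layouts for --prefix and --separator.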

AUTHOR

Mahiro Ando, <mahiro at cpan.org>

LICENSE AND COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.
