The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

data-freq - a text frequency analysis tool

SYNOPSIS

    data-freq [options] [--] [files..]

OPTIONS

  • Field Type

        -t | --text           -y | --year
        -u | --number         -m | --month
        -d | --date | --day   --hour --minute --second
        
        +%m-%d +%H +%H:%M etc. | --strftime=FMT

    For multiple fields, each Field Type Option begins new field specification.

  • Field Selector

        -p NUM | --pos=NUM

    NUM can be zero, positive, negative, multiple separated by commas (,), and/or a range with a .. operator.

  • Field Output

        -n NUM | --limit=NUM    -z | --zero
        -o NUM | --offset=NUM

    NUM can be zero, positive, or negative.

  • Field Aggregation

        -U | --unique   -M | --max   -N | --min   -Y | --average
  • Field Sorting

        -V | --value   -F | --first   -A | --asc
        -S | --score   -L | --last    -D | --desc
  • Input Format

        -b STR | --split=STR
  • Output Format

        -I STR | --indent=STR      -R | --root
        -P STR | --prefix=STR      -T | --transpose
        -B STR | --separator=STR   -O | --nopadding
  • Help

        -v | --version   -h | --help   -a | --man   -c | --check

EXAMPLES

  • Monthly view counts

        Long:  data-freq --month < access_log
        Short: data-freq -m < access_log
  • Monthly + Daily

        Long:  data-freq --month --day < access_log
        Short: data-freq -md < access_log
  • Monthly + Top 3 users per month

        Long:  data-freq --month \
                         --text --pos=2 --limit=3 \
                         access_log
        Short: data-freq -m -tp2 -n3 access_log
  • Top 5 days in the number of distinct users

        Long:  data-freq --day --score --limit=5 \
                         --text --pos=2 --unique --zero \
                         access_log
        Short: data-freq -dS -n5 -tp2 -Uz access_log
  • Hourly aggregation

        Long:  data-freq --strftime %H
        Short: data-freq +%H

DESCRIPTION

Overview

data-freq is a command line tool to analyze frequency of particular types of text data. It is based on the corresponding Perl module Data::Freq.

For example, consider an input file:

    Abc Def
    Def Ghi
    Ghi Jkl
    Abc Def
    Def Ghi
    Abc Def

The command can be executed as below:

    data-freq filename
    (or)
    data-freq < filename

Then the output will be

    3: Abc Def
    2: Def Ghi
    1: Ghi Jkl

where the number on the left indicates how many times each line of text appears in the input.

Log file analysis

This tool is designed especially in favor of log file analysis.

A typical log file for the Apache web server consists of lines like this:

    1.2.3.4 - user1 [01/Jan/2012:01:02:03 +0000] "GET / HTTP/1.1" 200 12

One of the simplest examples for such a log file is

    data-freq --month /var/log/httpd/access_log

which will yield something like this:

    12300: 2012-01
    23400: 2012-02
    34500: 2012-03

Note the date/time information is automatically extracted from the first chunk of text that is enclosed by a pair of brackets [...].

If the access log file is very large, it is recommended to do some experiment for a part of the log until satisfactory options are determined. E.g.

    tail -1000 /var/log/httpd/access_log | \
        data-freq --[several different options]

In order to select a specific field from the log line, use the --pos option:

    # Count IP addresses
    data-freq --pos=0 < access_log
    (or)
    data-freq -p0 < access_log
    
    # Count remote usernames
    data-freq --pos=2 < access_log
    (or)
    data-freq -p2 < access_log

If the --pos option is used, it is regarded as the 0-based index for the array of words in each input line.

Multi-level analysis

data-freq is capable of aggregating frequency data at multiple levels.

E.g.

    data-freq --month --day < access_log
    (or)
    data-freq -md < access_log

where --month is for the first level, and --day is for the second level.

The output will look something like this:

    12300: 2012-01
          210: 2012-01-01
          321: 2012-01-02
          432: 2012-01-03
          ...
    23400: 2012-02
          321: 2012-02-01
          432: 2012-02-02
          543: 2012-02-03
          ...
    34500: 2012-03
          543: 2012-02-01
          654: 2012-02-02
          765: 2012-02-03
          ...

Below is another example to list top 3 users per month:

    data-freq --month --text --pos=2 --limit=3 < access_log
    (or)
    data-freq -m -tp2 -n3 < access_log

Output:

    12300: 2012-01
         1200: user1
          230: user2
          135: user3
    23400: 2012-02
         2400: user1
         1122: user4
          765: user3
    34500: 2012-03
         3600: user2
         2100: user3
         1350: user1

Note: the dates are sorted by the time-line order, while the users are sorted by the count.

Field types

There are three basic field types as below:

  • --text

    Each line in the input is added as a text entry so that its frequency is counted.

    If the --pos option is given, each line is split into chunks, and only the selected chunk(s) at the position are counted.

  • --number

    The input is interpreted as numbers, which affects the sorting order in the output.

    --pos option should usually be given, but if it is omitted, the first chunk is used as the input number.

  • --date

    The input is parsed as date/time and formatted based on the POSIX::strftime() format. (See POSIX.) The default format is %Y-%m-%d which looks like 2001-02-03.

    Unless --pos option is explicitly given, the first field enclosed by a pair of brackets [...] in the input line is automatically parsed.

    The date/time format can be specified with the --strftime option, or a plus sign + followed by the format is interpreted as the --strftime option. E.g.

        --strftime=%m-%d
        (or)
        +%m-%d

    The options below can be used as shortcuts for the date/time format:

        --year  : '%Y'
        --month : '%Y-%m'
        --day   : '%Y-%m-%d'
        --hour  : '%Y-%m-%d %H'
        --minute: '%Y-%m-%d %H:%M'
        --second: '%Y-%m-%d %H:%M:%S'

In order to place multiple field specifications, each of the field type option indicates the beginning of the group of options that belong to the same field.

The default type is --text and it can be omitted for the first field, but cannot be omitted from the second field on.

    data-freq --text --pos=2 # correct
    data-freq --pos=2        # ok
    
    data-freq --text --pos=2 --text --pos=0 # correct
    data-freq --pos=2 --text --pos=0        # ok
    data-freq --pos=2 --pos=0               # incorrect

Selecting fields

  • --pos

    Selects a field at the given position in each input line. The position is a 0-based index (i.e. the first chunk is the position 0).

    Multiple positions can be specified with comma-separated numbers or a range described by a .. operator.

        data-freq --pos=2
        data-freq --pos=1,2,5
        data-freq --pos=0..3

For a field with the --pos option, the input line is split into chunks by whitespaces (unless the --split option is explicitly given), while any chunk enclosed by a pair of parentheses (...), brackets [...], braces {...}, double quotes "...", or single quotes '...' is grouped as one field, even if it contains whitespaces.

Nested parentheses, brackets, and braces are not supported.

For the field of the --date type, even if the --pos option is not set, the first chunk enclosed by a pair of brackets [...] is automatically selected.

Some log formats do not enclose the date/time by brackets. In that case, the --pos option with a range operator is useful.

For example, if the log line looks like this:

    01 Jan 2012 01:02:03,456 INFO - test log

then the --pos option can be used as below:

    data-freq --pos=0..3

Limiting output

In the output, the number of records to display under each category can be limited by the options below:

  • --limit

    Limits the records to the given number. If a negative number is specified, the number is counted from the end.

  • --offset

    Skips as many records as the given number. If a negative number is specified, the number is counted from the end.

  • --zero

    Short for --limit=0.

Sorting results

The output can be sorted on the per-field basis by the attributes below:

  • --score

    Sorts by the score (left-hand side numbers).

  • --value

    Sorts by the value (right-hand side texts).

  • --first

    Sorts by the first occurrence in the input.

  • --last

    Sorts by the last occurrence in the input.

The direction of the order can be controlled by these respective options:

  • --asc

    Sorts in the ascending order

  • --desc

    Sorts in the descending order

If the sorting and/or ordering options above are omitted, the default sorting method will be determined as follows:

1. If the field type is --text, the output will be sorted by --score by default (i.e. the most frequent text first). Otherwise (if the field type is either --number or any kind of --date), the output will be sorted by --value by default (i.e. the number-line or time-line order).

2. If the sorting type is either --score or --last, the output will be sorted in the descending order by default. Otherwise, the default is the ascending order.

Aggregating subcategory

If one of the aggregation options below is given to a field, it alters the meaning of what is displayed as the score of its parent field.

Without the aggregation, the frequency of each field is counted independently, where the parent field count is usually equal to the sum of the child field counts. The aggregation options use the alternative method instead of scoring the sum.

  • --unique

    Scores the number of distinct values.

  • --max

    Scores the maximum count.

  • --min

    Scores the minimum count.

  • --average

    Scores the average count.

Below is an example to show top 5 days in the number of distinct users:

    data-freq --day --score --limit=5 \
              --text --pos=2 --unique --zero \
              access_log
    (or)
    data-freq -dS -n5 -tp2 -Uz access_log

where --day is the daily aggregate for the first level, and --text --pos=2 is for the usernames per day.

The --score option is to sort the first field by the score (unique usernames) rather than by the date itself, and then the top 5 days will be printed out with --limit=5.

The --unique option makes the first field count the number of unique usernames instead of the total number of occurrences, while the --zero option for the second field hides all the individual usernames, since the only purpose here is to list the dates.

As a result, the output will look like

    1100: 2012-03-05
     860: 2012-02-20
     789: 2012-02-13
     641: 2012-03-12
     580: 2012-02-27

where each number on the left is the number of unique users on each day, and the listed dates are the top 5 among others.

Input format

  • --split

    Specifies the field separator for each of the input lines.

    For example, in order to analyze a CSV file,

        data-freq --split=, --pos=2 < input.csv

    will count the third field in each line.

Output format

There are a number of ways to control the output format.

  • --indent

    Alters the indent spaces (or any other characters) that repeat as many times as the depth (minus 1) at each field level. E.g.

        data-freq --indent=++

    will output something like this:

        21: AAA
        ++12: BBB
        ++++10: CCC
        ++++ 2: DDD
        ++ 9: EEE
        ++++ 6: FFF
        ++++ 3: GGG
  • --prefix

    Prepends a prefix between the indent and the score value.

    Example:

       data-freq --prefix='* '

    Output:

        * 21: AAA
            * 12: BBB
                * 10: CCC
                *  2: DDD
            *  9: EEE
                *  6: FFF
                *  3: GGG
  • --separator

    Sets the separator between the score and the counted text.

    Example:

        data-freq --separator=' => '

    Output:

        21 => AAA
            12 => BBB
                10 => CCC
                 2 => DDD
             9 => EEE
                 6 => FFF
                 3 => GGG
  • --root

    Also displays the grand total at the level 0. All the subsequent levels are shifted to the right.

        34: Total
            21: AAA
                12: BBB
                    10: CCC
                     2: DDD
                 9: EEE
                     6: FFF
                     3: GGG
            13: HHH
                13: III
                    12: JJJ
                     1: KKK
  • --transpose

    Swaps the position of the score and the counted text.

        AAA: 21
            BBB: 12
                CCC: 10
                DDD: 2
            EEE: 9
                FFF: 6
                GGG: 3
  • --nopadding

    Suppresses the space padding to the left, which is by default for the alignment of the counted texts.

        21: AAA
            12: BBB
                10: CCC
                2: DDD
            9: EEE
                6: FFF
                3: GGG

    Note: the indent space above is strictly fixed as multiple of 4 spaces, while the texts at the same level may not be aligned.

AUTHOR

Mahiro Ando, <mahiro at cpan.org>

LICENSE AND COPYRIGHT

Copyright 2012 Mahiro Ando.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.