NAME

process_logs - read, manipulate and report on various log files

USAGE

 process_logs [options] -c configuration_file.yml

OPTIONS

 -c --config_file file                  Specifies the configuration file
 -a --reprocess_all                     Reprocess all files
 --reprocess_from date                  Reprocess everything after [date]
 -v --verbose                           Increase debugging output (can be repeated)
 --min_start_date date                  Force all start dates to be at least [date]
 --max_end_date date                    Force all end dates to be no more than [date]
 --priority_bias METHOD                 Choose priority adjustment from: 'random', 'date', 'depth'
 --target_date DATE                     For priority bias date & depth, aim for [date]
 --ignore_code_dependencies, --no_code  Ignore dependencies on code 
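
For example, to reprocess everything after a given date with extra debugging output (the file names here are illustrative):

 process_logs -v -v --reprocess_from 2009-06-01 -c jobs.yml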

DESCRIPTION

Process logs using the Log::Parallel system.

process_logs is the driver script for processing data logs through a series of jobs specified in a configuration file.

Each job consists of a set of steps to process input files and create an output file (possibly bucketized). This is very much like a map-reduce framework. The steps are:

1. Parse

The first step is to parse the input files. The input files can come from multiple places/steps and be in multiple formats. They must all be sorted on the same fields so that they can be joined together in an ordered stream.
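
A shared sort order is what makes the join possible: two already-sorted streams can be combined with a simple merge. The sketch below is illustrative only; read_next, emit, and the ts field are hypothetical stand-ins, not part of the Log::Parallel API.

 # Merge two streams that are both sorted on the 'ts' field.
 # read_next() and emit() are hypothetical helpers for this sketch.
 my $a = read_next($stream_a);
 my $b = read_next($stream_b);
 while (defined $a or defined $b) {
     if (defined $a and (!defined $b or $a->{ts} le $b->{ts})) {
         emit($a);
         $a = read_next($stream_a);
     }
     else {
         emit($b);
         $b = read_next($stream_b);
     }
 }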

2. Filter

As items are read in, the filter code is executed. Items are dropped unless the filter code returns a true value.
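
For instance, a filter that keeps only server errors might look like the following. The item layout and calling convention are assumptions for illustration; the real filter code is supplied through the configuration file.

 # Hypothetical filter: return true to keep the item, false to drop it.
 my $filter = sub {
     my ($item) = @_;
     return $item->{status} =~ /^5\d\d$/;   # keep only 5xx responses
 };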

3. Group

The items that make it past the filter can optionally be grouped together so that they're passed to the next stage as an array of items.
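
The usual pattern is to batch consecutive items that share a key in the sorted stream; a minimal sketch (the user_id field and the process_group helper are hypothetical):

 # Collect consecutive items with the same user_id into one group.
 # process_group() is a hypothetical hand-off to the next stage.
 my (@group, $current);
 for my $item (@items) {
     if (@group and $item->{user_id} ne $current) {
         process_group(\@group);   # finished group goes downstream
         @group = ();
     }
     $current = $item->{user_id};
     push @group, $item;
 }
 process_group(\@group) if @group;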

4. Transform

The transform step consumes items and generates items. It consumes them one at a time (or one group at a time), but it can produce zero or many items for each one it consumes. It can take events and squish them together into a session; it can take a session and break it apart into events; or it can take sessions and produce a single aggregated result once it has consumed all the input.
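
As a sketch of the one-in, many-out shape (the field names and the return-a-list convention are assumptions, not the exact Log::Parallel interface):

 # Hypothetical transformer: explode one session item into its events.
 my $transform = sub {
     my ($session) = @_;
     return map {
         { user_id => $session->{user_id}, event => $_ }
     } @{ $session->{events} || [] };   # zero or more output items
 };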

5. Bucketize

As new resultant items are generated, they can be bucketized into many buckets and split across a cluster.
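
One common way to bucketize is to hash a key field and take it modulo the bucket count, so the same key always lands in the same bucket; a sketch (the user_id field and the bucket count are assumptions):

 use Digest::MD5 qw(md5_hex);

 my $n_buckets = 16;
 sub bucket_for {
     my ($item) = @_;
     # Same key => same bucket; distinct keys spread across the cluster.
     return hex(substr(md5_hex($item->{user_id}), 0, 8)) % $n_buckets;
 }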

6. Write

The resultant items are written in the format specified. Since the next step may run things through unix sort, the output format may need to be squished onto one line.
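
One-line encodings matter because unix sort treats each line as a record. A sketch of a writer that keeps each item on a single line (JSON is an illustrative choice here; the real writers are described in Log::Parallel::Writers):

 use JSON;

 my $codec = JSON->new->canonical;   # compact output, no embedded newlines
 sub write_item {
     my ($fh, $item) = @_;
     print {$fh} $codec->encode($item), "\n";   # one item per line
 }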

7. Sort

The output files get sorted according to fields defined in the resultant items.
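
In-process, that is equivalent to an ordinary Perl sort on the key fields (the start_time field is a hypothetical example):

 # Sort resultant items on a named key field.
 my @sorted = sort { $a->{start_time} cmp $b->{start_time} } @items;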

8. Post-Sort Transform

If the writer had to encode the output for unix sort, it gets a chance to decode it after sorting so that it ends up in its desired format.
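
Continuing the one-line JSON sketch from the write step, the post-sort pass just decodes each line and re-emits the item in its final format (illustrative only; $sorted_fh is a hypothetical handle):

 use JSON;

 my $codec = JSON->new;
 while (my $line = <$sorted_fh>) {
     my $item = $codec->decode($line);
     # ... write $item back out in its desired final format
 }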

CONFIGURATION FILE

The configuration file is in YAML format and is preprocessed with Config::YAMLMacros which provides some macro directives (include and define).

It is post-processed with Config::Checker, which allows for some flexibility (sloppiness) on the part of configuration writers. Single items will be automatically turned into lists when needed.

The configuration file has several sections. The main section is the one that defines the jobs that process_logs runs.

The exact details of each section are described in Log::Parallel::ConfigCheck.
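
To give a feel for the overall shape, here is a skeletal example; every key below is invented for illustration, and the real schema is the one documented in Log::Parallel::ConfigCheck:

 # All keys here are hypothetical; see Log::Parallel::ConfigCheck
 # for the actual configuration schema.
 jobs:
   - name: sessionize
     source: raw_web_logs
     buckets: 16
     destination: sessions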

SEE ALSO

The Parser API is defined in Log::Parallel::Parsers. The Writers API is defined in Log::Parallel::Writers. Descriptions of the steps can be found in Log::Parallel::ConfigCheck.

LICENSE

This package may be used and redistributed under the terms of either the Artistic 2.0 or LGPL 2.1 license.