NAME

Log::Parallel::ConfigCheck - Log processing configuration file validation

SYNOPSIS

 use Config::YAMLMacros qw(get_config);
 use Log::Parallel::ConfigCheck;

 my $config = get_config($config_file);
 validate_config($config);

DESCRIPTION

ConfigCheck uses Config::Checker to validate the log processing configuration used by process_logs. ConfigCheck is essentially just a description of the log processing configuration options.

The configuration file has several sections. The main section is the one that defines the jobs that process_logs runs.
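
At a high level, the sections described below all live at the top level of one YAML configuration file. A minimal sketch, with hypothetical hostnames and paths, might look like this:

 master_node: log1.example.com
 headers: /data/logs/headers
 hostsinfo:
   log1.example.com:
     datadir: /data/logs
 sources:
   - name: raw apache logs
     # ... see the Sources Section below
 jobs:
   - name: sessionize apache logs
     # ... see the Jobs Section below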

Jobs Section

The jobs section describes the processing steps that will be applied to the logs. This is the meat of the process.

The jobs are a YAML array under the jobs key in the main section.

The keys are:

name

Required. The name of the job is used only for diagnostics. It is not required to be unique, except within its specified time range.

source

Required. A list of sources of information. These can come from the destination fields of other prior jobs or from the name fields of sources (see below). Multiple items may be listed (comma separated or as a YAML array) but the sources must all be in the same sort order. The input to the filter and transform steps will be in sorted order. An example source would be something like raw apache logs.
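
For example, a job with two inputs could list them either way (the source names are hypothetical):

 source: raw apache logs, raw varnish logs

or, equivalently, as a YAML array:

 source:
   - raw apache logs
   - raw varnish logs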

destination

Required. This is the name of what this job produces. This needs to be unique within the time range that this job is valid for. An example destination might be queries extracted from sessions.

output_format

Required. The name of the output format for this job's output. This needs to be one of the Writers that registers itself with Log::Parallel::Writers. Examples are: TSV_as_sessions, Sessions, TSV.

hosts

Not implemented yet. Optional, defaults to the hosts of the previous job or source. Which hosts the output from this job should be written to.

path

Required. The path name where output from this job should be written. The path name will undergo macro substitutions from Config::YAMLMacros, from Log::Parallel::Durations, and from Log::Parallel::Task. These substitutions include (a path that uses several of them appears in the example job entry at the end of this list):

DATADIR

Defined per-host in the Hosts section.

BUCKET

The bucket number. Five digits.

YYYY

MM

DD

FROM_YYYY

FROM_MM

FROM_DD

DURATION

Eg 3 weeks or daily.

FROM_JD

valid_from

Optional, defaults to the earliest time based on its sources. The earliest date for which this job should be run.

valid_to

Optional, defaults to the latest time based on its sources. The last date for which this job should be run.

filter

Optional. Perl code to choose if the input $log object should be processed or ignored. A true return value indicates that the object should be processed.

To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how filter_config can be used.
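
A minimal sketch of a filter in its simple, non-closure form, assuming the parser provides a status field on $log (the field name is hypothetical):

 filter: |
   $log->{status} == 200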

filter_config

Optional. A HASH of extra information to provide at compile time for the filter to use.

grouper

Optional Perl code to group log objects together. The default is not to group. If grouper is set, then the $log objects will be formed into groups based on the output of the grouper function. The input is assumed to be in order so that groups form in sequence and only one group need be remembered at a time. Once grouped, the transform step will receive a reference to an array of log objects instead of the single log object it would receive if there was no grouper.

To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how grouper_config can be used.
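
A minimal sketch of a grouper that groups consecutive events by a hypothetical session_id field (the return value is the group key):

 grouper: |
   $log->{session_id}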

grouper_config

Optional. A HASH of extra information to provide at compile time for the grouper to use.

transform

Optional. Perl code to transform input $log objects into zero or more output $log objects. This can do re-grouping to turn multiple events into a session or vice versa. It can do aggregation (see Stream::Aggregate) and collapse many log entries into statistics.

To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how config can be used.
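
A sketch of a transform, assuming the code's value supplies the output objects (see Log::Parallel::Task for the exact calling convention) and assuming hypothetical query and time fields:

 transform: |
   # emit one trimmed-down object per input that has a query, nothing otherwise
   $log->{query}
     ? { query => $log->{query}, time => $log->{time} }
     : ()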

config

Optional. A HASH of extra information to provide at compile time for the transform to use.

sort_by

Optional. A list of fields (in the $log object) to use to sort the output. The list can be comma-separated or it can be a YAML list. Each field name may be followed by unix sort flags in parentheses. For example:

 sort_by:
   - id()
   - count(n)
   - name

The sort flags are optional, but if none are present, then the data will be examined (which isn't free) and a guess made as to what kind of data is present. It's better to use flags. If any flag is used, then no data will be examined and any field without a flag will be treated as a text field. Empty parentheses () signify text.

The currently supported flags are n, g, and r. More could be added by modifying make_compare_func() in Log::Parallel::Task.

buckets

Optional. A number: how many buckets to split the output from this job into. This would be used to allow parallel processing. Defaults to one per host.

bucketizer

Not implemented yet. Optional. When splitting the output into buckets, the output will be split based on the modulo of the md5-sum of the return value from this bit of Perl code. If you want to make sure that all URLs from the same domain end up in the same bucket, return the domain name.

To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how bucket_config can be used.
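
A sketch of such a bucketizer implementing the domain example above, assuming a hypothetical url field on $log:

 bucketizer: |
   # bucket by the domain part of the URL so a domain's records stay together
   my ($domain) = ($log->{url} || '') =~ m{^[a-z]+://([^/]+)}i;
   $domain || ''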

bucket_config

Optional. A HASH of extra information to provide at compile time for the bucketizer to use.

frequency

Optional, defaults to the frequency of its source. How often should this job be run? This is parsed by Log::Parallel::Durations. Examples are: daily, monthly, on the 3rd Sunday each month.

timespan

Optional, defaults to the length of the frequency. How much data should be processed by the job? This is parsed by Log::Parallel::Durations. Examples are: daily, 3 weeks.

remove_after

Not Implemented Yet. Optional. How long should the output of this job be kept?

parser_config

Optional. Extra parameters (a hash) for the parsers of the output of this job.

input_config

Optional. Extra parameters for the parsers used to read the input for this job.

output_config

Optional. Extra parameters for the Writer used to save the output from this job.

DISABLED

Optional. 0 or 1. If a true value, this job is skipped.
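
Putting the required keys together, a minimal job entry might look like the following (the names and the path layout are illustrative only; the path is quoted so the leading % is plain YAML):

 jobs:
   - name:           extract queries
     source:         apache sessions
     destination:    queries from sessions
     output_format:  TSV
     path:           "%DATADIR%/queries/%YYYY%/%MM%/%DD%/queries.%BUCKET%.gz"
     frequency:      daily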

Sources Section

The sources section specifies where the raw inputs for the log processing system can be found.

The sources are a YAML array under the sources key in the main section.

name

Required. The name of the source. This must be unique among other sources and jobs for the time period within which this source is valid.

hosts

Required. A list of hosts (YAML array or comma-separated) where the input files can be found.

path

Required. The path to the input files. The path name can have predefined and regular-expression wildcard matches. The predefined matches are:

%YYYY%

Match a year.

%MM%

Match a month number.

%DD%

Match a day number.

Regular expression matches are defined as %NAME=regex%. For example, if the months are 1-12 instead of 01-12, use %MM=\d\d?% instead of %MM% to match month numbers.
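
For example, a source path for daily Apache logs whose month numbers are not zero-padded might look like this (the directory layout is hypothetical):

 path: /var/log/apache/%YYYY%/%MM=\d\d?%/%DD%/access_log.gz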

valid_from

Required. The earliest date for which this source is valid.

valid_to

Optional, defaults to now. The last date for which this source is valid.

format

The data format of this source. This must be one of the Parsers that registers itself with Log::Parallel::Parsers.

remove_after

Not Implemented Yet. Optional. How long until the source files should be removed to recover disk space and protect our users' privacy.

sorted_by

Optional. How is this data ordered? A list of fields (YAML array or comma-separated) from the $log objects returned by the Parser. Usually these are ordered by time.

parser_config

Optional. A hash of extra parameters for the parsers that will read this data.
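
A complete source entry combining these keys might look like the following (the hostnames, path, dates, field names, and parser name are illustrative; format must name a parser registered with Log::Parallel::Parsers):

 sources:
   - name:       raw apache logs
     hosts:      web1.example.com, web2.example.com
     path:       /var/log/apache/%YYYY%%MM%%DD%/access_log.gz
     valid_from: 2009-01-01
     format:     RawApacheLogs
     sorted_by:  time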

Hosts Section

The hosts section provides parameters for the hosts that will be used to run the jobs and store the output from the jobs.

The hosts section is a YAML HASH under the hostsinfo key in the main section. The keys are hostnames. The values are hashes with the following keys:

datadir

Required. The path to where permanent data should be stored on this host. This path is available as the %DATADIR% substitution in job and source path names.

temporary_storage

Optional, defaults to /tmp. Where temporary files should be stored.

max_threads

Not Implemented Yet. Optional, default = 4. The number of simultaneous processes to run on this host.

max_memory

Not Implemented Yet. Optional, default = 5G. Amount of memory available for log processing jobs on this host.
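
A hostsinfo entry using these keys might look like this (the hostname and paths are hypothetical):

 hostsinfo:
   log1.example.com:
     datadir:           /data/logprocessing
     temporary_storage: /data/tmp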

Directives Section

The directives section is where overall parameters are set.

These are all level 1 YAML keys.

master_node

The hostname of the control node where the header information and metadata are kept. This needs to match one of the hostnames in the hostsinfo section.

headers

The path to where header information is kept (on master_node).

metdata_data

The path to where metadata information is kept (on master_node).

LICENSE

This package may be used and redistributed under the terms of either the Artistic 2.0 or LGPL 2.1 license.