Log::Parallel::ConfigCheck - Log processing configuration file validation
    use YAML::ConfigFile qw(get_config);
    use Log::Parallel::ConfigCheck;

    my $config = get_config($config_file);
    validate_config($config);
ConfigCheck uses Config::Checker to validate a log processing configuration that is used by process_logs. ConfigCheck is essentially nothing more than a description of the valid log processing configuration options.
The configuration file has several sections. The main section is the one that defines the jobs that process_logs runs.
The jobs section describes the processing steps that will be applied to the logs. This is the meat of the process.
The jobs are a YAML array under the jobs key in the main section.
The keys are as follows; a sketch of a complete job entry appears after the list.
Required. The name of the job is used only for diagnostics. It is not required to be unique except within its specified time range.
Required. A list of sources of information. These can come from the destination fields of prior jobs or from the name fields of sources (see below). Multiple items may be listed (comma-separated or as a YAML array), but the sources must all be in the same sort order. The input to the filter and transform steps will be in sorted order. An example source would be something like raw apache logs.
Required. This is the name of what this job produces. This needs to be unique within the time range that this job is valid for. An example destination might be queries extracted from sessions.
Required. The name of the format for the output of this job. This needs to be one of the Writers that registers itself with Log::Parallel::Writers. Examples are: TSV_as_sessions, Sessions, TSV.
Not implemented yet. Optional, defaults to the hosts of the previous job or source. The hosts to which the output from this job should be written.
Required. The path name where output from this job should be written. The path name will undergo macro substitutions from YAML::ConfigFile, from Log::Parallel::Durations, and from Log::Parallel::Task. These substitutions include:
%DATADIR%, defined per-host in the hosts section.
The bucket number, as five digits.
The job's frequency, e.g. 3 weeks or daily.
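For example, a path using these substitutions might look like the following sketch (%DATADIR% is defined in the hosts section; the %BUCKET% macro name for the bucket number is an illustrative assumption, not a confirmed name):

    path: "%DATADIR%/queries/%BUCKET%.tsv"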
Optional, defaults to the earliest time based on its sources. The earliest date for which this job should be run.
Optional, defaults to the latest time based on its sources. The last date for which this job should be run.
Optional. Perl code to choose whether the input $log object should be processed or ignored. A true return value indicates that the object should be processed.
To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how filter_config can be used.
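For example, a filter provided as a closure might look like the following sketch (the status field is an illustrative assumption about the $log objects, not something defined by this module):

    filter: |
        BEGIN {
            # runs once at compile time; the closure is then
            # called for each input $log object
            $coderef = sub { $log->{status} == 200 };
        }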
Optional. A HASH of extra information to provide at compile time for the filter to use.
Optional. Perl code to group log objects together. The default is not to group. If grouper is set, then the $log objects will be formed into groups based on the output of the grouper function. The input is assumed to be in order, so that groups form in sequence and only one group need be remembered at a time. Once grouped, the transform step will receive a reference to an array of log objects instead of the single log object it would receive if there were no grouper.
To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how grouper_config can be used.
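For example, a grouper that batches consecutive events belonging to the same session might look like the following sketch (session_id is an illustrative field name; consecutive $log objects with the same grouper output form one group):

    grouper: |
        $log->{session_id}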
Optional. A HASH of extra information to provide at compile time for the grouper to use.
Optional. Perl code to transform input $log objects into zero or more output $log objects. This can do re-grouping to turn multiple events into a session or vice versa. It can do aggregation (see Stream::Aggregate) and collapse many log entries to statistics.
To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how config can be used.
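For example, when a grouper is in place, a transform that collapses each group into a single summary object might look like the following sketch (field names are illustrative, and the exact convention for returning output objects is defined by Log::Parallel::Task rather than here):

    transform: |
        # with a grouper, $log is a reference to an array of log objects
        my $session = {
            session_id => $log->[0]{session_id},
            n_events   => scalar(@$log),
        };
        $session;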
Optional. A HASH of extra information to provide at compile time for the transform to use.
Optional. A list of fields (in the $log object) to use to sort the output. The list can be comma-separated or it can be a YAML list. Each field name may be followed by unix sort flags in parentheses. For example:
    sort_by:
      - id()
      - count(n)
      - name
The sort flags are optional, but if none are present, then the data will be examined (which isn't free) and a guess made as to what kind of data is present. It's better to use flags. If any flag is used, then no data will be examined and any field without a flag will be treated as a text field. Empty parentheses () signify text.
The currently supported flags are n, g, and r. More could be added by modifying make_compare_func() in Log::Parallel::Task.
Optional. A number: how many buckets to split the output from this job into. This would be used to allow parallel processing. Defaults to one per host.
Not implemented yet. Optional. When splitting the output into buckets, it will be split on the modulo of the MD5 sum of the return value from this bit of Perl code. If you want to make sure that all URLs from the same domain end up in the same bucket, return the domain name.
To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how bucket_config can be used.
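For example, a bucketizer implementing the domain-name idea above might look like the following sketch (the url field is an illustrative assumption about the $log objects):

    bucketizer: |
        my ($domain) = $log->{url} =~ m{^[a-z]+://([^/]+)}i;
        $domain;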
Optional. A HASH of extra information to provide at compile time for the bucketizer to use.
Optional, defaults to the frequency of its source. How often should this job be run? This is parsed by Log::Parallel::Durations. Examples are: daily, monthly, on the 3rd Sunday each month.
Optional, defaults to the length of the frequency. How much data should be processed by the job? This is parsed by Log::Parallel::Durations. Examples are: daily, 3 weeks.
Not implemented yet. Optional. How long should the output of this job be kept?
Optional. Extra parameters (a hash) for the parsers of the output of this job.
Optional. Extra parameters for the parsers used to read the input for this job.
Optional. Extra parameters for the Writer used to save the output from this job.
Optional. 0 or 1. If a true value, this job is skipped.
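Putting the pieces together, a single entry in the jobs array might look like the following sketch. All values are illustrative, and key names that the descriptions above do not spell out explicitly (name, source, format, path) are plausible guesses rather than confirmed names:

    jobs:
      - name:        extract queries
        source:      raw apache logs
        destination: queries extracted from sessions
        format:      TSV
        path:        "%DATADIR%/queries/%BUCKET%.tsv"
        frequency:   daily
        sort_by:
          - count(n)
          - name
        filter: |
            $log->{status} == 200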
The sources section specifies where the raw inputs for the log processing system can be found.
The sources are a YAML array under the sources key in the main section.
Required. The name of the source. This must be unique among other sources and jobs for the time period within which this source is valid.
Required. A list of hosts (YAML array or comma-separated) where the input files can be found.
Required. The path to the input files. The path name can have predefined and regular-expression wildcard matches. The predefined matches are:
Match a year.
Match a month number (%MM%).
Match a day number.
Regular expression matches are defined as %NAME=regex%. For example, if the months are 1-12 instead of 01-12, use %MM=\d\d?% instead of %MM% to match month numbers.
Required. The earliest date for which this source is valid.
Optional, defaults to now. The last date for which this source is valid.
The data format of this source. This must be one of the Parsers that registers itself with Log::Parallel::Parsers.
Not implemented yet. Optional. How long until the source files should be removed, to recover disk space and protect our users' privacy.
Optional. How is this data ordered? A list of fields (YAML array or comma-separated) from the $log objects returned by the Parser. Usually these are ordered by time.
Optional. A hash of extra parameters for the parsers that will read this data.
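A sources entry might look like the following sketch. As with the job example above, values are illustrative; the %MM% macro is named in the text, while %YYYY% and the key names not spelled out in the descriptions (hosts, path, valid_from, format) are assumptions:

    sources:
      - name:       raw apache logs
        hosts:      web1.example.com, web2.example.com
        path:       /var/log/apache/%YYYY%/%MM%/access.log
        valid_from: 2008-01-01
        format:     ApacheAccessLog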
The hosts section provides parameters for the hosts that will be used to run the jobs and store the output from the jobs.
The hosts section is a YAML HASH in the main section under the hostsinfo key. The keys are hostnames. The values are hashes with the following keys:
Required. The path to where permanent data should be stored on this host. This path is available as the %DATADIR% substitution in jobs and sources path names.
Optional, defaults to /tmp. Where temporary files should be stored.
Not implemented yet. Optional, defaults to 4. The number of simultaneous processes to run on this host.
Not implemented yet. Optional, defaults to 5G. The amount of memory available for log processing jobs on this host.
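A hostsinfo section might look like the following sketch (the datadir and temporary_storage key names are guesses based on the descriptions above, not confirmed names):

    hostsinfo:
      web1.example.com:
        datadir:           /data/logs    # available as %DATADIR%
        temporary_storage: /tmp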
The directives section is where overall parameters are set.
These are all level 1 YAML keys.
The hostname of the control node where the header information and metadata is kept. This needs to match one of the hostnames in the hostsinfo section.
The path to where header information is kept (on master_node).
The path to where metadata information is kept (on master_node).
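For example (master_node is named above; the two path key names here are illustrative guesses):

    master_node:   web1.example.com
    headers_path:  /data/logs/headers
    metadata_path: /data/logs/metadata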
This package may be used and redistributed under the terms of either the Artistic 2.0 or LGPL 2.1 license.