NAME
Log::Parallel::ConfigCheck - Log processing configuration file validation
SYNOPSIS
  use Config::YAMLMacros qw(get_config);
  use Log::Parallel::ConfigCheck;

  my $config = get_config($config_file);
  validate_config($config);
DESCRIPTION
ConfigCheck uses Config::Checker to validate a log processing configuration that is used by process_logs. Essentially, all ConfigCheck consists of is a description of the log processing configuration options.
The configuration file has several sections. The main section is the one that defines the jobs that process_logs does.
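For orientation, a minimal configuration might be shaped roughly like the sketch below; the hostname, paths, and entry names are illustrative assumptions, and each section is described in detail below.

    master_node: log1.example.com
    headers: /data/logproc/headers
    hostsinfo:
      log1.example.com:
        datadir: /data/logproc
    sources:
      - name: raw apache logs
        # ... source keys, see the Sources Section
    jobs:
      - name: sessionize apache logs
        # ... job keys, see the Jobs Section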
Jobs Section
The jobs section describes the processing steps that will be applied to the logs. This is the meat of the process.
The jobs are a YAML array under the key jobs in the main section.
The keys of each job are listed below; a sketch of a complete job entry follows the list.
- name
  Required. The name of the job is used only for diagnostics. It is not required to be unique except within its specified time range.
- source
  Required. A list of sources of information. These can come from the destination fields of other prior jobs or from the name fields of sources (see below). Multiple items may be listed (comma-separated or as a YAML array) but the sources must all be in the same sort order. The input to the filter and transform steps will be in sorted order. An example source would be something like "raw apache logs".
- destination
  Required. This is the name of what this job produces. This needs to be unique within the time range that this job is valid for. An example destination might be "queries extracted from sessions".
- output_format
  Required. The name of the output format for the output of this job. This needs to be one of the Writers that registers itself with Log::Parallel::Writers. Examples are: TSV_as_sessions, Sessions, TSV.
- hosts
  Not implemented yet. Optional, defaults to the hosts of the previous job or source. Which hosts should the output from this job be written to?
- path
  Required. The path name where output from this job should be written. The path name will undergo macro substitutions from Config::YAMLMacros, from Log::Parallel::Durations, and from Log::Parallel::Task. These substitutions include %DATADIR% (see the Hosts Section below).
- valid_from
  Optional, defaults to the earliest time based on its sources. The earliest date for which this job should be run.
- valid_to
  Optional, defaults to the latest time based on its sources. The last date for which this job should be run.
- filter
  Optional. Perl code to choose if the input $log object should be processed or ignored. A true return value indicates that the object should be processed.
  To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how filter_config can be used.
- filter_config
  Optional. A HASH of extra information to provide at compile time for the filter to use.
- grouper
  Optional. Perl code to group log objects together. The default is not to group. If grouper is set, then the $log objects will be formed into groups based on the output of the grouper function. The input is assumed to be in order so that groups form in sequence and only one group need be remembered at a time. Once grouped, the transform step will receive a reference to an array of log objects instead of the single log object it would receive if there was no grouper.
  To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how grouper_config can be used.
- grouper_config
  Optional. A HASH of extra information to provide at compile time for the grouper to use.
- transform
  Optional. Perl code to transform input $log objects into zero or more output $log objects. This can do re-grouping to turn multiple events into a session or vice versa. It can do aggregation (see Stream::Aggregate) and collapse many log entries into statistics.
  To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how config can be used.
- config
  Optional. A HASH of extra information to provide at compile time for the transform to use.
- sort_by
  Optional. A list of fields (in the $log object) to use to sort the output. The list can be comma-separated or it can be a YAML list. Each field name may be followed by unix sort flags in parentheses. For example:

    sort_by:
      - id()
      - count(n)
      - name

  The sort flags are optional, but if none are present, then the data will be examined (which isn't free) and a guess made as to what kind of data is present. It's better to use flags. If any flag is used, then no data will be examined and any field without a flag will be treated as a text field. An empty parenthesis () signifies text.
  The currently supported flags are n, g, and r. More could be added by modifying make_compare_func() in Log::Parallel::Task.
- buckets
  Optional. A number: how many buckets to split the output from this job into. This would be used to allow parallel processing. Defaults to one per host.
- bucketizer
  Not implemented yet. Optional. When splitting the output into buckets, it will be split on the modulo of the MD5 sum of the return value from this bit of Perl code. If you want to make sure that all URLs from the same domain end up in the same bucket, return the domain name.
  To provide a closure instead of code, have a BEGIN block set $coderef to the closure. If set, code outside the BEGIN block will be invoked only once. This is how bucket_config can be used.
- bucket_config
  Optional. A HASH of extra information to provide at compile time for the bucketizer to use.
- frequency
  Optional, defaults to the frequency of its source. How often should this job be run? This is parsed by Log::Parallel::Durations. Examples are: daily, monthly, on the 3rd Sunday each month.
- timespan
  Optional, defaults to the length of the frequency. How much data should be processed by the job? This is parsed by Log::Parallel::Durations. Examples are: daily, 3 weeks.
- remove_after
  Not implemented yet. Optional. How long should the output of this job be kept?
- parser_config
  Optional. Extra parameters (a hash) for the parsers of the output of this job.
- input_config
  Optional. Extra parameters for the parsers used to read the input for this job.
- output_config
  Optional. Extra parameters for the Writer used to save the output from this job.
- DISABLED
  Optional. 0 or 1. If a true value, this job is skipped.
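As mentioned above, here is a sketch of a single job entry. The names, path, date macros other than %DATADIR%, and the filter code are illustrative assumptions; only the keys themselves are taken from the descriptions above.

    jobs:
      - name: sessionize apache logs
        source: raw apache logs
        destination: user sessions
        output_format: Sessions
        path: '%DATADIR%/sessions/%YYYY%-%MM%-%DD%.gz'   # date macros assumed
        frequency: daily
        sort_by:
          - session_id()
          - start_time(n)
        filter: |
          # illustrative: keep only successful requests
          $log->{status} == 200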
Sources Section
The sources section specifies where the raw inputs for the log processing system can be found.
The sources are a YAML array under the key sources in the main section. A sketch of a source entry follows the list of keys below.
- name
  Required. The name of the source. This must be unique among other sources and jobs for the time period within which this source is valid.
- hosts
  Required. A list of hosts (YAML array or comma-separated) where the input files can be found.
- path
  Required. The path to the input files. The path name can have predefined and regular-expression wildcard matches. The predefined matches include date components such as %MM% (month number).
  Regular expression matches are defined as %NAME=regex%. For example, if the months are 1-12 instead of 01-12, use %MM=\d\d?% instead of %MM% to match month numbers.
- valid_from
  Required. The earliest date for which this source is valid.
- valid_to
  Optional, defaults to now. The last date for which this source is valid.
- format
  The data format of this source. This must be one of the Parsers that registers itself with Log::Parallel::Parsers.
- remove_after
  Not implemented yet. Optional. How long until the source files should be removed to recover disk space and protect our users' privacy.
- sorted_by
  Optional. How is this data ordered? A list of fields (YAML array or comma-separated) from the $log objects returned by the Parser. Usually these are ordered by time.
- parser_config
  Optional. A hash of extra parameters for the parsers that will read this data.
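A sketch of a source entry follows; the hostnames, path, date, and parser name are assumptions, not values defined by this module.

    sources:
      - name: raw apache logs
        hosts: web1.example.com, web2.example.com
        path: '/var/log/apache/access.%YYYY%%MM%%DD%.gz'
        valid_from: 2009-01-01
        format: ApacheAccessLog    # must be a Parser registered with Log::Parallel::Parsers
        sorted_by: time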
Hosts Section
The hosts section provides parameters for the hosts that will be used to run the jobs and store the output from the jobs.
The hosts section is a YAML HASH in the main section under the key hostsinfo. The keys are hostnames. The values are hashes with the following keys (a sketch follows the list):
- datadir
  Required. The path to where permanent data should be stored on this host. This path is available as the %DATADIR% substitution in jobs and sources path names.
- temporary_storage
  Optional, defaults to /tmp. Where temporary files should be stored.
- max_threads
  Not implemented yet. Optional, default = 4. The number of simultaneous processes to run on this host.
- max_memory
  Not implemented yet. Optional, default = 5G. Amount of memory available for log processing jobs on this host.
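A sketch of a hostsinfo entry, with an assumed hostname and paths:

    hostsinfo:
      log1.example.com:
        datadir: /data/logproc
        temporary_storage: /data/tmp
        max_threads: 4
        max_memory: 5G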
Directives Section
The directives section is where overall parameters are set; a sketch follows the list of keys below.
These are all level 1 YAML keys.
- master_node
  The hostname of the control node where the header information and metadata are kept. This needs to match one of the hostnames in the hostsinfo section.
- headers
  The path to where header information is kept (on master_node).
- metdata_data
  The path to where metadata information is kept (on master_node).
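A sketch of the directives, with assumed hostname and paths (the key names are as documented above):

    master_node: log1.example.com
    headers: /data/logproc/headers
    metdata_data: /data/logproc/metadata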
LICENSE
This package may be used and redistributed under the terms of either the Artistic 2.0 or LGPL 2.1 license.