NAME

PEQUEL - Pequel User Guide

OVERVIEW -- WHAT IS PEQUEL?

Pequel is a comprehensive system for data file processing and transformation. It features a simple, user-friendly, event-driven scripting interface that transparently generates, builds and executes highly efficient data-processing programs. By using the Pequel scripting language, the user can create and maintain complex data transformation processes quickly, easily, and accurately. Incidentally, the name Pequel is derived from "Perl-ish sequel".

The Pequel system can be used by both technical users (programmers) and non-technical end users. For non-technical users, the Pequel scripting language is simple to learn, and Pequel will transparently generate, build and execute the transformation process. For developers, the generated transformation program can be examined and extended, though this is rarely necessary because the scripting language contains constructs that are powerful enough to handle even the most complex transformation process. A Perl module, Pequel.pm, is provided for developers; it allows Pequel processes to be created within Perl programs.

The Pequel scripting language is both simple and powerful. It is event driven, with each event defining a specific stage in the overall transformation process. Each event section is filled in systematically with a list of items. These items can be condition statements, field names, property settings, aggregation statements, calculation statements, and so on. A full and comprehensive array of aggregates and macros is available, as well as full Perl regular expressions within statements.

Pequel generates highly efficient Perl and C code. The generated code is as efficient as hand-written code. The emphasis in the generated code is performance -- to process maximum records in minimum time. The generated code can be dumped into a program file and executed independently of Pequel.

The Pequel script is self-documenting via pequeldoc. Pequel will automatically generate the Pequel Script Programmer's Reference Manual in pdf format. This manual contains detailed and summarised information about the script, and includes cross-reference information. It can optionally also contain a listing of the generated program.

Pequel is installed as a Perl module.

Pequel currently supports the following incoming data stream formats: variable length delimited, CSV, fixed length, Apache CLF, and anything else that Perl pack/unpack can handle.

Pequel has a multitude of uses:

Selecting Columns: Use Pequel to output selected columns from an input data stream.
Selecting Records: Output selected records based on filtering conditional statements. Full Perl regular expressions are available.
Deriving New Columns: Derive new columns using simple to complex expressions. Perform calculations on input fields to generate new (derived) fields, using Perl expressions. Calculations can be performed on both numeric fields (mathematical) and string fields (such as concatenation, substr, etc).
Grouping and Aggregating Data: Records with similar characteristics can be grouped together. Calculate aggregations, such as max, min, mean, sum, and count, on grouped record sets.
In-Memory Sort-less Aggregation: Grouping can be performed in memory on unsorted input data using the hash option.
Statistics: Pequel provides a comprehensive array of statistical aggregate functions.
Data Cleansing: Pequel can be effectively used for checking and resolving invalid data.
Data Frequency/Quality Analysis: TBD
Data Conversion: Convert data using any of the built-in macros and Perl regular expressions. Perform any kind of data conversion. These include, converting from one data type to another, reformatting, case change, splitting a field into two or more fields, combining two or more fields into one field, converting date fields from one date format to another, padding, etc.
Distributed Data Processing: Data can be distributed based on conditions to multiple Pequel processes.
Combining Data: Data output from multiple Pequel processes can be combined into the incoming data stream.
Merging Data: Data from any number of external files or other Pequel processes can be merged via the Pequel tables facility.
Piped Data Processing: The output from one Pequel process can be piped into a second Pequel process simply by specifying the first script name as the input_file property for the second script.
Array Fields: Pequel supports array fields and provides a comprehensive set of array macros to manipulate or generate array fields.
Database Connectivity: Direct access to database (Oracle, Sqlite, etc) tables via the Pequel table facility. Pequel will generate low level database API code. Currently supported databases are Oracle (via OCI), and Sqlite.

USAGE

pequel scriptfile.pql < file_in > file_out: Execute pequel with scriptfile.pql script to process file_in data file, resulting in file_out. The scriptfile.pql will contain the transformation instructions.
pequel -c scriptfile.pql: Check the syntax of the pequel script scriptfile.pql.
pequel -viewcode scriptfile.pql: Generate and display the code for the pequel script scriptfile.pql.
pequel -dumpcode scriptfile.pql: Generate the code for the pequel script scriptfile.pql and save the generated code in the file scriptfile.pql.2.code.
pequel -v: Display version information for Pequel.
pequel -usage: Display Pequel usage command summary.
pequel -pequeldoc pdf -detail scriptfile.pql: Generate the Script Reference document in pdf format for the Pequel script scriptfile.pql. The document will include a section showing the generated code (-detail).

QUICK START

Create Pequel Script

Use your preferred text editor to create a pequel script myscript.pql. Syntax highlighting is available for vim with the pequel.vim syntax file (in vim/syntax) -- copy the pequel.vim file into the syntax directory of the vim installation.

All that is required is to fill in, at least, the output section, or to specify the transfer option. The transfer option has the effect of copying all input field values to the output. This is effectively a straight-through process -- the resulting output is identical to the input.

 options
     transfer
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION
 
 output section

Check The Pequel Script

Do a syntax check on the script by using the Pequel -c option. This should return the words myscript.pql Syntax OK.

pequel -c myscript.pql

myscript.pql Syntax OK

Dump and View The Generated Perl Program

Optionally, the generated Perl program can be dumped and viewed. The program will be dumped in a file with the same name and path as the script with a '.2.code' suffix.

pequel -dumpcode myscript.pql

Processing pequel script 'myscript.pql'...................

->myscript.pql.2.code

Run The Pequel Script

If the syntax check is OK, run the script -- the sample.data data file in the examples directory can be used as input:

pequel myscript.pql < inputdata > outputdata

TUTORIAL

Select A Subset Of Records

Next, we do something useful to transform the input data. Create a filter to output a subset of records, consisting of records which have LOCATION starting with 10. The filter example matches the LOCATION field content against the Perl regular expression /^10/ using the =~ operator. This is specified in the filter section. Check and run the updated script as instructed above:

 options
     transfer
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION
 
 filter
     LOCATION =~ /^10/
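
As a rough illustration of what this filter condition means, the following Perl sketch applies the same regular expression to pipe-delimited records. It is not the code Pequel generates; the passes_filter helper and the sample records are hypothetical, and the sketch assumes the default pipe delimiter with LOCATION as the eighth field.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when a pipe-delimited record passes
# the filter LOCATION =~ /^10/ (LOCATION is the 8th field, index 7).
sub passes_filter {
    my ($record) = @_;
    my @field = split /\|/, $record, -1;
    return $field[7] =~ /^10/ ? 1 : 0;
}

# Made-up sample records, for illustration only.
my @records = (
    'P1001|2.50|Widget|R|3.75|10|20020121|1001',
    'P1002|1.20|Gadget|W|1.80|5|20020122|2010',
);

# Only records whose LOCATION starts with 10 are printed.
print "$_\n" for grep { passes_filter($_) } @records;
```

Note that the pattern is anchored with ^, so a LOCATION of 2010 does not match even though it contains 10.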

Create New Derived Fields

Create additional, derived fields based on the other input fields. In our example, two new fields are added: COST_VALUE and SALES_VALUE. Derived fields must be specified in the input section after the last input field. The derived field name is followed by the => operator and a calculation expression. Derived fields will also be output when the transfer option is specified.

 options
     transfer
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION,
     COST_VALUE => COST_PRICE * QUANTITY,
     SALES_VALUE => SALES_PRICE * QUANTITY
 
 filter
     LOCATION =~ /^10/
 
 output section

Select Which Fields To Output

In the above examples, the output record has the same (field) format as the input record, plus the additional derived fields. In the following example we select which fields to output, and their order, on the output record. To do this we need to remove the transfer option, and create the output section. The output fields PRODUCT, LOCATION, DESCRIPTION, QUANTITY, COST_VALUE, and SALES_VALUE are specified to create a new output format. In this example, all the output field names have the same name as the input fields.

 options
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION,
     COST_VALUE => COST_PRICE * QUANTITY,
     SALES_VALUE => SALES_PRICE * QUANTITY
 
 filter
     LOCATION =~ /^10/
 
 output section
     string PRODUCT      PRODUCT,
     string LOCATION     LOCATION,
     string DESCRIPTION  DESCRIPTION,
     numeric QUANTITY    QUANTITY,
     decimal COST_VALUE  COST_VALUE,
     decimal SALES_VALUE SALES_VALUE

Group Records For Analysis

Records with similar characteristics can be grouped together, and aggregations can then be performed on the grouped records' data. The following example groups the records by LOCATION, and sums the COST_VALUE and SALES_VALUE fields within each group. Grouping is activated by creating a group by section. Input data must also be sorted on the grouping field(s). If the data is not pre-sorted then this needs to be done in the script by creating a sort by section. Alternatively, by specifying the hash option, the input data need not be sorted.

 options
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION,
     COST_VALUE => COST_PRICE * QUANTITY,
     SALES_VALUE => SALES_PRICE * QUANTITY
 
 filter
     LOCATION =~ /^10/
 
 sort by
     LOCATION

 group by
     LOCATION

 output section
     string LOCATION     LOCATION,
     string PRODUCT      PRODUCT,
     string DESCRIPTION  DESCRIPTION,
     numeric QUANTITY    QUANTITY,
     decimal COST_VALUE  sum COST_VALUE,
     decimal SALES_VALUE sum SALES_VALUE
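
The idea behind grouping pre-sorted input can be sketched in plain Perl as follows. This is only an illustration of break processing under the assumption that rows arrive sorted by the grouping key; the sum_sorted_groups helper and its data are hypothetical and do not reproduce the code Pequel generates.

```perl
use strict;
use warnings;

# Hypothetical helper: sums a value per key over rows that are already
# sorted by key, emitting one [key, total] pair each time the key changes.
sub sum_sorted_groups {
    my @rows = @_;                # each row is [LOCATION, SALES_VALUE]
    my @out;
    my $current;
    my $total = 0;
    for my $row (@rows) {
        my ($key, $value) = @$row;
        if (defined $current && $key ne $current) {
            push @out, [$current, $total];   # key break: flush the group
            $total = 0;
        }
        $current = $key;
        $total  += $value;
    }
    push @out, [$current, $total] if defined $current;   # flush last group
    return @out;
}

# Made-up rows, already sorted by LOCATION.
my @groups = sum_sorted_groups(['1001', 10], ['1001', 5], ['1002', 7]);
print join('|', @$_), "\n" for @groups;
```

Because a group is flushed as soon as the key changes, only one running total is held at a time, which is why this approach depends on sorted input.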

Select A Subset Of Grouped Records

A subset of groups can be selected by creating a having section. The having section is similar to the filter section, but is applied to the aggregated group of records instead. In this example we will output only records for locations which have a total SALES_VALUE of 1000 or more. Note that SALES_VALUE in the having section refers to the output field (sum SALES_VALUE) and not the input field with the same name (SALES_PRICE * QUANTITY). The having section gives preference to output fields when interpreting field names.

 options
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION,
     COST_VALUE => COST_PRICE * QUANTITY,
     SALES_VALUE => SALES_PRICE * QUANTITY
 
 filter
     LOCATION =~ /^10/
 
 sort by
     LOCATION

 group by
     LOCATION

 output section
     string LOCATION     LOCATION,
     string PRODUCT      PRODUCT,
     string DESCRIPTION  DESCRIPTION,
     numeric QUANTITY    QUANTITY,
     decimal COST_VALUE  sum COST_VALUE,
     decimal SALES_VALUE sum SALES_VALUE

 having
     SALES_VALUE >= 1000

Aggregation Based On Conditions

Output fields can be aggregated conditionally. That is, the aggregation will only occur for records, within the group, that evaluate the condition to true. This is done by adding a where clause to the aggregate function. In this example we create three new output fields SALES_VALUE_RETAIL, SALES_VALUE_WSALE and SALES_VALUE_OTHER. These fields will contain the sales value for records within the group which have sales code equal to 'R', 'W', and other codes, respectively.

 options
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION,
     COST_VALUE => COST_PRICE * QUANTITY,
     SALES_VALUE => SALES_PRICE * QUANTITY
 
 filter
     LOCATION =~ /^10/
 
 sort by
     LOCATION

 group by
     LOCATION

 output section
     string LOCATION            LOCATION,
     string PRODUCT             PRODUCT,
     string DESCRIPTION         DESCRIPTION,
     numeric QUANTITY           QUANTITY,
     decimal COST_VALUE         sum COST_VALUE,
     decimal SALES_VALUE        sum SALES_VALUE,
     decimal SALES_VALUE_RETAIL sum SALES_VALUE where SALES_CODE eq 'R',
     decimal SALES_VALUE_WSALE  sum SALES_VALUE where SALES_CODE eq 'W',
     decimal SALES_VALUE_OTHER  sum SALES_VALUE where SALES_CODE ne 'R' and SALES_CODE ne 'W'
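
The effect of the three where clauses on a single group can be pictured with this small Perl sketch. The sum_by_sales_code helper and the sample rows are hypothetical; the sketch only mirrors the conditions in the script above and is not Pequel's generated code.

```perl
use strict;
use warnings;

# Hypothetical helper: totals SALES_VALUE for one group, split by SALES_CODE.
sub sum_by_sales_code {
    my @rows = @_;                # each row is [SALES_CODE, SALES_VALUE]
    my %total = (retail => 0, wsale => 0, other => 0);
    for my $row (@rows) {
        my ($code, $value) = @$row;
        if    ($code eq 'R') { $total{retail} += $value }   # where SALES_CODE eq 'R'
        elsif ($code eq 'W') { $total{wsale}  += $value }   # where SALES_CODE eq 'W'
        else                 { $total{other}  += $value }   # neither 'R' nor 'W'
    }
    return %total;
}

# Made-up rows belonging to one LOCATION group.
my %t = sum_by_sales_code(['R', 10], ['W', 4], ['X', 1], ['R', 2.5]);
printf "retail=%s wsale=%s other=%s\n", @t{qw(retail wsale other)};
```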

Derived Fields Based On Output Fields

An output derived field, the calculation of which is based on output fields, can be created by declaring an output field with the = operator followed by a calculation expression.

 options
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION,
     COST_VALUE => COST_PRICE * QUANTITY,
     SALES_VALUE => SALES_PRICE * QUANTITY
 
 filter
     LOCATION =~ /^10/
 
 sort by
     LOCATION

 group by
     LOCATION

 output section
     string LOCATION            LOCATION,
     string PRODUCT             PRODUCT,
     string DESCRIPTION         DESCRIPTION,
     numeric QUANTITY           QUANTITY,
     numeric TOTAL_QUANTITY     sum QUANTITY,
     decimal COST_VALUE         sum COST_VALUE,
     decimal SALES_VALUE        sum SALES_VALUE,
     decimal SALES_VALUE_RETAIL sum SALES_VALUE where SALES_CODE eq 'R',
     decimal SALES_VALUE_WSALE  sum SALES_VALUE where SALES_CODE eq 'W',
     decimal SALES_VALUE_OTHER  sum SALES_VALUE where SALES_CODE ne 'R' and SALES_CODE ne 'W',
     decimal AVG_SALES_VALUE    = SALES_VALUE / TOTAL_QUANTITY

Note

In order to protect against a divide-by-zero exception, the AVG_SALES_VALUE field would actually be better declared as follows. This form uses the Perl conditional (ternary) ?: operator. If TOTAL_QUANTITY is zero, it will set AVG_SALES_VALUE to zero; otherwise it will set AVG_SALES_VALUE to SALES_VALUE / TOTAL_QUANTITY. Thus, the division will only be performed on a non-zero TOTAL_QUANTITY.

     decimal AVG_SALES_VALUE    = TOTAL_QUANTITY == 0 ? 0.0 : SALES_VALUE / TOTAL_QUANTITY
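
The same guard written as stand-alone Perl, for readers who want to try the expression outside a Pequel script. The avg_sales_value helper is a hypothetical name used only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the guarded declaration above.
sub avg_sales_value {
    my ($sales_value, $total_quantity) = @_;
    return $total_quantity == 0 ? 0.0 : $sales_value / $total_quantity;
}

print avg_sales_value(150, 10), "\n";   # normal case: division is performed
print avg_sales_value(150, 0), "\n";    # guarded case: no division is attempted
```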

Create Intermediate (Transparent) Output Fields

In the previous example, supposing that the TOTAL_QUANTITY field was not required in the output, it could be made transparent by declaring it with an underscore (_) prefix. Transparent output fields are useful for creating intermediate fields required for calculations.

     numeric _TOTAL_QUANTITY    sum QUANTITY,
     decimal AVG_SALES_VALUE    = SALES_VALUE / _TOTAL_QUANTITY

Cleaning Data

Data can be cleaned in a variety of ways, and invalid records placed in a reject file. The following example determines the validity of a record by a) the length of certain fields, and b) the content of the QUANTITY field. The PRODUCT and LOCATION fields must be at least 8 and 2 characters long, respectively; the QUANTITY field must contain only numeric digits, decimal points and minus signs. The rejected records will be placed in the reject file called scriptname.reject.

 options
     transfer
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION
 
 reject
     length(PRODUCT) < 8 || length(LOCATION) < 2,
     QUANTITY !~ /^[0-9\.\-]+$/
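
The two reject conditions can be tried in isolation with the Perl sketch below. The is_rejected helper is hypothetical, and the sketch assumes, in line with the description above, that a record failing either condition is rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when a record would be rejected by
# either of the two conditions in the reject section above.
sub is_rejected {
    my ($product, $quantity, $location) = @_;
    return 1 if length($product) < 8 || length($location) < 2;
    return 1 if $quantity !~ /^[0-9\.\-]+$/;
    return 0;
}

# Made-up values, for illustration only.
print is_rejected('PROD0001', '12',  '10'), "\n";   # valid record
print is_rejected('SHORT',    '12',  '10'), "\n";   # PRODUCT too short
print is_rejected('PROD0001', '12a', '10'), "\n";   # QUANTITY has a letter
```

The character class only checks which characters appear, so a value such as 1-2.3. would still be accepted; a stricter numeric pattern would be needed to catch that.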

Converting Data

Any sort of data conversion can be performed. These include, converting from one data type to another, reformatting, case change, splitting a field into two or more fields, combining two or more fields into one field, converting date fields from one date format to another, padding, etc. The following script demonstrates these data conversions.

 options
 
 input section
     PRODUCT,
     COST_PRICE,
     DESCRIPTION,
     SALES_CODE,
     SALES_PRICE,
     QUANTITY,
     SALES_DATE,
     LOCATION
 
 output section
     string PRODUCT_U     = &uc(PRODUCT), // Convert case to upper
     string DESCRIPTION_U = &uc(DESCRIPTION), // Convert case to upper
     string PCODE_1       = &substr(PRODUCT,0,2), // Split field
     string PCODE_2       = &substr(PRODUCT,2,4), //  ""
     string ANALYSIS_1    = SALES_CODE . sprintf("%08d", COST_PRICE), // Combine fields
     string S_QUANTITY    = sprintf("%08d", QUANTITY), // Reformat/Convert field
     string NEW_PRODUCT   = PCODE_2 . PCODE_1 . &substr(PRODUCT,6), // Reformat
     decimal SALES_PRICE  SALES_PRICE, // no change
     string SALES_CODE    SALES_CODE, // no change
     string LOCATION      LOCATION // no change
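
To see what these conversions produce, the sketch below performs the same operations in plain Perl on one made-up record. It assumes that the &uc and &substr macros behave like the Perl built-ins of the same name, which this guide does not state explicitly; the convert_fields helper is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the conversions from the script above
# to one record and returns the new fields as a hash.
sub convert_fields {
    my ($product, $sales_code, $cost_price, $quantity) = @_;
    my $pcode_1 = substr($product, 0, 2);                  # split field
    my $pcode_2 = substr($product, 2, 4);
    return (
        PRODUCT_U   => uc $product,                         # case change
        PCODE_1     => $pcode_1,
        PCODE_2     => $pcode_2,
        ANALYSIS_1  => $sales_code . sprintf('%08d', $cost_price),  # combine
        S_QUANTITY  => sprintf('%08d', $quantity),          # zero-pad to 8 digits
        NEW_PRODUCT => $pcode_2 . $pcode_1 . substr($product, 6),   # reorder parts
    );
}

# Made-up values, for illustration only.
my %out = convert_fields('ab1234xyz', 'R', 42, 7);
print "$_=$out{$_}\n" for sort keys %out;
```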

Using Date Fields

TBC

Counting Records

TBC

Extracting n Distinct Values For A Field

TBC

Tabulating Data

TBC

Statistical Analysis

TBC

Declaring And Using Tables For Value Lookup

TBC

Using External Tables

TBC

Create A Summary Report

TBC

Using Array Fields

TBC

Database Tables: oracle

TBC

Database Tables: sqlite

TBC

Merge Database Tables

TBC

View The Generated Perl Code

To view the generated Perl code use the Pequel -viewcode option:

pequel -viewcode scriptname.pql | more

Dump The Generated Perl Code

To dump the generated Perl code use the Pequel -dumpcode option. This will save the generated Perl program in a file named after the script with .2.code appended. So, if your script is called myscript.pql, the generated Perl program will be saved in the file myscript.pql.2.code, in the same path:

pequel -dumpcode scriptname.pql

Produce The Script Specification Document

Use the Pequel -pequeldoc pdf option to produce a presentation script specification for the Pequel script. The generated pdf document will be saved in a file with the same name as the script but with the file extension changed from pql to pdf.

pequel scriptname.pql -pequeldoc pdf

Use the -detail option to include the generated code in the document.

pequel scriptname.pql -pequeldoc pdf -detail

Display Summary Information For Script

The -list option will display the parsed details from the script in a summarised format.

pequel scriptname.pql -list

COMMAND LINE OPTIONS

--prefix, --prefix_path: Prefix for filenames directory path
--verbose, --ver: Display progress counter
--noverbose, --silent, --quite: Do not display progress counter
--input_file, --is, --if, --i: Input data filename
--usage: Display command usage description
--output_file, --os, --of, --o: Output data filename
--script_name, --script, --s, --pql: Script filename
--header: Write header record to output.
--pequeldoc, --doc: Generate pod / pdf pequel script Reference Guide.
--viewcode, --vc: Display the generated Perl code for pequel script
--dumpcode, --dc, --diag: Dump the generated Perl code for pequel script
--syntax_check, --c, --check: Check the pequel script for syntax errors
--version, --v: Display Pequel Version information
--table_info, --ti: Display Table information for all tables declared in the pequel script
cpp_cmd, cpp_args: Override the default cpp command name and any additional arguments required.

PEQUEL LANGUAGE REFERENCE

A Pequel script is divided into sections. Each section begins with a section name, which appears on a line on its own, followed by a list of items. Each item line must be terminated by a newline or a comma (or both). To split an item line across multiple lines (for better readability), use the line continuation character \.

Pequel is event driven. Each section within a Pequel script describes an event. For example, the input section is activated whenever an input record is read; the output section is activated whenever an aggregation is performed.

The sections must appear in the order described below. A minimal script must contain input section and output section, or, input section and transfer option. All other sections are optional, and need only appear in the Pequel script if they contain statements.

The main sections are input section and output section. The input section defines the format, in fields, of the input data stream. It can also define new calculated (derived) fields. The output section defines the format of the output data stream. The output section is required in order to perform aggregation. The output section will consist of input fields, aggregations based on grouping the input records, and new calculated fields.

Input sorting can be specified with the sort by section. Break processing (grouping) can be specified with the group by section. Input filtering is specified with the filter section. Groups of records can be filtered with the having section.

A powerful feature of Pequel is its built-in tables feature. Tables consist of key and value pairs. They are used to perform merges and joins on multiple input data sources, and can also be used to access external data for cross-referencing and value lookups.

Pequel also handles a number of date field formats. The &date() macro provides access to date fields.

Comments

Any text following and including the # symbol or // is treated as comment text. If the cpp preprocessor is available then comments are limited to C-style comments (// and /* ... */), because the # then introduces a preprocessor directive.

Statement Line Continuation

Each item within a section must appear on a single line. To break up an item statement (for better readability), use the line continuation character \.

Pre Processor

If your system provides the cpp preprocessor, your Pequel script may include any C/C++ style macros and defines.

Section Types

options: Specify properties.
description: This section contains free-format text to describe the function of the script.
use package: Specify any external Perl package modules.
input section: The items within this section consist of input data stream field names followed by any derived field definitions.
field preprocess: Specify any input field pre-processing which will occur before the field is referenced by any derived field.
filter: The filter section specifies one or more condition item statements which will be used to match incoming data records and filter out any records that do not match all the condition item statements.
reject
divert input record: If the input record matches any of the condition item statements then divert the record to the specified Pequel process or file.
copy input record: If the input record matches any of the condition item statements then copy the record to the specified Pequel process or file.
display message on input: If the input record matches any of the condition item statements then display the specified message to stderr.
display message on input abort: If the input record matches any of the condition item statements then display the specified message to stderr then exit the process.
sort by: The sort by section contains a list of input field items with optional type and sort order specifications. These fields specify the sort ordering for the input data stream.
group by: The group by section contains a list of input field items with optional type specification. These fields specify the grouping requirements for the input data stream.
dedup on
output section: The output section contains a list of output field definitions.
field postprocess: Specify any output field post-processing.
having: The having section specifies one or more condition item statements which will be used to match output data records and filter out any records that do not match all the condition item statements.
divert output record: If the output record matches any of the condition item statements then divert the record to the specified Pequel process or file.
copy output record: If the output record matches any of the condition item statements then copy the record to the specified Pequel process or file.
display message on output: If the output record matches any of the condition item statements then display the specified message to stderr.
display message on output abort: If the output record matches any of the condition item statements then display the specified message to stderr then exit the process.
init table: Initialise local tables.
load table: Load and initialise external tables.
load table pequel: Load table from output of external Pequel script.

OPTIONS SECTION

This section is used to declare various options described in detail below. Options define the overall character of the data transformation.

Format

options <option> [ (<arg>) ] [, ...]

Example

 options
     input_delimiter(\s+), # one or more space(s) delimit input fields.
     verbose(100000), # print progress on every 100000'th input record.
     optimize,
     varnames,
     default_date_type(DD/MM/YY),
     nonulls,
     diag

verbose

Set the verbose option to display progress information to STDERR during the transform run. It requires one parameter, a record count: Pequel will display a counter message each time that number of records has been read from input.

silent

Suppress all processing messages to stderr.

prefix

Specify a prefix path. The prefix will be used with all external file names unless the name starts with a '/'.

input_delimiter

Specify the character that is used to delimit columns in the input data stream. This is usually the pipe | character, but can be any character including the space character. For multiple spaces use \s+, and for multiple tabs use \t+. This input delimiter will default to the pipe character if input_delimiter is not specified.
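
The difference between a single-character delimiter and a pattern such as \s+ can be seen with Perl's split, which this sketch uses as a stand-in. How Pequel applies the delimiter internally is not described here, so treat the split_record helper as a hypothetical illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: splits a record on a delimiter given as a regex string.
# The -1 limit keeps trailing empty fields instead of dropping them.
sub split_record {
    my ($record, $delimiter) = @_;
    return split /$delimiter/, $record, -1;
}

# Made-up records, for illustration only.
my @piped  = split_record('P1001|Widget|10', '\|');      # pipe must be escaped in a regex
my @spaced = split_record('P1001   Widget 10', '\s+');   # runs of whitespace count once
print scalar(@piped), " fields, ", scalar(@spaced), " fields\n";
```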

output_delimiter

Specify the character that will delimit columns in the output. The output delimiter will default to the input delimiter if not specified. Refer to input_delimiter above for more information regarding types of delimiters.

discard_header

If the input data stream contains an initial header record then this option must be specified in order to discard this record from the processing.

input_file

Specify the file name as a parameter. If specified, the input data will be read from this file; otherwise it will be read from STDIN. If the input_file option contains a Pequel script name (anything ending in .pql) then the output from executing this input script will be chained to produce the input data stream.

output_file

Specify the file name as a parameter. If specified, the output will be written to this file (the file will be overwritten!); otherwise it will be sent to STDOUT.

transfer

Copy the input record to output. The input record is copied as is, including calculated fields, to the output record. Fields specified in the output section are placed after the input fields. The transfer option is not available when group by is in use.

hash

Use hash processing mode. Hash mode is only available when break processing is activated with 'group by'. In hash mode input data need not be sorted. Because this mode of processing is memory intensive, it should only be used when generating a small number of groups. The optional 'numeric' modifier can be specified to sort the output numerically; if not specified, a string sort is done.
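
The trade-off described above can be illustrated with a short Perl sketch of hash-based grouping. The sum_hashed_groups helper and its data are hypothetical and are not taken from the generated code; the point is only that every group is held in memory until the input ends, which is why the number of groups should be small.

```perl
use strict;
use warnings;

# Hypothetical helper: sums values per key from unsorted rows using a hash.
# One hash entry is kept per distinct key for the whole run.
sub sum_hashed_groups {
    my @rows = @_;                # each row is [LOCATION, SALES_VALUE]
    my %total;
    $total{ $_->[0] } += $_->[1] for @rows;
    return %total;
}

# Made-up rows in arbitrary order.
my %total = sum_hashed_groups(['1002', 7], ['1001', 10], ['1001', 5]);

# Output is ordered after aggregation; this is a string sort.
print "$_|$total{$_}\n" for sort keys %total;
# For numeric keys, sort { $a <=> $b } keys %total would sort numerically.
```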

header

If specified then an initial header record will be written to output. This header record contains the output field names. By default a header record will be output if neither header nor noheader is specified.

noheader

Specify this option to suppress writing of header record.

addpipe

Specify this option to add an extra delimiter character after the last field. This is the default action if neither addpipe nor noaddpipe is specified.

noaddpipe

Specify this option to suppress adding an extra delimiter character after the last field.

optimize

If specified the generated Perl code will be optimized to run more efficiently. This optimisation is done by grouping similar where conditions into if-else blocks. Thus if a number of where clauses contain the same condition, these statements will be grouped under one if condition. The optimize option should only be used by users with some knowledge of Perl.

nooptimize

Specify this option to prevent code from being optimised. This is the default setting.

nulls

If specified, numeric and decimal values with a zero/null value will be output as null character. This is the default setting.

nonulls

If specified, numeric and decimal values with a zero/null value will be output as 0.

reject_file

Use this option to specify a file name to contain the rejected records. These are records that are rejected by the filter specified in the reject section. If no reject file option is specified then the default reject file name is the script file name with .reject appended.

dumpcode

Set this option to save the generated code in the scriptname.2.code file. This file contains the generated Perl program that will process the input data stream. The generated Perl program can be executed independently of Pequel.

default_date_type

Specify a default date type. Currently supported date types are: YYYYMMDD, YYMMDD, DDMMYY, DDMMMYY, DDMMYYYY, DD/MM/YY, DD/MM/YYYY, and US date formats: MMDDYY, MMDDYYYY, MM/DD/YY, MM/DD/YYYY. The DDMMMYY format refers to dates such as 21JAN02.

default_list_delimiter

Specify the default list delimiter for array fields created by values_all and values_uniq aggregates. Any delimiter specified as a parameter to the aggregate function will override this.

rmctrlm v3

If the input file is in DOS format, specify 'rmctrlm' option to remove the Ctrl-M at end of line.

input_record_limit v3

Specify number of records to process from input file. Processing will stop after the number of records as specified have been read.

suppress_output v3

Use this option when the summary section is used, to prevent output of raw results.

pequeldoc

Generate PDF for Programmer's Reference Manual for the Pequel script. The next three options are also required.

doc_title

Specify the title that will appear on the pequeldoc generated manual.

doc_email

Specify the user's email that will appear on the pequeldoc generated manual.

doc_version

Specify the Pequel script version number that will appear on the pequeldoc generated manual.

gzcat_cmd, gzcat_args

Override the default gzcat command name and any additional arguments required.

cat_cmd, cat_args

Override the default cat command name and any additional arguments required.

sort_cmd, sort_args

Override the default sort command name and any additional arguments required.

pack_output, output_pack_fmt

The output data stream can be packed using the format specified in output_pack_fmt. These properties can also be used to produce fixed format and binary output. The default format is A3/Z* repeated for each output field. Please refer to the Perl perlpacktut manual for a detailed description of formats.

unpack_input, input_pack_fmt

The packed input data stream can be unpacked using the format specified in input_pack_fmt. These properties can also be used to input fixed format and binary input. The default format is A3/Z* repeated for each input field. Please refer to the Perl perlpacktut manual for a detailed description of formats.
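
As a small illustration of fixed-format records with Perl's pack and unpack, the sketch below uses an assumed template of A8 A4 (an 8-character field followed by a 4-character field). This layout and the helper names are made up for the example; it does not use Pequel's default A3/Z* format.

```perl
use strict;
use warnings;

# Assumed layout for illustration: 8-character PRODUCT, 4-character LOCATION.
my $template = 'A8 A4';

# Hypothetical helpers wrapping Perl's built-in pack and unpack.
sub pack_record   { return pack($template, @_) }
sub unpack_record { return unpack($template, $_[0]) }

# Made-up values, for illustration only.
my $line  = pack_record('P1001', '10');    # values are space-padded to width
my @field = unpack_record($line);          # A strips the trailing pad spaces

print length($line), "\n";
print join('|', @field), "\n";
```

With the A format, values longer than the field width are truncated when packed, so the widths in the template must cover the longest expected value.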

INLINE OPTIONS

The following options require that the Inline::C Perl module and a C compiler system is installed on your system.

use_inline

The use_inline option will instruct Pequel to generate (and compile/link) C code -- replacing the input file identifier inside the main while loop by a readsplit() function call. The readsplit function is implemented in C.

input_delimiter_extra

Specify one or more extra field delimiter characters. These may be any one of the quote characters ', ", `, and, optionally, any one of the bracket characters {, [, (. For example, this option can be used to parse input Apache log files in CLF format:

 options
     input_delimiter_extra("[)  // Apache CLF log quoted fields and bracketed timestamp
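
The effect of the extra delimiters can be pictured with a small Python sketch. This is not the readsplit() function that Pequel generates in C; the function name split_with_extras, the sample log line, and the decision to drop the quote and bracket characters from the field values are all assumptions made for illustration.

```python
def split_with_extras(line, delimiter=" ", extras='"['):
    """Split on delimiter, keeping quoted or bracketed spans as one field."""
    closers = {"[": "]", "{": "}", "(": ")"}
    fields, current, closing = [], [], None
    for ch in line:
        if closing is not None:
            if ch == closing:
                closing = None          # end of the protected span
            else:
                current.append(ch)      # delimiter characters are kept here
        elif ch in extras:
            # Brackets close with their partner, quotes close with themselves.
            closing = closers.get(ch, ch)
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Hypothetical CLF-style line, for illustration only.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326')
for field in split_with_extras(sample):
    print(field)
```

With the space as the field delimiter, the bracketed timestamp and the quoted request each come out as a single field even though they contain spaces.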

inline_clean_after_build

Tells Inline to clean up the current build area if the build was successful. Sometimes you want to DISABLE this for debugging. Default is 1.

inline_clean_build_area

Tells Inline to clean up the old build areas within the entire Inline DIRECTORY. Default is 0.

inline_print_info

Tells Inline to print various information about the source code. Default is 0.

inline_build_noisy

Tells ILSMs (Inline Language Support Modules) that they should dump build messages to the terminal rather than be silent about all the build details.

inline_build_timers

Tells ILSMs to print timing information about how long each build phase took. Usually requires Time::HiRes.

inline_force_build

Makes Inline build (compile) the source code every time the program is run. The default is 0.

inline_directory

The DIRECTORY config option is the directory that Inline uses to both build and install an extension.

Normally Inline will search in a number of known places for a directory called '.Inline/'. Failing that, it will create a directory called '_Inline/'.

If you want to specify your own directory, use this configuration option.

Note that you must create the DIRECTORY directory yourself. Inline will not do it for you.

inline_CC

Specify which compiler to use.

inline_OPTIMIZE

This controls the MakeMaker OPTIMIZE setting. By setting this value to '-g', you can turn on debugging support for your Inline extensions. This will allow you to set breakpoints in your C code using a debugger like gdb.

inline_CCFLAGS

Specify extra compiler flags.

inline_LIBS

Specifies external libraries that should be linked into your code.

inline_INC

Specifies an include path to use. Corresponds to the MakeMaker parameter.

inline_LDDLFLAGS

Specify which linker flags to use.

NOTE: These flags will completely override the existing flags, instead of just adding to them. So if you need to use those too, you must respecify them here.

inline_MAKE

Specify the name of the 'make' utility to use.

USE PACKAGE SECTION

Use this section to specify Perl packages to use. This section is optional.

Format

use package <Perl package name> [, ...]

Examples

 use package
     Benchmark,
     EasyDate

INIT TABLE SECTION

Use init table to initialise tables in the Pequel script. This section consists of a list of entries, each giving a table name followed by a key and its value (or list of values). The key must not contain any spaces. In order to avoid clutter in the script, use load table as described below. To look up a table key/value use the %table name(key) syntax. Table column values are accessed by using the %table name(key)->n syntax, where n refers to a column number starting from 1. The column specification is not required for single value tables. All entries within a table should have the same number of values; empty values can be declared with a null quoted value (''). This section is optional.

Format

init table <table> <key> <value> [, <value>...]

Example

 init table
 // Table-Name Key-Value Field-1              Field-2  Field-3
    LOCINFO    NSW       'New South Wales'    '2061'   '02'
    LOCINFO    WA        'Western Australia'  '5008'   '07'
    LOCINFO    SA        'South Australia'    '8078'   '08'

 input section
    LOCATION,
    LDESCRIPT => %LOCINFO(LOCATION)->1 . " in postcode " . %LOCINFO(LOCATION)->2
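
The lookup in the example above can be mirrored in Python as a rough sketch. The helper table_lookup and the choice to return an empty string for a missing key are assumptions; the guide does not say what Pequel returns when the key is absent.

```python
# The LOCINFO table from the example, keyed by location code.
LOCINFO = {
    "NSW": ("New South Wales", "2061", "02"),
    "WA": ("Western Australia", "5008", "07"),
    "SA": ("South Australia", "8078", "08"),
}


def table_lookup(table, key, column=1):
    """Return the value in the given 1-based column for key."""
    row = table.get(key)
    if row is None:
        return ""  # assumption: a missing key yields an empty value
    return row[column - 1]


def ldescript(location):
    # Mirrors: %LOCINFO(LOCATION)->1 . " in postcode " . %LOCINFO(LOCATION)->2
    return (table_lookup(LOCINFO, location, 1)
            + " in postcode "
            + table_lookup(LOCINFO, location, 2))


print(ldescript("NSW"))
```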

LOAD TABLE SECTION

Use this section to declare tables that are to be initialised from an external data file. If the table is in .tbl format (key|value) then only the table name (without the .tbl) need be specified. The filename can consist of the full path name. Compressed files (ending in .gz, .z, .Z, .zip) will be handled properly. If the key column is not specified then it is set to 1 by default; if the value column is not specified then it is set to 2 by default. Column numbers are 1-based. To look up a table key/value use the %table name(key) syntax. If the table name is prefixed with the _ character, the table will be loaded at runtime instead of compile time. Thus the table contents will not appear in the generated code. This is useful if the table contains more than a few hundred entries, as it will not clutter up the generated code.

persistant option

The persistant option will make the table disk-based instead of memory-based. Use this option for tables that are too big to fit in available memory. The disk-based table snapshot file will have the name _TABLE_name.dat, where name is the table name. When the persistant option is used, the table is generated only once, the first time it is used. Thereafter it will be loaded from the snapshot file. This is a lot quicker and therefore useful for large tables. In order to re-generate the table, the snapshot file must be manually deleted. In order to use the persistant option the Perl DB_File module must be available. The effect of persistant is to tie the table's associative array to a DBM database (Berkeley DB). Note that using persistant tables will degrade the overall performance of the script.

Format

load table [ persistant ] <table> [ <filename> [ <key_col> [ <val_col> ] ] ] [, ...]

Examples

 load table
     POSTCODES
     MONTH_NAMES /data/tables/month_names.tbl
     POCODES pocodes.gz 1 2
     ZIPSAMPLE zipsample.txt 3 21
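
As an illustration of the key and value column defaults, here is a minimal Python sketch of loading a delimited table file. It is not Pequel's loader: the function load_table is hypothetical, only the .gz suffix is handled, and the demo file contents are made up.

```python
import gzip
import os
import tempfile


def load_table(filename, key_col=1, val_col=2, delimiter="|"):
    """Read a delimited file into a dict; column numbers are 1-based."""
    # Only .gz is handled in this sketch; other compressed suffixes are not.
    opener = gzip.open if filename.endswith(".gz") else open
    table = {}
    with opener(filename, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(delimiter)
            table[fields[key_col - 1]] = fields[val_col - 1]
    return table


# Demo with a throwaway compressed file (contents are invented).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pocodes.gz")
    with gzip.open(path, "wt") as out:
        out.write("K1|alpha|x\nK2|beta|y\n")
    print(load_table(path))          # key column 1, value column 2
    print(load_table(path, 1, 3))    # value taken from column 3 instead
```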

INPUT SECTION

This section defines the format of the input data stream. Any calculated fields must be placed after the last input field. The calculation expression must begin with => and consists of (almost) any valid Perl statement, and can include input field names. All macros are also available to calculation expressions. The input section must appear before all the sections described below. Each input field name must be unique.

Format

input section <input field name> [ => <calculation expression> ] [, ...]

Example

 input section
     ACL,
     AAL,
     ZIP,
     CALLDATE,
     CALLS,
     DURATION,
     REVENUE,
     DISCOUNT,
     KINSHIP_KEY,
     INV => REVENUE + DISCOUNT,
     MONTH_CALLDATE => &month(CALLDATE),
     GROUP => MONTH_CALLDATE <= 6 ? 1 : 2,
     POSTCODE => %POSTCODES(AAL),
     IN_SAMPLE => exists %ZIPSAMPLE(ZIP),
     IN_SAMPLE_2 => exists %ZIPSAMPLE(ZIP) ? 'yes': 'no'
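
The derived fields in this example can be read as a sequence of assignments evaluated per record, where a later field (GROUP) may use an earlier one (MONTH_CALLDATE). The Python sketch below illustrates that reading only. The month helper is a hypothetical stand-in for the &month macro and assumes CALLDATE is in YYYYMMDD form, and the empty-string result for a missing table key is also an assumption.

```python
def month(calldate):
    """Hypothetical stand-in for &month; assumes a YYYYMMDD string."""
    return int(calldate[4:6])


def derive(record, postcodes, zipsample):
    """Add the calculated fields from the example to one input record."""
    out = dict(record)
    out["INV"] = out["REVENUE"] + out["DISCOUNT"]
    out["MONTH_CALLDATE"] = month(out["CALLDATE"])
    # GROUP depends on the derived field computed on the previous line.
    out["GROUP"] = 1 if out["MONTH_CALLDATE"] <= 6 else 2
    out["POSTCODE"] = postcodes.get(out["AAL"], "")  # assumption: "" if absent
    out["IN_SAMPLE"] = out["ZIP"] in zipsample
    out["IN_SAMPLE_2"] = "yes" if out["ZIP"] in zipsample else "no"
    return out


# Invented sample data, for illustration only.
row = {"AAL": "0299", "ZIP": "52101", "CALLDATE": "20030815",
       "REVENUE": 120.0, "DISCOUNT": 5.0}
print(derive(row, {"0299": "2061"}, {"52101": 1}))
```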

FIELD PREPROCESS SECTION

Use this section to perform additional formatting/processing on input fields. These statements will be performed right after the input record is read and before calculating the input derived fields.

FIELD POSTPROCESS SECTION

Use this section to perform additional formatting/processing on output fields. These statements will be performed after the aggregations and just prior to the output of the aggregated record.

SORT BY SECTION

Use this section to sort the input data by field(s). One or more sort fields can be specified. This section must appear after the input section and before the group by and output sections. The numeric option is used to specify a numeric sort, and the desc option is used to specify a descending sort order. The standard Unix sort command is used to perform the sort. The numeric option is translated to the -n Unix sort option; the desc option is translated to the -r Unix sort option. If the input data is presorted then the sort by section is not required (even if break processing is activated with a group by section declaration). The sort by section is also not required when the hash option is specified.

Format

sort by <field name> [ numeric ] [ desc ] [, ...]

Examples

 sort by
     ACL,
     AAL numeric desc
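
The ordering requested by this example can be modelled in Python as below. Pequel itself delegates to the Unix sort command, so this sketch only shows the intended result: ACL ascending as a string, then AAL descending as a number. The function name sort_records and the sample records are illustrative assumptions.

```python
def sort_records(records):
    """Order by ACL (string, ascending), then AAL (numeric, descending)."""
    # Negating the numeric key gives a descending order for that field only.
    return sorted(records, key=lambda r: (r["ACL"], -float(r["AAL"])))


records = [
    {"ACL": "b", "AAL": "1"},
    {"ACL": "a", "AAL": "2"},
    {"ACL": "a", "AAL": "10"},
]
for rec in sort_records(records):
    print(rec["ACL"], rec["AAL"])
```

Note that a plain string comparison would place "10" before "2"; the numeric option exists to avoid exactly that.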

REJECT SECTION

Specify one or more filter expressions. Filter expression can consist of any valid Perl statement, and must evaluate to Boolean true or false (0 is false, anything else is true). It can contain input field names and macros. Each input record is evaluated against the filter(s). Records that evaluate to true on any one filter will be rejected and written to the reject file. The reject file is named scriptname.reject unless specified in the reject_file option.

Format

reject <filter expression> [, ...]

Examples

 reject
     !exists %ZIPSAMPLE(ZIP)
     INV < 200

FILTER SECTION

Specify one or more filter expressions. A filter expression can consist of any valid Perl statement, and must evaluate to Boolean true or false. It can contain input field names and macros. Each input record is evaluated against the filter(s). Only records that evaluate to true on all filter statements will be processed; that is, records that evaluate to false on any one filter statement will be discarded.

Format

filter <filter expression> [, ...]

Examples

 filter
     exists %ZIPSAMPLE(ZIP)
     ACL =~ /^356/
     ZIP eq '52101' or ZIP eq '52102'
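
The difference between reject rules (true on any one rule sends the record to the reject file) and filter rules (true on all rules is needed to keep the record) can be sketched in Python as follows. The function partition, the sample records, and the choice to test reject rules before filter rules are assumptions; the guide does not state the order in which the two are applied.

```python
import re


def partition(records, reject_tests, filter_tests):
    """Split records into (kept, rejected); the rest are silently discarded."""
    kept, rejected = [], []
    for rec in records:
        if any(test(rec) for test in reject_tests):
            rejected.append(rec)        # would go to the reject file
        elif all(test(rec) for test in filter_tests):
            kept.append(rec)            # passes every filter
    return kept, rejected


ZIPSAMPLE = {"52101": 1, "52102": 1}
reject_tests = [lambda r: r["ZIP"] not in ZIPSAMPLE, lambda r: r["INV"] < 200]
filter_tests = [lambda r: re.search(r"^356", r["ACL"]) is not None]

records = [
    {"ACL": "356001", "ZIP": "52101", "INV": 250},  # kept
    {"ACL": "356002", "ZIP": "99999", "INV": 250},  # rejected: ZIP not in table
    {"ACL": "356003", "ZIP": "52102", "INV": 100},  # rejected: INV < 200
    {"ACL": "999004", "ZIP": "52101", "INV": 300},  # discarded by the filter
]
kept, rejected = partition(records, reject_tests, filter_tests)
print(len(kept), len(rejected))
```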

GROUP BY SECTION

Use this section to activate break processing. Break processing is required to be able to use the aggregates in the output section. One or more fields can be specified; the input data must be sorted on the group by fields, unless the hash option is used. A break will occur when any of the group field values changes. The group by section must appear after the sort by section and before the output section. The numeric option will cause leading zeros to be stripped from the input field. Group by on calculated input fields is useful when the hash option is in use because the input does not need to be presorted.

Format

group by <input field name> [ numeric | decimal | string ] [, ...]

Examples

 group by
     AAL,
     ACL numeric
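
To show why break processing needs sorted input while the hash option does not, here is a minimal Python sketch of both strategies summing one field per group. It illustrates the idea only and is not the code Pequel generates; the function names and sample data are assumptions.

```python
def group_by_break(records, key_fields, value_field):
    """Break processing: emit a group whenever the key changes."""
    results, current_key, total = [], None, 0
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if current_key is not None and key != current_key:
            results.append((current_key, total))   # key changed: flush group
            total = 0
        current_key = key
        total += rec[value_field]
    if current_key is not None:
        results.append((current_key, total))       # flush the final group
    return results


def group_by_hash(records, key_fields, value_field):
    """Hash grouping: accumulate per key, so input order does not matter."""
    totals = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        totals[key] = totals.get(key, 0) + rec[value_field]
    return sorted(totals.items())


unsorted = [
    {"AAL": "A", "REVENUE": 1},
    {"AAL": "B", "REVENUE": 5},
    {"AAL": "A", "REVENUE": 2},
]
print(group_by_break(unsorted, ["AAL"], "REVENUE"))  # group A is split in two
print(group_by_hash(unsorted, ["AAL"], "REVENUE"))   # group A is combined
```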

DEDUP ON SECTION

OUTPUT SECTION

This is where the output data stream format is specified. At least one output field must be defined here (unless the transfer option is specified). Each output field definition must end with a comma or new line (or both). Each field definition must begin with a type (numeric, decimal, string, date). The output field name can be the same as an input field name, unless the output field is a calculated field. Each output field name must be unique. This name will appear in the header record (if the header option is set). The aggregate expression must consist of at least the input field name.

The aggregates sum, min, max, avg, first, last, distinct, values_all, and values_uniq must be followed by an input field name. The aggregates count and flag must be followed by the * character. The aggregate serial must be followed by a number (indicating the serial number start).

A prefix of _ in the output field name causes that field to be transparent; these fields will not be output, their use is mainly for intermediate calculations. <input field name> can be any field declared in the input section, including calculated fields. This section is required unless the transfer option is specified.

Format

output section <type> <output field name> <output expression> [, ...]

<type>

numeric, decimal, string, date [ (<datefmt>) ]

<output field name>

Each output field name must be unique. The output field name can be the same as the input field name, unless the output field is a calculated field. A _ prefix denotes a transparent field. Transparent fields will not be output; they are used for intermediate calculations.

<datefmt>

YYYYMMDD, YYMMDD, DDMMYY, DDMMMYY, DDMMYYYY, DD/MM/YY, DD/MM/YYYY, MMDDYY, MMDDYYYY, MM/DD/YY, MM/DD/YYYY

<output expression>

<aggregate> <input field name> [ where <condition expression> ]

serial <start num> [ where <condition expression> ]

count * [ where <condition expression> ]

flag * [ where <condition expression> ]

= <calculation expression> [ where <condition expression> ]

<aggregate>

| sum_distinct | avg_distinct | count_distinct

| values_all [ (<delim>) ] | values_uniq [ (<delim>) ]

<input field name>

Any field specified in the input section.

<calculation expression>

Any valid Perl expression, including input and output field names, and Pequel macros. This expression can consist of numeric calculations, using arithmetic operators (+, *, -, etc.) and functions (abs, int, rand, sqrt, etc.), and string calculations, using string operators (e.g. . for concatenation) and functions (uc, lc, substr, length, etc.).

<condition expression>

Any valid Perl expression, including input and output field names, and Pequel macros, that evaluates to true (non-zero) or false (zero).

Aggregates

sum <input field>

Accumulate the total for all values in the group. Output type must be numeric, decimal or date.

sum_distinct <input field>

Accumulate the total for distinct values only in the group. Output type must be numeric, decimal or date.

maximum | max <input field>

Output the maximum value in the group. Output type must be numeric, decimal or date.

minimum | min <input field>

Output the minimum value in the group. Output type must be numeric, decimal or date.

avg | mean <input field>

Output the average value in the group. Output type must be numeric, decimal or date.

avg_distinct <input field>

Output the average value for distinct values only in the group. Output type must be numeric, decimal or date.

first <input field>

Output the first value in the group.

last <input field>

Output the last value in the group.

count_distinct | distinct <input field>

Output the count of unique values in the group. Output type must be numeric.

median <input field>

The median is the middle of a distribution: half the scores are above the median and half are below the median. When there is an odd number of values, the median is simply the middle number. When there is an even number of values, the median is the mean of the two middle numbers. Output type must be numeric.

variance <input field>

Variance is calculated as follows: (sum_squares / count) - (mean ** 2), where sum_squares is the sum of the squares (** 2) of each value in the distribution; count is the number of values in the distribution; and mean is the average described above. Output type must be numeric.

stddev <input field>

Stddev is calculated as the square-root of variance. Output type must be numeric.

range <input field>

The range is the maximum value minus the minimum value in a distribution. Output type must be numeric.

mode <input field>

The mode is the most frequently occurring score in a distribution and is used as a measure of central tendency. A distribution may have more than one mode, in which case a space delimited list is returned. Any output type is valid.

values_all <input field>

Output the list of all values in the group. The specified delimiter delimits the list. If not specified then the default_list_delimiter specified in options is used.

values_uniq <input field>

Output the list of unique values in the group. The specified delimiter delimits the list. If not specified then the default_list_delimiter specified in options is used.

serial <n>

Output the next serial number starting from n. The serial number will be incremented by one for each successive output record. Output type must be numeric.

count *

Output the count of records in the group. Output type must be numeric.

flag *

Output 1 or 0 depending on the result of the where condition clause. If no where clause is specified then the output value is set to 1. Otherwise the output is set to 1 if the where condition evaluates to true for at least one record within the group. Output type must be numeric.

corr <input field>

New in v2.5. Returns the coefficient of correlation of a set of number pairs.

covar_pop <input field>

New in v2.5. Returns the population covariance of a set of number pairs.

covar_samp <input field>

New in v2.5. Returns the sample covariance of a set of number pairs.

cume_dist <input field>

New in v2.5. Calculates the cumulative distribution of a value in a group of values.

dense_rank <input field>

New in v2.5. Computes the rank of a row in an ordered group of rows.

rank <input field>

New in v2.5. Calculates the rank of a value in a group of values.

= <calculation expression>

Calculation expression follows. Use this to create output fields that are based on some calculation expression. The calculation expression can consist of any valid Perl statement, and can contain input field names, output field names and macros.
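
The statistical aggregates described above (median, variance, stddev, range and mode) can be checked against a small Python sketch that follows the stated definitions. This is an illustration, not Pequel's generated code; in particular, the order in which multiple modes are listed is an assumption.

```python
import math
from collections import Counter


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]                       # odd count: middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: mean of middle two


def variance(values):
    count = len(values)
    mean = sum(values) / count
    sum_squares = sum(v ** 2 for v in values)
    return (sum_squares / count) - (mean ** 2)


def stddev(values):
    return math.sqrt(variance(values))


def value_range(values):
    return max(values) - min(values)


def mode(values):
    counts = Counter(values)
    top = max(counts.values())
    # Space delimited list when several values share the highest frequency.
    return " ".join(str(v) for v in counts if counts[v] == top)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(median(data), variance(data), stddev(data), value_range(data), mode(data))
```

For this sample the mean is 5 and the sum of squares is 232, so the variance is 232 / 8 - 25 = 4 and the standard deviation is 2.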

Examples

 output section
     numeric AAL            AAL
     string  _HELLO         = 'HELLO'
     string  _WORLD         = 'WORLD'
     string  HELLO_WORLD    = _HELLO . ' ' . _WORLD
     decimal _REVENUE       sum REVENUE
     decimal _DISCOUNT      sum DISCOUNT
     decimal INVOICE        = _REVENUE + _DISCOUNT
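
The role of transparent fields in this example can be sketched in Python: fields whose names start with _ take part in the calculations but are dropped from the emitted record. The function emit and the sample group are assumptions, as is the reading that a bare input field name (AAL) yields the group's value for that field.

```python
def emit(group):
    """Build one output record for a group of input records."""
    fields = {}
    fields["AAL"] = group[0]["AAL"]  # assumption: the group's value for AAL
    fields["_HELLO"] = "HELLO"
    fields["_WORLD"] = "WORLD"
    fields["HELLO_WORLD"] = fields["_HELLO"] + " " + fields["_WORLD"]
    fields["_REVENUE"] = sum(r["REVENUE"] for r in group)
    fields["_DISCOUNT"] = sum(r["DISCOUNT"] for r in group)
    fields["INVOICE"] = fields["_REVENUE"] + fields["_DISCOUNT"]
    # Transparent fields (leading underscore) are not written to the output.
    return {name: value for name, value in fields.items()
            if not name.startswith("_")}


group = [
    {"AAL": "0299", "REVENUE": 10.0, "DISCOUNT": 2.5},
    {"AAL": "0299", "REVENUE": 5.0, "DISCOUNT": 0.5},
]
print(emit(group))
```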

HAVING SECTION

The having section is applied after the grouping performed by group by, for filtering groups based on the aggregate values. Break processing must be activated using the group by section. The having section must appear after the output section. Specify one or more filter expressions. Filter expression can consist of any valid Perl statement, and must evaluate to Boolean true or false. It can contain input field names, output field names and macros. Only groups that evaluate to true on all filter statements will be output; that is, groups that evaluate to false on any one filter statement will be discarded. Each filter statement must end with a comma and/or new line.

Format

having <filter expression> [, ...]

Examples

 having
     SAMPLE == 1
     MONTH_1_COUNT > 2 and MONTH_2_COUNT > 2

SUMMARY SECTION

This section contains any Perl code and will be executed once after all input records have been processed. Input field names, output field names, and macros can be used here. This section is mostly relevant when group by is omitted, so that a group all is in effect. The suppress_output option should also be used. If the script contains a group by section and more than one group of records is produced, only the last group's values will appear in the summary section.

Format

summary section < Perl code >

Examples

 summary section
     print "*** Summary Report ***";
     print "Total number of Products:   ", sprintf("%12d", COUNT_PRODUCTS);
     print "Total number of Locations:  ", sprintf("%12d", COUNT_LOCATIONS);
     print "*** End of report ***";

GENERATED PROGRAM OUTLINE

Open Input Stream
Load/Connect Tables
Read Next Input Record
Output Aggregated Record If Grouping Key Changes
Calculate Derived Input Fields
Perform Aggregations
Process Outline:
