PEQUEL - Pequel User Guide
Pequel is a comprehensive system for data file processing and transformation. It features a simple, user-friendly event driven scripting interface that transparently generates, builds and executes highly efficient data-processing programs. By using the Pequel scripting language, the user can create and maintain complex data transformation processes quickly, easily, and accurately. Incidentally, the name pequel is derived from perl'ish sequel.
The Pequel system can be used by both technical users (programmers) and non-technical end users. For non-technical users, the Pequel scripting language is simple to learn, and Pequel will transparently generate, build and execute the transformation process. For developers, the generated transformation program can be examined and extended, though this is rarely necessary because the scripting language contains constructs powerful enough to handle even the most complex transformation process. A Perl module, Pequel.pm, is provided for developers; it allows the creation of Pequel processes within Perl programs.
The Pequel scripting language is both simple and powerful. It is event driven, with each event defining a specific stage in the overall transformation process. Each event section is filled in systematically by a list of items. These items can be condition statements, field names, property settings, aggregation statements, calculation statements, and so on. A comprehensive array of aggregates and macros is available, as well as full Perl regular expressions within statements.
Pequel generates highly efficient Perl and C code. The generated code is as efficient as hand-written code. The emphasis in the generated code is performance -- to process maximum records in minimum time. The generated code can be dumped into a program file and executed independently of Pequel.
The Pequel script is self-documenting via pequeldoc. Pequel will automatically generate the Pequel Script Programmer's Reference Manual in pdf format. This manual contains detailed and summarised information about the script, and includes cross-reference information. It can also contain an optional listing of the generated program.
Pequel is installed as a Perl module.
Pequel currently supports the following incoming data stream formats: variable length delimited, CSV, fixed length, Apache CLF, and anything else that Perl pack/unpack can handle.
Pequel has a multitude of uses:
Use Pequel to output selected columns from an input data stream.
Output selected records based on filtering conditional statements. Full Perl regular expressions are available.
Derive new columns using simple to complex expressions. Perform calculations on input fields to generate new (derived) fields, using Perl expressions. Calculations can be performed on both numeric fields (mathematical) and string fields (such as concatenation, substr, etc).
Records with similar characteristics can be grouped together. Calculate aggregations, such as max, min, mean, sum, and count, on grouped record sets.
Grouping can be performed in memory on unsorted input data using the hash option.
Pequel provides a comprehensive array of statistical aggregate functions.
Pequel can be effectively used for checking and resolving invalid data.
TBD
Convert data using any of the built-in macros and Perl regular expressions. Any kind of data conversion can be performed, including converting from one data type to another, reformatting, case change, splitting a field into two or more fields, combining two or more fields into one field, converting date fields from one date format to another, and padding.
Data can be distributed based on conditions to multiple Pequel processes.
Data output from multiple Pequel processes can be combined into the incoming data stream.
Data from any number of external files or other Pequel processes can be merged via the Pequel tables facility.
The output from one Pequel process can be piped into a second Pequel process simply by specifying the first script name as the input_file property for the second script.
Pequel supports array fields and provides a comprehensive set of array macros to manipulate or generate array fields.
Direct access to database (Oracle, Sqlite, etc) tables via the Pequel table facility. Pequel will generate low level database API code. Currently supported databases are Oracle (via OCI), and Sqlite.
Execute pequel with scriptfile.pql script to process file_in data file, resulting in file_out. The scriptfile.pql will contain the transformation instructions.
Check the syntax of the pequel script scriptfile.pql.
Generate and display the code for the pequel script scriptfile.pql.
Generate the pequel code for the script scriptfile.pql and save the generated code in the file scriptfile.pql.2.code.
Display version information for Pequel.
Display Pequel usage command summary.
Generate the Script Reference document in pdf format for the Pequel script scriptfile.pql. The document will include a section showing the generated code (-detail).
Use your preferred text editor to create a pequel script myscript.pql. Syntax highlighting is available for vim with the pequel.vim syntax file (in vim/syntax) -- copy the pequel.vim file into the syntax directory of the vim installation.
All that is required is to fill in, at least, the output section, or to specify the transfer option. The transfer option has the effect of copying all input field values to the output. This is effectively a straight-through process -- the resulting output is identical to the input.
options
    transfer

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION

output section
Do a syntax check on the script by using the Pequel -c option. This should return the words myscript.pql Syntax OK.
pequel -c myscript.pql
Optionally, the generated Perl program can be dumped and viewed. The program will be dumped in a file with the same name and path as the script with a '.2.code' suffix.
pequel -dumpcode myscript.pql
Processing pequel script 'myscript.pql'...................
->myscript.pql.2.code
If syntax check is ok, run the script -- the sample.data data file in the examples directory can be used:
pequel myscript.pql < inputdata > outputdata
pequel myscript.pql
We next do something useful to transform the input data. Create a filter to output a subset of records, consisting of records which have LOCATION starting with 10. The filter example uses the Perl regular expression =~ /^10/ to match the content of the LOCATION field. This is specified in the filter section. Check and run the updated script as instructed above:
options
    transfer

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION

filter
    LOCATION =~ /^10/
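To make the effect of the filter concrete, here is a minimal Python sketch of the same selection. It is not the code Pequel generates; the pipe-delimited sample records and the helper name filter_location are assumptions made for illustration.

```python
import re

# Field order follows the input section above; the sample records are hypothetical.
FIELDS = ["PRODUCT", "COST_PRICE", "DESCRIPTION", "SALES_CODE",
          "SALES_PRICE", "QUANTITY", "SALES_DATE", "LOCATION"]

def filter_location(lines, pattern=r"^10"):
    """Keep only the records whose LOCATION field matches the pattern."""
    loc = FIELDS.index("LOCATION")
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line.split("|")[loc])]

sample = [
    "P001|1.50|Widget|R|2.00|10|20020121|1001",
    "P002|3.00|Gadget|W|4.50|5|20020122|2005",
]
kept = filter_location(sample)  # only the record with LOCATION 1001 remains
```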
Create additional, derived fields based on the other input fields. In our example, two new fields are added: COST_VALUE and SALES_VALUE. Derived fields must be specified in the input section after the last input field. The derived field name is followed by the => operator and a calculation expression. Derived fields will also be output when the transfer option is specified.
options
    transfer

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION,
    COST_VALUE => COST_PRICE * QUANTITY,
    SALES_VALUE => SALES_PRICE * QUANTITY

filter
    LOCATION =~ /^10/

output section
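The derived-field calculation can be pictured as a per-record step that runs after the input fields are read. The following Python sketch only illustrates that idea; the dictionary record layout and the function name add_derived are hypothetical and do not reflect Pequel's generated code.

```python
def add_derived(record):
    """Return a copy of the record with the two derived fields appended."""
    out = dict(record)
    out["COST_VALUE"] = record["COST_PRICE"] * record["QUANTITY"]
    out["SALES_VALUE"] = record["SALES_PRICE"] * record["QUANTITY"]
    return out

# Hypothetical record holding only the fields the calculations need.
row = add_derived({"COST_PRICE": 1.5, "SALES_PRICE": 2.0, "QUANTITY": 10})
```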
In the above examples, the output record has the same (field) format as the input record, plus the additional derived fields. In the following example we select which fields to output, and their order, on the output record. To do this we need to remove the transfer option, and create the output section. The output fields PRODUCT, LOCATION, DESCRIPTION, QUANTITY, COST_VALUE, and SALES_VALUE are specified to create a new output format. In this example, all the output field names have the same name as the input fields.
options

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION,
    COST_VALUE => COST_PRICE * QUANTITY,
    SALES_VALUE => SALES_PRICE * QUANTITY

filter
    LOCATION =~ /^10/

output section
    string PRODUCT PRODUCT,
    string LOCATION LOCATION,
    string DESCRIPTION DESCRIPTION,
    numeric QUANTITY QUANTITY,
    decimal COST_VALUE COST_VALUE,
    decimal SALES_VALUE SALES_VALUE
Records with similar characteristics can be grouped together, and aggregations can then be performed on the grouped records' data. The following example groups the records by LOCATION, and sums the COST_VALUE and SALES_VALUE fields within each group. Grouping is activated by creating a group by section. Input data must also be sorted on the grouping field(s). If the data is not pre-sorted then this needs to be done in the script by creating a sort by section. Alternatively, by specifying the hash option, the input data need not be sorted.
options

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION,
    COST_VALUE => COST_PRICE * QUANTITY,
    SALES_VALUE => SALES_PRICE * QUANTITY

filter
    LOCATION =~ /^10/

sort by
    LOCATION

group by
    LOCATION

output section
    string LOCATION LOCATION,
    string PRODUCT PRODUCT,
    string DESCRIPTION DESCRIPTION,
    numeric QUANTITY QUANTITY,
    decimal COST_VALUE sum COST_VALUE,
    decimal SALES_VALUE sum SALES_VALUE
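The difference between sorted (break) grouping and hash grouping may be easier to see in a small sketch. The Python below is an illustration under assumptions, not Pequel's implementation: the records are hypothetical dictionaries, and sum_sorted and sum_hashed are names invented here.

```python
from itertools import groupby
from operator import itemgetter

def sum_sorted(records):
    """Break processing: a new group starts whenever LOCATION changes,
    so the input must already be sorted on LOCATION."""
    return [(loc, sum(r["SALES_VALUE"] for r in group))
            for loc, group in groupby(records, key=itemgetter("LOCATION"))]

def sum_hashed(records):
    """Hash-style grouping: totals are kept in memory per LOCATION,
    so the input order does not matter."""
    totals = {}
    for r in records:
        totals[r["LOCATION"]] = totals.get(r["LOCATION"], 0) + r["SALES_VALUE"]
    return sorted(totals.items())

records = [
    {"LOCATION": "1002", "SALES_VALUE": 30},
    {"LOCATION": "1001", "SALES_VALUE": 20},
    {"LOCATION": "1002", "SALES_VALUE": 5},
]
by_sort = sum_sorted(sorted(records, key=itemgetter("LOCATION")))
by_hash = sum_hashed(records)
unsorted_breaks = sum_sorted(records)  # location 1002 is split into two groups
```

Running the break-style function on unsorted input yields a separate group each time the key changes, which is why sorted input (or a sort by section) is needed unless the hash option is used.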
A subset of groups can be selected by creating a having section. The having section is similar to the filter section, but it is applied to the aggregated group of records rather than to individual input records. In this example we will output only records for locations which have a total SALES_VALUE of 1000 or more. Note that SALES_VALUE in the having section refers to the output field (sum SALES_VALUE) and not the input field with the same name (SALES_PRICE * QUANTITY). The having section gives preference to output fields when interpreting field names.
options

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION,
    COST_VALUE => COST_PRICE * QUANTITY,
    SALES_VALUE => SALES_PRICE * QUANTITY

filter
    LOCATION =~ /^10/

sort by
    LOCATION

group by
    LOCATION

output section
    string LOCATION LOCATION,
    string PRODUCT PRODUCT,
    string DESCRIPTION DESCRIPTION,
    numeric QUANTITY QUANTITY,
    decimal COST_VALUE sum COST_VALUE,
    decimal SALES_VALUE sum SALES_VALUE

having
    SALES_VALUE >= 1000
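The following Python sketch contrasts the two stages: the filter condition is tested against each input record, while the having condition is tested against each group total. It is an illustration only, with hypothetical sample data and a hypothetical function name.

```python
def location_totals(records, minimum=1000):
    """Sum SALES_VALUE per LOCATION, then keep only the qualifying groups."""
    totals = {}
    for r in records:
        # filter stage: applied to every input record
        if not r["LOCATION"].startswith("10"):
            continue
        totals[r["LOCATION"]] = totals.get(r["LOCATION"], 0) + r["SALES_VALUE"]
    # having stage: applied to the aggregated value of each group
    return {loc: total for loc, total in totals.items() if total >= minimum}

records = [
    {"LOCATION": "1001", "SALES_VALUE": 600},
    {"LOCATION": "1001", "SALES_VALUE": 500},
    {"LOCATION": "1002", "SALES_VALUE": 900},
    {"LOCATION": "2005", "SALES_VALUE": 5000},
]
selected = location_totals(records)
```

Location 1001 is selected even though neither of its records reaches 1000 on its own, because the condition is evaluated on the group total.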
Output fields can be aggregated conditionally. That is, the aggregation will only occur for records, within the group, that evaluate the condition to true. This is done by adding a where clause to the aggregate function. In this example we create three new output fields SALES_VALUE_RETAIL, SALES_VALUE_WSALE and SALES_VALUE_OTHER. These fields will contain the sales value for records within the group which have sales code equal to 'R', 'W', and other codes, respectively.
options

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION,
    COST_VALUE => COST_PRICE * QUANTITY,
    SALES_VALUE => SALES_PRICE * QUANTITY

filter
    LOCATION =~ /^10/

sort by
    LOCATION

group by
    LOCATION

output section
    string LOCATION LOCATION,
    string PRODUCT PRODUCT,
    string DESCRIPTION DESCRIPTION,
    numeric QUANTITY QUANTITY,
    decimal COST_VALUE sum COST_VALUE,
    decimal SALES_VALUE sum SALES_VALUE,
    decimal SALES_VALUE_RETAIL sum SALES_VALUE where SALES_CODE eq 'R',
    decimal SALES_VALUE_WSALE sum SALES_VALUE where SALES_CODE eq 'W',
    decimal SALES_VALUE_OTHER sum SALES_VALUE where SALES_CODE ne 'R' and SALES_CODE ne 'W'
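As a rough picture of what the three where clauses compute for one group, consider this Python sketch. The sample group and the function name conditional_sums are assumptions; the sketch does not claim to match the code Pequel generates.

```python
def conditional_sums(group):
    """Sum SALES_VALUE into one of three buckets according to SALES_CODE."""
    sums = {"SALES_VALUE_RETAIL": 0.0,
            "SALES_VALUE_WSALE": 0.0,
            "SALES_VALUE_OTHER": 0.0}
    for record in group:
        if record["SALES_CODE"] == "R":
            sums["SALES_VALUE_RETAIL"] += record["SALES_VALUE"]
        elif record["SALES_CODE"] == "W":
            sums["SALES_VALUE_WSALE"] += record["SALES_VALUE"]
        else:
            sums["SALES_VALUE_OTHER"] += record["SALES_VALUE"]
    return sums

# Hypothetical records belonging to a single LOCATION group.
group = [
    {"SALES_CODE": "R", "SALES_VALUE": 20.0},
    {"SALES_CODE": "W", "SALES_VALUE": 45.0},
    {"SALES_CODE": "X", "SALES_VALUE": 5.0},
    {"SALES_CODE": "R", "SALES_VALUE": 10.0},
]
sums = conditional_sums(group)
```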
An output derived field, whose calculation is based on other output fields, can be created by declaring an output field with the = calculation expression.
options

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION,
    COST_VALUE => COST_PRICE * QUANTITY,
    SALES_VALUE => SALES_PRICE * QUANTITY

filter
    LOCATION =~ /^10/

sort by
    LOCATION

group by
    LOCATION

output section
    string LOCATION LOCATION,
    string PRODUCT PRODUCT,
    string DESCRIPTION DESCRIPTION,
    numeric QUANTITY QUANTITY,
    numeric TOTAL_QUANTITY sum QUANTITY,
    decimal COST_VALUE sum COST_VALUE,
    decimal SALES_VALUE sum SALES_VALUE,
    decimal SALES_VALUE_RETAIL sum SALES_VALUE where SALES_CODE eq 'R',
    decimal SALES_VALUE_WSALE sum SALES_VALUE where SALES_CODE eq 'W',
    decimal SALES_VALUE_OTHER sum SALES_VALUE where SALES_CODE ne 'R' and SALES_CODE ne 'W',
    decimal AVG_SALES_VALUE = SALES_VALUE / TOTAL_QUANTITY
Note
In order to protect against a divide-by-zero exception, the AVG_SALES_VALUE field would be better declared as follows. This form uses the Perl conditional (?:) operator. If TOTAL_QUANTITY is zero, AVG_SALES_VALUE is set to zero; otherwise it is set to SALES_VALUE / TOTAL_QUANTITY. Thus, the division is performed only for a non-zero TOTAL_QUANTITY.
decimal AVG_SALES_VALUE = TOTAL_QUANTITY == 0 ? 0.0 : SALES_VALUE / TOTAL_QUANTITY
In the previous example, supposing that the TOTAL_QUANTITY field was not required in the output, it could be made transparent by declaring it with an underscore (_) prefix. Transparent output fields are useful for creating intermediate fields required for calculations.
numeric _TOTAL_QUANTITY sum QUANTITY, decimal AVG_SALES_VALUE = SALES_VALUE / _TOTAL_QUANTITY
Data can be cleaned in a variety of ways, and invalid records placed in a reject file. The following example determines the validity of a record by a) the length of certain fields, and b) the content of the QUANTITY field. The PRODUCT and LOCATION fields must be at least 8 and 2 characters long, respectively; the QUANTITY field must contain only numeric digits, the decimal point and the minus sign. The rejected records will be placed in the reject file called scriptname.reject.
options
    transfer

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION

reject
    length(PRODUCT) < 8 || length(LOCATION) < 2,
    QUANTITY !~ /^[0-9\.\-]+$/
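The validity test can be sketched in Python as follows. This is an illustration only; it assumes that a record is rejected when any one of the reject items is true, and the record layout and the name is_rejected are hypothetical.

```python
import re

# Same character class as the reject item above: digits, decimal point, minus sign.
VALID_QUANTITY = re.compile(r"^[0-9.\-]+$")

def is_rejected(record):
    """Return True when the record fails either validity check."""
    too_short = len(record["PRODUCT"]) < 8 or len(record["LOCATION"]) < 2
    bad_quantity = VALID_QUANTITY.search(record["QUANTITY"]) is None
    return too_short or bad_quantity

good = {"PRODUCT": "ABCDEFGH", "LOCATION": "10", "QUANTITY": "-12.5"}
short_product = {"PRODUCT": "ABC", "LOCATION": "10", "QUANTITY": "3"}
bad_quantity = {"PRODUCT": "ABCDEFGH", "LOCATION": "10", "QUANTITY": "12a"}
results = [is_rejected(r) for r in (good, short_product, bad_quantity)]
```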
Any sort of data conversion can be performed, including converting from one data type to another, reformatting, case change, splitting a field into two or more fields, combining two or more fields into one field, converting date fields from one date format to another, and padding. The following script demonstrates these data conversions.
options

input section
    PRODUCT,
    COST_PRICE,
    DESCRIPTION,
    SALES_CODE,
    SALES_PRICE,
    QUANTITY,
    SALES_DATE,
    LOCATION

output section
    string PRODUCT_U = &uc(PRODUCT),  // Convert case to upper
    string DESCRIPTION_U = &uc(DESCRIPTION),  // Convert case to upper
    string PCODE_1 = &substr(PRODUCT,0,2),  // Split field
    string PCODE_2 = &substr(PRODUCT,2,4),  // ""
    string ANALYSIS_1 = SALES_CODE . sprintf("%08d", COST_PRICE),  // Combine fields
    string S_QUANTITY = sprintf("%08d", QUANTITY)  // Reformat/Convert field
    string NEW_PRODUCT = PCODE_2 . PCODE_1 . &substr(PRODUCT,6)  // Reformat
    decimal SALES_PRICE SALES_PRICE  // no change
    decimal SALES_CODE SALES_CODE  // no change
    string LOCATION LOCATION  // no change
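For readers more familiar with other languages, the conversions above correspond roughly to the following Python operations. This is a sketch under the assumption that &uc and &substr behave like Perl's uc and substr (offset, then length); the sample record and the function name convert are hypothetical.

```python
def convert(record):
    """Apply a few of the conversions from the script above to one record."""
    product = record["PRODUCT"]
    pcode_1 = product[0:2]        # &substr(PRODUCT,0,2): offset 0, length 2
    pcode_2 = product[2:2 + 4]    # &substr(PRODUCT,2,4): offset 2, length 4
    return {
        "PRODUCT_U": product.upper(),
        "PCODE_1": pcode_1,
        "PCODE_2": pcode_2,
        "ANALYSIS_1": record["SALES_CODE"] + "%08d" % record["COST_PRICE"],
        "S_QUANTITY": "%08d" % record["QUANTITY"],
        "NEW_PRODUCT": pcode_2 + pcode_1 + product[6:],
    }

converted = convert({"PRODUCT": "abcdefgh", "SALES_CODE": "R",
                     "COST_PRICE": 150, "QUANTITY": 42})
```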
TBC
To view the generated Perl code use the Pequel -viewcode option:
pequel -viewcode scriptname.pql | more
To dump the generated Perl code use the Pequel -dumpcode option. This will save the generated Perl program in a file named script_name.2.code. So, if your script is called myscript.pql, the generated Perl program will be saved in the file myscript.pql.2.code, in the same path:
pequel -dumpcode scriptname.pql
Use the Pequel -pequeldoc pdf option to produce a presentation script specification for the Pequel script. The generated pdf document will be saved in a file with the same name as the script but with the file extension changed from pql to pdf.
pequel scriptname.pql -pequeldoc pdf
Use the -detail option to include the generated code in the document.
pequel scriptname.pql -pequeldoc pdf -detail
The -list option displays the parsed details from the script in a summarised format.
pequel scriptname.pql -list
Prefix for filenames directory path
Display progress counter
Do not display progress counter
Input data filename
Display command usage description
Output data filename
Script filename
Write header record to output.
Generate pod / pdf pequel script Reference Guide.
Display the generated Perl code for pequel script
Dump the generated Perl code for pequel script
Check the pequel script for syntax errors
Display Pequel Version information
Display Table information for all tables declared in the pequel script
Override the default cpp command name and any additional arguments required.
A Pequel script is divided into sections. Each section begins with a section name, which appears on a line of its own, followed by a list of items. Each item line must be terminated by a newline or a comma (or both). To split an item line into multiple lines (for better readability), use the line continuation character \.
Pequel is event driven. Each section within a Pequel script describes an event. For example, the input section is activated whenever an input record is read; the output section is activated whenever an aggregation is performed.
The sections must appear in the order described below. A minimal script must contain input section and output section, or, input section and transfer option. All other sections are optional, and need only appear in the Pequel script if they contain statements.
The main sections are input section and output section. The input section defines the format, in fields, of the input data stream. It can also define new calculated (derived) fields. The output section defines the format of the output data stream. The output section is required in order to perform aggregation. The output section will consist of input fields, aggregations based on grouping the input records, and new calculated fields.
Input sorting can be specified with the sort by section. Break processing (grouping) can be specified with the group by section. Input filtering is specified with the filter section. Groups of records can be filtered with the having section.
A powerful feature of Pequel is its built-in tables feature. Tables consist of key and value pairs and are used to perform merges and joins on multiple input data sources. They can also be used to access external data for cross-referencing and value lookups.
Pequel also handles a number of date field formats. The &date() macro provides access to date fields.
Any text following and including the # symbol or // is considered comment text. If the cpp preprocessor is available, comments are limited to C-style comments (// and /* ... */), because # becomes a macro directive.
Each item within a section must appear on a single line. To break up an item statement (for better readability), use the line continuation character \.
If your system provides the cpp preprocessor, your Pequel script may include any C/C++ style macros and defines.
Specify properties.
This section contains free-format text to describe the function of the script.
Specify any external Perl package modules.
The items within this section consist of input data stream field names followed by any derived field definitions.
Specify any input field pre-processing which will occur before the field is referenced by any derived field.
The filter section specifies one or more condition item statements which will be used to match incoming data records and filter out any records that do not match all the condition item statements.
If the input record matches any of the condition item statements then divert the record to the specified Pequel process or file.
If the input record matches any of the condition item statements then copy the record to the specified Pequel process or file.
If the input record matches any of the condition item statements then display the specified message to stderr.
If the input record matches any of the condition item statements then display the specified message to stderr then exit the process.
The sort by section contains a list of input field items with optional type and sort order specifications. These fields specify the sort ordering for the input data stream.
The group by section contains a list of input field items with optional type specification. These fields specify the grouping requirements for the input data stream.
The output section contains a list of output field definitions.
Specify any output field post-processing.
The having section specifies one or more condition item statements which will be used to match output data records and filter out any records that do not match all the condition item statements.
If the output record matches any of the condition item statements then divert the record to the specified Pequel process or file.
If the output record matches any of the condition item statements then copy the record to the specified Pequel process or file.
If the output record matches any of the condition item statements then display the specified message to stderr.
If the output record matches any of the condition item statements then display the specified message to stderr then exit the process.
Initialise local tables.
Load and initialise external tables.
Load table from output of external Pequel script.
This section is used to declare various options described in detail below. Options define the overall character of the data transformation.
options <option> [ (<arg>) ] [, ...]
options
    input_delimiter(\s+),  # one or more space(s) delimit input fields.
    verbose(100000),  # print progress on every 100000'th input record.
    optimize,
    varnames,
    default_date_type(DD/MM/YY),
    nonulls,
    diag
Set the verbose option to display progress information to STDERR during the transform run. Requires one parameter. This will instruct Pequel to display a counter message on specified number of records read from input.
Suppress all processing messages to stderr.
Specify a prefix path. The prefix will be used with all external file names unless the name starts with a '/'.
Specify the character that is used to delimit columns in the input data stream. This is usually the pipe | character, but can be any character including the space character. For multiple spaces use \s+, and for multiple tabs use \t+. This input delimiter will default to the pipe character if input_delimiter is not specified.
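The effect of a single-character delimiter versus a pattern such as \s+ can be sketched as below. This Python example assumes the delimiter is treated as a regular expression, which the \s+ and \t+ forms suggest but which is not confirmed here; split_record is a hypothetical helper.

```python
import re

def split_record(line, delimiter=r"\|"):
    """Split one input line into fields; the pipe is the default delimiter."""
    return re.split(delimiter, line.rstrip("\n"))

pipe_fields = split_record("P001|1.50|Widget\n")
space_fields = split_record("P001   1.50 Widget", r"\s+")
```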
Specify the character that will delimit columns in the output. The output delimiter will default to the input delimiter if not specified. Refer to input_delimiter above for more information regarding types of delimiters.
If the input data stream contains an initial header record then this option must be specified in order to discard this record from the processing.
Specify the file name as a parameter. If specified, the input data will be read from this file; otherwise it will be read from STDIN. If the input_file option contains a Pequel script name (anything ending in .pql) then the output from executing this input script will be chained to produce the input data stream.
Specify the file name as a parameter. If specified, the output will be written to this file (the file will be overwritten!); otherwise it will be sent to STDOUT.
Copy the input record to output. The input record is copied as is, including calculated fields, to the output record. Fields specified in the output section are placed after the input fields. The transfer option is not available when group by is in use.
Use hash processing mode. Hash mode is only available when break processing is activated with 'group by'. In hash mode input data need not be sorted. Because this mode of processing is memory intensive, it should only be used when generating a small number of groups. The optional 'numeric' modifier can be specified to sort the output numerically; if not specified, a string sort is done.
If specified then an initial header record will be written to output. This header record contains the output field names. By default a header record will be output if neither header nor noheader is specified.
Specify this option to suppress writing of header record.
Specify this option to add an extra delimiter character after the last field. This is the default action if neither addpipe nor noaddpipe is specified.
Specify this option to suppress adding an extra delimiter character after the last field.
If specified the generated Perl code will be optimized to run more efficiently. This optimisation is done by grouping similar where conditions into if-else blocks. Thus if a number of where clauses contain the same condition, these statements will be grouped under one if condition. The optimize option should only be used by users with some knowledge of Perl.
Specify this option to prevent code from being optimised. This is the default setting.
If specified, numeric and decimal values with a zero/null value will be output as null character. This is the default setting.
If specified, numeric and decimal values with a zero/null value will be output as 0.
Use this option to specify a file name to contain the rejected records. These are records that are rejected by the filter specified in the reject section. If no reject file option is specified then the default reject file name is the script file name with .reject appended.
Set this option to save the generated code in scriptname.2.code files. The scriptname.2.code file contains the generated Perl program that will process the input data stream. This program can be executed independently of Pequel.
Specify a default date type. Currently supported date types are: YYYYMMDD, YYMMDD, DDMMYY, DDMMMYY, DDMMYYYY, DD/MM/YY, DD/MM/YYYY, and US date formats: MMDDYY, MMDDYYYY, MM/DD/YY, MM/DD/YYYY. The DDMMMYY format refers to dates such as 21JAN02.
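To show how a few of these date types relate to one another, here is a small Python sketch that converts a value between formats. The mapping to strptime format strings is an assumption made for illustration and covers only four of the listed types; it says nothing about how Pequel's &date() macro is implemented.

```python
from datetime import datetime

# Hypothetical mapping from some of the date types above to strptime formats.
STRPTIME_FORMATS = {
    "YYYYMMDD": "%Y%m%d",
    "DDMMMYY": "%d%b%y",
    "DD/MM/YYYY": "%d/%m/%Y",
    "MM/DD/YYYY": "%m/%d/%Y",
}

def convert_date(value, from_type, to_type):
    """Parse value as from_type and render it as to_type."""
    parsed = datetime.strptime(value, STRPTIME_FORMATS[from_type])
    return parsed.strftime(STRPTIME_FORMATS[to_type])

iso_style = convert_date("21JAN02", "DDMMMYY", "YYYYMMDD")
us_style = convert_date("21/01/2002", "DD/MM/YYYY", "MM/DD/YYYY")
```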
Specify the default list delimiter for array fields created by values_all and values_uniq aggregates. Any delimiter specified as a parameter to the aggregate function will override this.
If the input file is in DOS format, specify 'rmctrlm' option to remove the Ctrl-M at end of line.
Specify number of records to process from input file. Processing will stop after the number of records as specified have been read.
Use this option when summary section is used to prevent output of raw results.
Generate PDF for Programmer's Reference Manual for the Pequel script. The next three options are also required.
Specify the title that will appear on the pequeldoc generated manual.
Specify the user's email that will appear on the pequeldoc generated manual.
Specify the Pequel script version number that will appear on the pequeldoc generated manual.
Override the default gzcat command name and any additional arguments required.
Override the default cat command name and any additional arguments required.
Override the default sort command name and any additional arguments required.
The output data stream can be packed using the format specified in the output_pack_fmt. These properties can also be used to produce fixed format and binary output. The default format is A3/Z* repeated for each output field. Please refer to the Perl perlpacktut manual for a detailed description of formats.
The packed input data stream can be unpacked using the format specified in the input_pack_fmt. These properties can also be used to read fixed format and binary input. The default format is A3/Z* repeated for each input field. Please refer to the Perl perlpacktut manual for a detailed description of formats.
The following options require that the Inline::C Perl module and a C compiler are installed on your system.
The use_inline option will instruct Pequel to generate (and compile/link) C code -- replacing the input file identifier inside the main while loop by a readsplit() function call. The readsplit function is implemented in C.
Specify one or more extra field delimiter characters. These may be any one quote character (', ", or `) and, optionally, any one bracket character ({, [, or (). For example, this option can be used to parse Apache log files in CLF format:
options input_delimiter_extra("[) // Apache CLF log quoted fields and bracketed timestamp
Tells Inline to clean up the current build area if the build was successful. Sometimes you want to DISABLE this for debugging. Default is 1.
Tells Inline to clean up the old build areas within the entire Inline DIRECTORY. Default is 0.
Tells Inline to print various information about the source code. Default is 0.
Tells ILSMs that they should dump build messages to the terminal rather than be silent about all the build details.
Tells ILSMs to print timing information about how long each build phase took. Usually requires Time::HiRes
Makes Inline build (compile) the source code every time the program is run. The default is 0.
The DIRECTORY config option is the directory that Inline uses to both build and install an extension.
Normally Inline will search in a bunch of known places for a directory called '.Inline/'. Failing that, it will create a directory called '_Inline/'
If you want to specify your own directory, use this configuration option.
Note that you must create the DIRECTORY directory yourself. Inline will not do it for you.
Specify which compiler to use.
This controls the MakeMaker OPTIMIZE setting. By setting this value to '-g', you can turn on debugging support for your Inline extensions. This will allow you to be able to set breakpoints in your C code using a debugger like gdb.
Specify extra compiler flags.
Specifies external libraries that should be linked into your code.
Specifies an include path to use. Corresponds to the MakeMaker parameter.
Specify which linker flags to use.
NOTE: These flags will completely override the existing flags, instead of just adding to them. So if you need to use those too, you must respecify them here.
Specify the name of the 'make' utility to use.
Use this section to specify Perl packages to use. This section is optional.
use package <Perl package name> [, ...]
use package Benchmark, EasyDate
Use init table to initialise tables in the Pequel script. This section consists of a list of table names, each followed by a key and a value (or value list). The key must not contain any spaces. To avoid clutter in the script, use load table, described below. To look up a table key/value use the %table name(key) syntax. Table column values are accessed with the %table name(key)->n syntax, where n is a column number starting from 1. The column specification is not required for single-value tables. All entries within a table should have the same number of values; empty values can be declared with a null quoted value (''). This section is optional.
init table <table> <key> <value> [, <value>...]
init table
    // Table-Name  Key-Value  Field-1  Field-2  Field-3
    LOCINFO  NSW  'New South Wales'  '2061'  '02'
    LOCINFO  WA  'Western Australia'  '5008'  '07'
    LOCINFO  SA  'South Australia'  '8078'  '08'

input section
    LOCATION,
    LDESCRIPT => %LOCINFO(LOCATION)->1 . " in postcode " . %LOCINFO(LOCATION)->2
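Conceptually, the table above behaves like a keyed lookup with 1-based value columns. The Python sketch below mirrors the LDESCRIPT calculation; it is an illustration only, and returning an empty string for a missing key is an assumption rather than documented Pequel behaviour.

```python
# The same entries as the init table example, keyed by the table key.
LOCINFO = {
    "NSW": ("New South Wales", "2061", "02"),
    "WA": ("Western Australia", "5008", "07"),
    "SA": ("South Australia", "8078", "08"),
}

def lookup(table, key, column=1):
    """Return value column n (1-based) for key; '' if the key is absent (assumed)."""
    values = table.get(key)
    if values is None:
        return ""
    return values[column - 1]

def ldescript(location):
    """Mirror of the LDESCRIPT derived field in the example above."""
    return lookup(LOCINFO, location, 1) + " in postcode " + lookup(LOCINFO, location, 2)

description = ldescript("WA")
```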
Use this section to declare tables that are to be initialised from an external data file. If the table is in .tbl format (key|value) then only the table name (without the .tbl) need be specified. The filename can consist of the full path name. Compressed files (ending in .gz, .z, .Z, .zip) will be handled properly. If the key column is not specified then it is set to 1 by default; if the value column is not specified then it is set to 2 by default. Column numbers are 1-based. To look up a table key/value use the %table name(key) syntax. If the table name is prefixed with the _ character, the table will be loaded at runtime instead of compile time. Thus the table contents will not appear in the generated code. This is useful if the table contains more than a few hundred entries, as it will not clutter up the generated code.
The persistant option makes the table disk-based instead of memory-based. Use this option for tables that are too big to fit in available memory. The disk-based table snapshot file will have the name _TABLE_name.dat, where name is the table name. When the persistant option is used, the table is generated only once, the first time it is used. Thereafter it is loaded from the snapshot file. This is a lot quicker and therefore useful for large tables. To regenerate the table, the snapshot file must be manually deleted. To use the persistant option the Perl DB_File module must be available. The effect of persistant is to tie the table's associative array to a DBM database (Berkeley DB). Note that using persistant tables will reduce the overall performance of the script.
load table [ persistant ] <table> [ <filename> [ <key_col> [ <val_col> ] ] ] [, ...]
load table
    POSTCODES
    MONTH_NAMES /data/tables/month_names.tbl
    POCODES pocodes.gz 1 2
    ZIPSAMPLE zipsample.txt 3 21
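The key and value column defaults can be illustrated with a short Python sketch. It is not how Pequel loads tables internally: `load_table` and `open_table_file` are hypothetical helpers, the `|` delimiter is taken from the .tbl format mentioned above and assumed for other files, and only .gz decompression is shown.

```python
import gzip


def load_table(lines, key_col=1, val_col=2, delimiter="|"):
    """Build a key -> value dict from delimited lines.

    key_col and val_col are 1-based, defaulting to 1 and 2 as in load table.
    """
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        table[fields[key_col - 1]] = fields[val_col - 1]
    return table


def open_table_file(path):
    """Open a table file as text, decompressing .gz files.

    Other compressed formats are left out of this sketch.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```

A declaration such as `ZIPSAMPLE zipsample.txt 3 21` would correspond to `load_table(open_table_file("zipsample.txt"), key_col=3, val_col=21)` in this model.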
This section defines the format of the input data stream. Any calculated fields must be placed after the last input field. The calculation expression must begin with => and consists of (almost) any valid Perl statement, and can include input field names. All macros are also available to calculation expressions. The input section must appear before all the sections described below. Each input field name must be unique.
input section <input field name> [ => <calculation expression> ] [, ...]
input section
    ACL,
    AAL,
    ZIP,
    CALLDATE,
    CALLS,
    DURATION,
    REVENUE,
    DISCOUNT,
    KINSHIP_KEY,
    INV => REVENUE + DISCOUNT,
    MONTH_CALLDATE => &month(CALLDATE),
    GROUP => MONTH_CALLDATE <= 6 ? 1 : 2,
    POSTCODE => %POSTCODES(AAL),
    IN_SAMPLE => exists %ZIPSAMPLE(ZIP),
    IN_SAMPLE_2 => exists %ZIPSAMPLE(ZIP) ? 'yes': 'no'
Use this section to perform additional formatting/processing on input fields. These statements will be performed right after the input record is read and before calculating the input derived fields.
Use this section to perform additional formatting/processing on output fields. These statements will be performed after the aggregations and just prior to the output of the aggregated record.
Use this section to sort the input data by field(s). One or more sort fields can be specified. This section must appear after the input section and before the group by and output sections. The numeric option specifies a numeric sort, and the desc option specifies a descending sort order. The standard Unix sort command is used to perform the sort. The numeric option is translated to the -n Unix sort option; the desc option is translated to the -r Unix sort option. If the input data is pre-sorted then the sort by section is not required (even if break processing is activated with a group by section declaration). The sort by section is not required when the hash option is specified.
sort by <field name> [ numeric ] [ desc ] [, ...]
sort by ACL, AAL numeric desc
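As a rough illustration of how sort by options could map onto Unix sort flags, the Python sketch below builds an argument list. The guide does not show the command line Pequel actually assembles, so the `sort_command` helper, the `|` field delimiter, and the use of per-key `n` and `r` modifiers (standard sort behaviour) are all assumptions.

```python
def sort_command(input_fields, sort_spec, delimiter="|"):
    """Build an argv list for Unix sort.

    input_fields: field names in input order.
    sort_spec: (field name, numeric, desc) triples, one per sort by field.
    """
    argv = ["sort", "-t", delimiter]
    for field, numeric, desc in sort_spec:
        pos = input_fields.index(field) + 1  # sort key positions are 1-based
        flags = ("n" if numeric else "") + ("r" if desc else "")
        argv += ["-k", f"{pos},{pos}{flags}"]
    return argv


# Models: sort by ACL, AAL numeric desc
EXAMPLE = sort_command(
    ["ACL", "AAL", "ZIP"],
    [("ACL", False, False), ("AAL", True, True)],
)
```

The resulting list can be passed to `subprocess.run` with the input file appended as the last argument.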
Specify one or more filter expressions. A filter expression can consist of any valid Perl statement, and must evaluate to Boolean true or false (0 is false, anything else is true). It can contain input field names and macros. Each input record is evaluated against the filter(s). Records that evaluate to true on any one filter will be rejected and written to the reject file. The reject file is named scriptname.reject unless specified in the reject_file option.
reject <filter expression> [, ...]
reject
    !exists %ZIPSAMPLE(ZIP)
    INV < 200
Specify one or more filter expressions. A filter expression can consist of any valid Perl statement, and must evaluate to Boolean true or false. It can contain input field names and macros. Each input record is evaluated against the filter(s). Only records that evaluate to true on all filter statements will be processed; that is, records that evaluate to false on any one filter statement will be discarded.
filter <filter expression> [, ...]
filter
    exists %ZIPSAMPLE(ZIP)
    ACL =~ /^356/
    ZIP eq '52101' or ZIP eq '52102'
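The reject and filter sections combine their expressions differently: reject acts when any expression is true, while filter keeps a record only when all expressions are true. The Python sketch below contrasts the two. It is an illustration only: the `classify` helper and the sample table contents are hypothetical, and evaluating reject before filter is an assumption.

```python
ZIPSAMPLE = {"52101": 1, "52102": 1}  # hypothetical lookup table contents

# Models the reject expressions: !exists %ZIPSAMPLE(ZIP) and INV < 200
REJECT_TESTS = [
    lambda rec: rec["ZIP"] not in ZIPSAMPLE,
    lambda rec: rec["INV"] < 200,
]

# Models the filter expression: ACL =~ /^356/
FILTER_TESTS = [
    lambda rec: rec["ACL"].startswith("356"),
]


def classify(rec, reject_tests=REJECT_TESTS, filter_tests=FILTER_TESTS):
    """Return 'rejected', 'discarded' or 'processed' for one input record."""
    # reject: true on ANY expression sends the record to the reject file.
    if any(test(rec) for test in reject_tests):
        return "rejected"
    # filter: the record must be true on ALL expressions to be processed.
    if not all(test(rec) for test in filter_tests):
        return "discarded"
    return "processed"
```

Only rejected records are written to the reject file; records that fail a filter are simply dropped.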
Use this section to activate break processing. Break processing is required to be able to use the aggregates in the output section. One or more fields can be specified; the input data must be sorted on the group by fields, unless the hash option is used. A break will occur when any of the group field values changes. The group by section must appear after the sort by section and before the output section. The numeric option will cause leading zeros to be stripped from the input field. Group by on calculated input fields is useful when the hash option is in use because the input does not need to be pre-sorted.
group by <input field name> [ numeric | decimal | string ] [, ...]
group by AAL, ACL numeric
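To show why the hash option removes the need for sorted input, and what the numeric option does to a key, here is a minimal Python sketch. The `group_key` and `hash_group_sum` names are hypothetical, and mapping an all-zero key to "0" is an assumption made for this example.

```python
def group_key(value, numeric=False):
    """Normalise a grouping key; numeric strips leading zeros."""
    if numeric:
        stripped = value.lstrip("0")
        return stripped or "0"  # assumption: an all-zero key becomes "0"
    return value


def hash_group_sum(records, key_field, value_field, numeric=False):
    """Sum value_field per key using a dict, so input order does not matter."""
    totals = {}
    for rec in records:
        key = group_key(rec[key_field], numeric)
        totals[key] = totals.get(key, 0) + rec[value_field]
    return totals
```

With `numeric=True`, the keys "007" and "7" fall into the same group even when their records are not adjacent in the input.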
This is where the output data stream format is specified. At least one output field must be defined here (unless the transfer option is specified). Each output field definition must end with a comma or new line (or both). Each field definition must begin with a type (numeric, decimal, string, date). The output field name can be the same as an input field name, unless the output field is a calculated field. Each output field name must be unique. This name will appear in the header record (if the header option is set). The aggregate expression must consist of at least the input field name.
The aggregates sum, min, max, avg, first, last, distinct, values_all, and values_uniq must be followed by an input field name. The aggregates count and flag must be followed by the * character. The aggregate serial must be followed by a number (indicating the serial number start).
A prefix of _ in the output field name causes that field to be transparent; these fields will not be output, their use is mainly for intermediate calculations. <input field name> can be any field declared in the input section, including calculated fields. This section is required unless the transfer option is specified.
output section <type> <output field name> <output expression> [, ...]
<type>
    numeric, decimal, string, date [ (<datefmt>) ]
<output field name>
    Each output field name must be unique. The output field name can be the same as the input field name, unless the output field is a calculated field. A _ prefix denotes a transparent field. Transparent fields will not be output; they are used for intermediate calculations.
<datefmt>
    YYYYMMDD, YYMMDD, DDMMYY, DDMMMYY, DDMMYYYY, DD/MM/YY, DD/MM/YYYY, MMDDYY, MMDDYYYY, MM/DD/YY, MM/DD/YYYY
<output expression>
    <input field name>
    <aggregate> <input field name> [ where <condition expression> ]
    serial <start num> [ where <condition expression> ]
    count * [ where <condition expression> ]
    flag * [ where <condition expression> ]
    = <calculation expression> [ where <condition expression> ]
<aggregate>
    sum | maximum | max | minimum | min | avg | mean | first | last | distinct
    | sum_distinct | avg_distinct | count_distinct
    | median | variance | stddev | range | mode
    | values_all [ (<delim>) ] | values_uniq [ (<delim>) ]
<input field name>
    Any field specified in the input section.
<calculation expression>
    Any valid Perl expression, including input and output field names, and Pequel macros. This expression can consist of numeric calculations, using arithmetic operators (+, *, -, etc.) and functions (abs, int, rand, sqrt, etc.), and string calculations, using string operators (e.g. . for concatenation) and functions (uc, lc, substr, length, etc.).
<condition expression>
    Any valid Perl expression, including input and output field names, and Pequel macros, that evaluates to true (non-zero) or false (zero).
sum <input field name>
    Accumulate the total for all values in the group. Output type must be numeric, decimal or date.
sum_distinct <input field name>
    Accumulate the total for distinct values only in the group. Output type must be numeric, decimal or date.
maximum | max <input field name>
    Output the maximum value in the group. Output type must be numeric, decimal or date.
minimum | min <input field name>
    Output the minimum value in the group. Output type must be numeric, decimal or date.
avg | mean <input field name>
    Output the average value in the group. Output type must be numeric, decimal or date.
avg_distinct <input field name>
    Output the average value for distinct values only in the group. Output type must be numeric, decimal or date.
first <input field name>
    Output the first value in the group.
last <input field name>
    Output the last value in the group.
distinct | count_distinct <input field name>
    Output the count of unique values in the group. Output type must be numeric.
median <input field name>
    The median is the middle of a distribution: half the values are above the median and half are below it. When there is an odd number of values, the median is simply the middle value. When there is an even number of values, the median is the mean of the two middle values. Output type must be numeric.
variance <input field name>
    Variance is calculated as follows: (sum_squares / count) - (mean ** 2), where sum_squares is the sum of the squared values in the distribution (each value ** 2); count is the number of values in the distribution; and mean is the average described above. Output type must be numeric.
stddev <input field name>
    Stddev is calculated as the square root of variance. Output type must be numeric.
range <input field name>
    The range is the maximum value minus the minimum value in a distribution. Output type must be numeric.
mode <input field name>
    The mode is the most frequently occurring value in a distribution and is used as a measure of central tendency. A distribution may have more than one mode, in which case a space-delimited list is returned. Any output type is valid.
values_all [ (<delim>) ] <input field name>
    Output the list of all values in the group. The specified delimiter delimits the list. If not specified then the default_list_delimiter specified in options is used.
values_uniq [ (<delim>) ] <input field name>
    Output the list of unique values in the group. The specified delimiter delimits the list. If not specified then the default_list_delimiter specified in options is used.
serial <start num>
    Output the next serial number starting from the given start number. The serial number will be incremented by one for each successive output record. Output type must be numeric.
count *
    Output the count of records in the group. Output type must be numeric.
flag *
    Output 1 or 0 depending on the result of the where condition clause. If no where clause is specified then the output value is set to 1. Otherwise the output is set to 1 if the where condition evaluates to true for at least one record in the group. Output type must be numeric.
New in v2.5. Returns the coefficient of correlation of a set of number pairs.
New in v2.5. Returns the population covariance of a set of number pairs.
New in v2.5. Returns the sample covariance of a set of number pairs.
New in v2.5. Calculates the cumulative distribution of a value in a group of values.
New in v2.5. Computes the rank of a row in an ordered group of rows.
New in v2.5. Calculates the rank of a value in a group of values.
= <calculation expression>
    A calculation expression follows the = sign. Use this to create output fields that are based on some calculation expression. The calculation expression can consist of any valid Perl statement, and can contain input field names, output field names and macros.
output section
    numeric AAL AAL
    string _HELLO = 'HELLO'
    string _WORLD = 'WORLD'
    string HELLO_WORLD = _HELLO . ' ' . _WORLD
    decimal _REVENUE sum REVENUE
    decimal _DISCOUNT sum DISCOUNT
    decimal INVOICE = _REVENUE + _DISCOUNT
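The statistical aggregates described above (median, variance, stddev, range and mode) can be checked against a small Python sketch that follows the stated formulas directly. The function names are chosen for this example and are not Pequel's generated code; note that the variance formula given is the population variance.

```python
import math
from collections import Counter


def median(values):
    """Middle value, or the mean of the two middle values for an even count."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def variance(values):
    """(sum_squares / count) - (mean ** 2), as stated in the guide."""
    count = len(values)
    sum_squares = sum(v ** 2 for v in values)
    mean = sum(values) / count
    return (sum_squares / count) - (mean ** 2)


def stddev(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)


def modes(values):
    """Most frequent value(s), joined with spaces when there is a tie."""
    counts = Counter(values)
    top = max(counts.values())
    return " ".join(str(v) for v in counts if counts[v] == top)
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the sum of squares is 232, so the variance is 232 / 8 - 25 = 4 and the standard deviation is 2.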
The having section is applied after the grouping performed by group by, for filtering groups based on the aggregate values. Break processing must be activated using the group by section. The having section must appear after the output section. Specify one or more filter expressions. A filter expression can consist of any valid Perl statement, and must evaluate to Boolean true or false. It can contain input field names, output field names and macros. Only groups that evaluate to true on all filter statements will be output; that is, groups that evaluate to false on any one filter statement will be discarded. Each filter statement must end with a comma and/or new line.
having <filter expression> [, ...]
having
    SAMPLE == 1
    MONTH_1_COUNT > 2 and MONTH_2_COUNT > 2
This section contains any Perl code and will be executed once after all input records have been processed. Input field names, output field names, and macros can be used here. This section is mostly relevant when group by is omitted, so that a group all is in effect. The suppress_output option should also be used. If the script contains a group by section and more than one group of records is produced, only the last group's values will appear in the summary section.
summary section < Perl code >
summary section
    print "*** Summary Report ***";
    print "Total number of Products: ", sprintf("%12d", COUNT_PRODUCTS);
    print "Total number of Locations: ", sprintf("%12d", COUNT_LOCATIONS);
    print "*** End of report ***";
Process Outline:
    Open Input Stream
    Load/Connect Tables
    Read Next Input Record
    Output Aggregated Record If Grouping Key Changes
    Calculate Derived Input Fields
    Perform Aggregations
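The process outline corresponds to a classic break-processing loop over sorted input. The Python sketch below is a simplified model with a single sum aggregate, not the code Pequel generates. The `process` function, the TOTAL output field, and the `derive` and `having` callbacks are hypothetical names introduced for this illustration.

```python
def process(records, key_field, value_field, derive=None, having=None):
    """Sum value_field per key_field over input already sorted by the key."""
    output = []
    current_key = None
    total = 0
    seen_any = False

    def emit():
        # Output the aggregated record, subject to the optional having test.
        row = {key_field: current_key, "TOTAL": total}
        if having is None or having(row):
            output.append(row)

    for rec in records:  # read next input record
        if seen_any and rec[key_field] != current_key:
            emit()  # grouping key changed: output the finished group
            total = 0
        current_key = rec[key_field]
        seen_any = True
        if derive is not None:
            rec = derive(rec)  # calculate derived input fields
        total += rec[value_field]  # perform aggregations
    if seen_any:
        emit()  # flush the final group at end of input
    return output
```

Because a break is detected only when the key differs from the previous record, unsorted input would split one logical group into several output records, which is why sorted input (or the hash option) is required.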