NAME

Text::Parser::Manual::ExtendedAWKSyntax - A manual for ExAWK (extended AWK) syntax

VERSION

version 0.927

THE EXTENDED AWK LANGUAGE

So you saw the power of Text::Parser and want to write your own. First you need to learn something about the rules.

Why extend?

The AWK programming language does give us the flexibility to do a number of things. But it is limited in many respects. Below is a list of things that come to my mind:

  • AWK's regular expressions are limited. Perl is superior here and we want to leverage that.

  • You can't create deep data-structures (multi-dimensional arrays/hashes) in AWK. You can't create objects and classes.

  • Every rule will be tested and executed. It would be nice to control whether the next rule would be executed.

  • The UNIX version of AWK has only nine field identifiers $1 through $9. GAWK and other implementations remove this limitation.

  • AWK has a limited set of built-in functions.

AWK itself cannot be used for much more than reading text files and processing them. It is not really useful for a more complex program.

Why AWK?

Despite its limitations, AWK is excellent for parsing and processing text input. And despite the fact that Perl is supposed to allow us to do something more advanced, parsing text files should be as easy as it is with AWK. So instead of re-inventing the wheel... (you get the point?).

BASIC SYNTAX

The basic syntax of the AWK program is:

    condition { task; }

If condition is specified, then the task block is optional, and if the task block is specified, then the condition is optional.

The basic form of the ExAWK rule is like this:

    if => 'condition', do => 'task'
            ## options: 
            ##   dont_record => 0|1
            ##   continue_to_next => 0|1

These are normally supplied as arguments to add_rule, BEGIN_rule, and END_rule.

Similar to AWK, if the condition is specified, then the task is optional, and if the task is specified, then the condition is optional. The language for the condition and task is Perl (not AWK). So, for example, to compare strings you should use eq, and not == as you would in AWK. The 'if' and 'do' strings are transformed into regular Perl and compiled.
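
Because conditions are Perl, the choice of comparison operator matters. The following is a plain-Perl sketch, independent of Text::Parser, with variable names made up for illustration. It shows why == is the wrong operator for comparing strings:

```perl
use strict;
use warnings;

my $field = 'MARKER:';

# eq compares as strings, which is what a condition such as
# '$1 eq "MARKER:"' needs.
my $string_match = ($field eq 'MARKER:') ? 1 : 0;

# == compares as numbers. Both non-numeric strings convert to 0,
# so two unrelated strings compare equal (and Perl would warn).
my $numeric_match;
{
    no warnings 'numeric';
    $numeric_match = ($field == 'EMAIL:') ? 1 : 0;
}

print "eq: $string_match, ==: $numeric_match\n";
```

Both comparisons report a match here, although only the first one is meaningful.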

Simplicity

Just as in AWK, the condition can be as simple as a regular expression or as complex as a boolean expression. In AWK you could do:

    $ awk '/EMAIL:/ {print $2}' file.txt

In ExAWK you could write something like this to get something equivalent (Note the need for "\n"):

    if => 'm/EMAIL:/', do => 'print $2, "\n"'

The default condition in ExAWK is just like in AWK: true for each input line. The following will simply print every line in a file:

    do => 'print'

The default task in AWK is print. Thus:

    $ awk '/li/' file.txt

will print all the lines with 'li' somewhere in them. But ExAWK integrates with the Text::Parser class, so its default task is return $0; instead. This means that if you provide a condition but not a task, the default is to return the line as it is.

    if => 'm/li/'  # returns each line that contains 'li' in it.

If you want to print instead and not record anything, you need to specify that:

    if => 'm/li/', do => 'print', dont_record => 1

In ExAWK, the Perl built-in variable $_ is set to the current line. So any built-in function that defaults to $_ when its argument is omitted will behave accordingly (this is how if => 'm/li/' and do => 'print' happen to work).
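
The reliance on $_ can be seen in ordinary Perl as well. This is only a sketch of the Perl behaviour that the rules build on, not of how ExAWK itself is implemented, and the sample lines are invented:

```perl
use strict;
use warnings;

my @matched;
for ('a line', 'a list', 'other') {    # for aliases each element to $_
    push @matched, $_ if m/li/;        # m/li/ is matched against $_
}
print "$_\n" for @matched;             # print also works on $_
```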

Field identifiers

AWK is very popular for its intuitive field identifiers $1, $2, $3 etc. ExAWK provides the same and much more.

It is important to note that $1, $2, etc., are positional field identifiers, not ordinary variables. In ExAWK they represent an rvalue and cannot be modified. AWK does accept an assignment to a field, but that changes only the in-memory copy of the record. So for example

    $ awk '// {$1 = "something";}' file.txt

prints nothing and leaves file.txt untouched. The first field in each line of the file remains what it is.

ExAWK identifiers $1, $2 etc., are not variables at all. In particular they are not the same as the native Perl regular expression capture variables $1, $2 etc., which hold the groups captured by a regexp match.

The positional field identifiers $1, $2 etc. have special meaning inside the string expressions of ExAWK. Like AWK, $1 represents the first field, $2 represents the second field, and so on. Like AWK, $0 identifies the whole line.

Reverse field identifiers

Now we add new features that really go beyond AWK. To access fields from the end of the line, use identifiers ${-1}, ${-2}, etc. ${-1} is the last field, ${-2} is the penultimate field, and so forth.
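
The meaning of the reverse identifiers mirrors Perl's negative array indices. The sketch below is plain Perl on a made-up input line. It illustrates the intended meaning only and assumes whitespace-separated fields:

```perl
use strict;
use warnings;

# Hypothetical input line, split on whitespace as AWK does by default.
my $line   = 'alpha beta gamma delta';
my @fields = split ' ', $line;

my $last        = $fields[-1];    # the field that ${-1} refers to
my $penultimate = $fields[-2];    # the field that ${-2} refers to

print "$last $penultimate\n";
```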

Automatic checks for NF

You don't need to bother about the existence of a field when you write these expressions. For example, in AWK if you write:

    $ awk '$4 == ""' text.txt

then all lines with three or fewer fields will automatically be printed to the screen, because $4 evaluates to an empty string when there are fewer than four fields on a line. But in ExAWK:

    if => '$4 eq ""'

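would never be true, because using $4 adds an automatic check that the line has at least four fields, and a whitespace-separated field that exists is never empty. Now, if you had written a rule like this in AWK: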

    $ awk '$1 == "MIDDLE" && $2 == "NAME:" {print toupper($3)}' file.txt

you might get an empty line for each person who has no middle name.

Instead, in ExAWK, the following rule:

    if => '$1 eq "MIDDLE" and $2 eq "NAME:"', do => 'return uc($3)'

would automatically check that there are at least three fields on the line. This means it will never return anything in case of people with no middle names. This ensures you don't run into undef.
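
In plain Perl, the effect of that rule can be pictured roughly as follows. The middle_name helper and the sample lines are hypothetical, and the code emulates the described behaviour rather than reproducing what ExAWK generates:

```perl
use strict;
use warnings;

# Hypothetical helper emulating the rule
#   if => '$1 eq "MIDDLE" and $2 eq "NAME:"', do => 'return uc($3)'
# together with its implied "at least three fields" pre-condition.
sub middle_name {
    my ($line) = @_;
    my @f = split ' ', $line;
    return unless @f >= 3;                                    # $3 must exist
    return unless ($f[0] eq 'MIDDLE' and $f[1] eq 'NAME:');   # the condition
    return uc $f[2];                                          # the task
}

my @records = map { middle_name($_) } (
    'MIDDLE NAME: quincy',
    'MIDDLE NAME:',          # no middle name: contributes nothing
    'FIRST NAME: john',
);
print "@records\n";
```

Only the first line yields a record. The line without a middle name is skipped by the field-count check before uc is ever called.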

The $this variable

Sometimes you want to access specific attributes of your parser class, or maybe you want to call a method. The $this variable is accessible in both the condition and the task strings.

Important Note: This is a real variable. If you assign to $this, its value will change, so don't assign to it. If you save $this to another variable in the hope of restoring it later, remember that all positional field identifiers and range shortcuts are entirely dependent on the $this variable. If this variable is tampered with, you could get garbage results. You have been forewarned.

Local variables

You can use any Perl local variables you want. For example:

    do => 'my (@numbers) = ${3+}; # do something with @numbers'

Note that @numbers above is accessible only within that rule task. It is not accessible outside of that do string.

Use any variable other than $this.

Shared variables

If you want to create variables that are initialized or assigned in one rule, but accessed in another rule, you need to use a "shared variable". A shared variable can be a scalar, or a hash reference, or an array reference. It cannot be a hash or an array itself. All shared variables must begin with the tilde (~) character, whether scalar, arrayref, or hashref, and the name that follows the tilde must begin with a letter or underscore (_).

    if => '$1 eq "MARKER:"', do => '~info = $2;'

In the above rule, ~info is a shared variable, and will be accessible in other rules.

All shared variables created during the parsing of a text input exist only for the duration of the read method call. They are not accessible outside the Text::Parser class.
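
The scoping described here can be pictured with a plain-Perl emulation. Everything in this sketch is an assumption made for illustration: read_lines stands in for one read call, %shared for the store of shared variables, and the ITEM: lines are invented. It does not show how Text::Parser implements shared variables:

```perl
use strict;
use warnings;

# Hypothetical stand-in for one read() call: %shared is created when
# reading starts and disappears when the sub returns.
sub read_lines {
    my @lines = @_;
    my (%shared, @records);
    for my $line (@lines) {
        my @f = split ' ', $line;
        next unless @f >= 2;
        if ($f[0] eq 'MARKER:') {
            $shared{info} = $f[1];                   # like: ~info = $2
        }
        elsif ($f[0] eq 'ITEM:' and defined $shared{info}) {
            push @records, "$shared{info}/$f[1]";    # a second rule reads ~info
        }
    }
    return @records;
}

my @records = read_lines('MARKER: fruit', 'ITEM: apple', 'ITEM: pear');
print "@records\n";
```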

Suite of string and array utility functions

Perl already has more built-in functions than AWK, and they are more useful than their AWK counterparts. In addition, CPAN has a lot of great modules with utility functions. ExAWK adds a few good utility functions, and also makes it very easy to add any others:

Utility functions added

I have kept this list small to minimize Text::Parser dependencies. The user can import whatever functions they want from the package of their choice.

How to add other utility functions

Suppose you know of a very useful package (fictitiously named) Useful::Package. And let's say it has functions foo and bar that are very useful and operate on strings. And you wish to use these in your rules. Then do the following in your code:

    use Import::Into;
    Useful::Package->import::into('Text::Parser::Rule', qw(foo bar));
    use Text::Parser;

    my $parser = Text::Parser->new();
    $parser->add_rule(if => 'bar($1)', do => 'return foo($2);');

This means that the power of any new package on CPAN can be harnessed very easily.

COMPLEX CONDITIONAL TREES

If you want to build a complex if/else-if/else tree of conditions in AWK, you need to write them inside one rule like this:

    // {
        if (condition1) {
            task1;
        } else if (condition2) {
            task2;
        } else {
            task3;
        }
        if (condition3) {
            task4;
        }
        if(condition4) {
            task5;
        } else {
            task6;
        }
    }

With the pair of options dont_record and continue_to_next, one can build rules that replace any number of complex, cascaded if-elsif-else blocks while still keeping most of them in an elegant single-line form.

    if => 'condition1', do => 'task1';
    if => 'condition2', do => 'task2';
    if => 1, do => 'task3', dont_record => 1, continue_to_next => 1;
    if => 'condition3', do => 'task4', dont_record => 1, continue_to_next => 1;
    if => 'condition4', do => 'task5';
    if => 1, do => 'task6';

Not only are these rules compact, it is also possible to understand the execution flow.

SUMMARY

  • AWK cannot store very complex data structures. ExAWK can.

  • In the UNIX implementation of AWK, the positional variables are limited to $9. POSIX implementations have already removed this limitation. ExAWK removes this limit too.

  • In AWK there are no positional variables for positions counted from the end. In ExAWK, you have ${-1}, ${-2}, ${-3} etc.

  • In AWK if you use a positional variable like $8 when there are only 7 fields on a line, it evaluates to empty string. In ExAWK, if you use $8 in any of the strings, an automatic pre-condition is generated to check that there must be at least 8 fields on the input line.

BUGS

Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

Balaji Ramasubramanian <balajiram@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.