Text::Parser::Manual::ExtendedAWKSyntax - The ExAWK (extended AWK) syntax itself
version 1.000
So you want to get started with writing your own parser based on Text::Parser? The best place to start is to learn how to write parsing rules.
In this chapter, we only describe the ExAWK language syntax and features. But this is the main part. In fact, the remaining things are extremely simple. And when you see how intuitive the rules are, you'll wonder why this was not part of native Perl.
Parsing rules may be specified using the add_rule, BEGIN_rule, and END_rule methods. Alternatively, you may sub-class Text::Parser, and use the applies_rule syntax sugar from Text::Parser::RuleSpec. In either case, the syntax and form of the rules is what determines what your text parser does.
The basic form of the ExAWK rule is like this:
    if => 'condition',
    do => 'action'
    ## options:
    ## dont_record => 0|1
    ## continue_to_next => 0|1
AWK programmers would recognize that it is similar to the basic syntax of AWK:
condition { action; }
Pay attention to the single quotes in the value of the if and do keys. This is important as you'll see below.
Just as in AWK the condition can be as simple as a regular expression, or a complex boolean expression. In AWK you could do:
$ awk '/EMAIL:/ {print $2}' file.txt
In ExAWK you could write something like this to get an equivalent result (note the need for "\n"):
$parser->add_rule( if => 'm/EMAIL:/', do => 'print $2, "\n"' );
Similar to AWK, in the Extended AWK language too, if the condition is specified, then the action is optional, and if the action is specified, then the condition is optional. The default condition in ExAWK is just like in AWK: true for each input line. So the following will simply print every line in a file:
    $parser->add_rule(
        do => 'print'    # The default 'if' is true for all lines
    );
Or else:
    $parser->add_rule(
        if => 'm/^\d+/'    # The default 'do' stores the whole line.
    );
AWK is very popular for its intuitive field identifiers $1, $2, $3 etc. ExAWK provides the same and much more.
In AWK, $0 identifies the whole line. The same is true in ExAWK. But in addition, the Perl built-in variable $_ also contains the line. So any built-in functions that take a missing parameter to be $_ will behave accordingly. This is how a terse rule like do => 'print' happens to work. The important difference between $0 and $_ is that $0 is just an identifier (i.e., you cannot change its value), whereas $_ is an actual variable which you can change. Changing $_ does not change the current line or the value of $0.
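Similarly, other positional field identifiers like $1, $2, etc., are not variables in ExAWK. They are just positional field identifiers. They represent an Rvalue and cannot be modified. (In AWK, assigning to a field changes it only inside the running program; the input file is never modified.) So for example
$ awk '// {$1 = "something";}' file.txt
will not change anything in file.txt. In the same way, the following rule does not modify the first field of the line: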
$parser->add_rule(do => '$1 = "something";');
In particular, they are not the same as the native Perl regular expression capture variables $1, $2, etc., which are used in regexp matches and substitutions.
The positional field identifiers $1, $2 etc. mean something else inside the string expressions of ExAWK. Like AWK, $1 represents the first field, $2 represents the second field, and so on.
Note: In the UNIX implementation of AWK, the positional identifiers are limited to $9. In POSIX implementations this limitation has already been removed. In ExAWK also there is no limit to the number of positional identifiers.
You should always use single quotes ('') for your rule strings, and not double quotes (""). This is because variables introduced by sigils like $ get interpolated inside double quotes, and $1, $2, etc., have no useful value in your main code (there, $0 is the name of the running program, not the current line).
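The effect of the quoting style can be seen in plain Perl, without the parser. This is only a minimal sketch: the variable names $single and $double are illustrative, and the strings are printed rather than passed to add_rule.

```perl
use strict;
use warnings;
no warnings 'uninitialized';    # $2 is undef in the main program

# Single quotes: the string reaches the rule compiler untouched.
my $single = 'print $2, "\n"';

# Double quotes: Perl interpolates $2 (undef here) before the parser sees it.
my $double = "print $2, \"\\n\"";

print "single-quoted rule: $single\n";    # print $2, "\n"
print "double-quoted rule: $double\n";    # print , "\n"
```

With double quotes the field identifier has already vanished from the string, so the rule could never refer to the second field.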
The first difference from AWK is the language. ExAWK condition and action strings are Perl. So for example, to compare strings you should use eq and not == like you would in AWK. The condition and action strings are transformed into regular Perl and compiled. So if they fail to compile, the add_rule method will throw an exception.
In AWK, each rule is run for each line, even if the condition for a previous rule may be true. If the condition of a rule is true, the action is performed.
But in ExAWK, rules are executed until the condition of one rule is true. By default, the execution of further rules stops at that point. For example:
$parser->add_rule(if => '$1 =~ /^[#]/', dont_record => 1);
In this case, the moment a line beginning with # is encountered, it is ignored, and no other rules are executed. But see the following code:
    $parser->add_rule(if => '$1 =~ /^[#][!]/', continue_to_next => 1);
    $parser->add_rule(
        if => '$1 =~ /perl$/',
        do => '$this->abort_reading; print "This is a perl script.\n";'
    );
    $parser->add_rule(
        if => '$1 =~ /bash$/',
        do => '$this->abort_reading; print "This is a bash script.\n";'
    );
    $parser->add_rule(
        if          => '$1 =~ /^[#]/ or $this->lines_parsed > 0',
        do          => '$this->abort_reading; print "Neither perl nor bash.\n";',
        dont_record => 1
    );
Now if a file starts with #!, the condition for the first rule is met, and because of the continue_to_next option the parser immediately tests the next rule. If that condition is met, it will abort reading at that point and print the message. But if the condition for the second rule is not met, it will test the third rule. If that condition is also not met, then it will test the fourth rule. It will surely meet the condition for the fourth rule (we could have skipped it), and will execute that rule. At this point it will stop because there is no continue_to_next option.
So in this way we can control the execution sequence.
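The control flow described above can be modelled in a few lines of plain Perl. This is only a sketch of the behaviour, not the Text::Parser implementation: the @rules structure, the run_rules function, and the returned labels are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rules: a condition, an action, and an optional flag.
my @rules = (
    { if => sub { $_[0] =~ /^#!/ },   do => sub { 'shebang' }, continue_to_next => 1 },
    { if => sub { $_[0] =~ /perl$/ }, do => sub { 'perl' } },
    { if => sub { $_[0] =~ /bash$/ }, do => sub { 'bash' } },
);

# Test rules in order; stop after the first rule whose condition is true,
# unless that rule asks to continue to the next one.
sub run_rules {
    my ($line) = @_;
    my @fired;
    for my $rule (@rules) {
        next unless $rule->{if}->($line);
        push @fired, $rule->{do}->($line);
        last unless $rule->{continue_to_next};
    }
    return @fired;
}

my @perl_script = run_rules('#!/usr/bin/perl');    # first and second rule run
my @bash_script = run_rules('#!/bin/bash');        # first and third rule run
my @plain_line  = run_rules('echo bash');          # only the third rule runs

print join(', ', @perl_script), "\n";
```

A rule whose condition is false is simply skipped; only a true condition without continue_to_next ends the sequence for that line.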
We saw that only one of condition or action is required; the other may be omitted. The default condition is the same as in AWK, but the default action is different. The default action in AWK is print. Thus:
$ awk '/li/' file.txt
will print all the lines with 'li' somewhere in them. But in ExAWK, since it integrates with the Text::Parser class, the default action is to return the whole line.
    if => 'm/li/'    # returns each line containing 'li', to the parser
                     # The parser then saves it as a record,
                     # unless dont_record is true
If you want to print instead and not record anything, you need to specify that:
if => 'm/li/', do => 'print', dont_record => 1
To access fields from the end of the line, use identifiers ${-1}, ${-2}, etc. ${-1} is the last field, ${-2} is the penultimate field, and so forth.
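Negative identifiers behave much like negative array indices in Perl. The following plain-Perl sketch assumes that fields are produced by splitting the line on whitespace; the sample line and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line   = 'alpha beta gamma delta';
my @fields = split ' ', $line;    # assumed whitespace field separator

my $last        = $fields[-1];    # the field that ${-1} refers to
my $penultimate = $fields[-2];    # the field that ${-2} refers to

print "$last $penultimate\n";     # delta gamma
```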
Sometimes, we want to access all the fields starting from the 2nd, or 3rd, leaving all the earlier ones. So we have a set of shortcuts to do all that. Below are some shortcut examples:
    SHORTCUT   CODE EQUIVALENT                  MEANING
    ========   ===============                  =======
    ${2+}      $this->join_range(1, -1)         Everything from second field as a string.
                                                Spaces will be collapsed to one space.
    @{3+}      $this->field_range(2, -1)        Everything from third field as an array.
    \@{2+}     [ $this->field_range(1, -1) ]    Arrayref containing everything from second field.
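What the range shortcuts evaluate to can be sketched with ordinary array slices. This is an analogy in plain Perl, assuming whitespace-separated fields; it does not call join_range or field_range, and the sample line is invented.

```perl
use strict;
use warnings;

my @f = split ' ', 'SECTION:   intro   to   the   topic';

my $from_second     = join ' ', @f[ 1 .. $#f ];    # like ${2+}: one string, single spaces
my @from_third      = @f[ 2 .. $#f ];              # like @{3+}: a list of fields
my $from_second_ref = [ @f[ 1 .. $#f ] ];          # like \@{2+}: an array reference

print "$from_second\n";                  # intro to the topic
print scalar(@from_third), "\n";         # 3
```

Note that the shortcut counts fields from 1, while the slice indices count from 0.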
In AWK if you write:
$ awk '{print $4;}' text.txt
then all lines with 3 or fewer fields will print a blank line to the screen, because $4 evaluates to an empty string when there are fewer than 4 fields on a line. To ensure you take only lines with at least 4 fields, you need to do:
$ awk 'NF>=4 {print $4;}' text.txt
But in ExAWK this is unnecessary. So the rule:
do => 'return $4;'
automatically sets up a pre-condition for the number of fields NF and ensures that each line being read has at least 4 fields. This works even for negative positional indicators. So the rule:
do => 'return ${-4};'
likewise applies only to lines with at least 4 fields. This ensures you don't run into undef records being saved in the parser.
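The effect of this implicit pre-condition can be imitated in plain Perl. The sketch below is only a model of the behaviour, assuming whitespace-separated fields; the sample lines and the @records array are hypothetical.

```perl
use strict;
use warnings;

my @lines = ( 'a b c', 'a b c d', 'a b c d e' );
my @records;

for my $line (@lines) {
    my @f = split ' ', $line;
    next if @f < 4;          # the implied pre-condition: at least 4 fields
    push @records, $f[3];    # what a rule like 'return $4;' would record
}

print scalar(@records), " records\n";    # 2 records
```

The three-field line contributes nothing, so no undefined value ends up among the records.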
You can use any Perl local variables you want. For example:
do => 'my (@numbers) = @{3+};'
Note that @numbers above is accessible only within that rule action. It is not accessible outside of that do string.
You may use any variable name other than $this, which refers to the parser object.
Perl already has many built-in functions that are very useful and better than their AWK counterparts. In addition, CPAN has a lot of great modules with utility functions. ExAWK gives the programmer a few good utility functions, and also makes it very easy to add any others:
Scalar::Util : blessed, looks_like_number
String::Util : All functions here
List::Util : The following functions: reduce, any, all, none, notall, first, max, maxstr, min, minstr, product, sum, sum0, pairs, unpairs, pairkeys, pairvalues, pairfirst, pairgrep, pairmap, shuffle, uniq, uniqnum, and uniqstr
I have kept this list small to minimize Text::Parser dependencies. The user can import whatever functions they want from the package of their choice.
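As a small illustration of the kind of work these functions do on fields, here is a plain-Perl sketch using sum and max from List::Util and looks_like_number from Scalar::Util. It runs outside the parser; the sample line and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use List::Util   qw(sum max);
use Scalar::Util qw(looks_like_number);

my @fields = split ' ', 'TOTALS: 3 4 x 5';

# Keep only the numeric fields after the first one.
my @numbers = grep { looks_like_number($_) } @fields[ 1 .. $#fields ];

my $total   = sum(@numbers);
my $largest = max(@numbers);

print "total=$total largest=$largest\n";    # total=12 largest=5
```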
Suppose you know of a very useful package (fictitiously named) Useful::Package. And let's say it has functions foo and bar that are very useful and operate on strings. And you wish to use these in your rules. Then do the following in your code:
    use Import::Into;
    Useful::Package->import::into('Text::Parser::Rule', qw(foo bar));

    use Text::Parser;

    my $parser = Text::Parser->new();
    $parser->add_rule(if => 'bar($2)', do => 'return foo($1).foo($2);');
This means that the power of any new package on CPAN can be harnessed very easily.
If you could use the parser object itself to store data, this would open up many possibilities. And that is precisely what this section is about.
To access the parser object, you can use $this inside the rule strings. Remember again, that the rule strings should be in single quotes (''). Here is an example rule using the $this variable to refer to the parser:
    $parser->add_rule(
        if          => '$this->lines_parsed > 10',
        do          => '$this->abort_reading;',
        dont_record => 1,
    );
$this is a real variable. If you modify its value, it will change. So be careful what you do with $this. If you save $this to another variable in the hope that you can retrieve it later, remember that all positional field indicators and range shortcuts are entirely dependent on the $this variable. If this variable is tampered with, you could get garbage results. You have been forewarned.
The idea of storing data in the $this variable is the obvious next step. You get what are called "stashed variables". Internally these variables are just stored in a hash, but it is as if you are stashing away useful data. You can store any scalar, hashref or arrayref. Note that you can't store an array or hash itself, only an arrayref or hashref.
Stashed variables begin with the tilde (~) character, followed by a letter or an underscore (_). So for example:
if => '$1 eq "MARKER:"', do => '~info = $2; ~_secret = $3;'
In the above rule, ~info and ~_secret are stashed variables, which are accessible in other rules. You can set stashed variables in a BEGIN_rule, and access them in later rules. Here is an obvious example:
    my $parser = Text::Parser->new();
    $parser->BEGIN_rule( do => '~count = 0;' );
    $parser->add_rule(
        if => '$1 eq "ERROR:"',
        do => '~count++;'
    );
    $parser->read('/path/to/logfile.log');
    print "Found ", $parser->stashed('count'), " errors in your logfile\n";
You may discard a stashed variable with the forget method, and it will be lost forever. Or you can simply clear the whole stash of variables by using the clear_stash method.
All stashed variables are forgotten right before read starts reading the input. So you have a clean stash each time you call read.
    $parser->read('/another/logfile.log');
    print "Whereas, the other one had ", $parser->stashed('count'), " errors\n";
You can also have pre-stashed variables that persist across multiple read method calls. Read more about that in the Text::Parser documentation.
When you sub-class Text::Parser you can get some very powerful features. You can do something like this:
    package MyClass::Parser;

    use Moose;
    extends 'Text::Parser';
    use Text::Parser::RuleSpec;

    # Name of the section being read; 'rw' because the rules below set it.
    # A lazy attribute needs a default.
    has section => (
        is      => 'rw',
        isa     => 'Str',
        lazy    => 1,
        default => '',
    );

    # Section name => list of IDs. The Hash trait provides the delegations.
    has ids => (
        is      => 'ro',
        isa     => 'HashRef[ArrayRef[Str]]',
        traits  => ['Hash'],
        default => sub { return {}; },
        lazy    => 1,
        handles => {
            get_section => 'get',
            set_section => 'set',
            has_section => 'exists',
        },
    );

    sub add_id {
        my $self = shift;
        my $ids  = $self->get_section($self->section);
        push @{$ids}, shift;
    }

    applies_rule find_section => (
        if => '$1 eq "SECTION:"',
        do => '$this->section($2); $this->set_section($2 => []);',
    );

    applies_rule name_in_section => (
        if => '$1 eq "ID"',
        do => '$this->add_id($2);'
    );
You can see a lot more examples in Text::Parser::RuleSpec.
This allows you to write your own parser. In fact, because you can now use inheritance to create a sub-class, you can sub-class that sub-class also, thereby making a variant of a given parser.
ExAWK rules are written in Perl. ExAWK includes a whole arsenal of utility functions, and also gives you the power to use functions from any desired CPAN package.
Execution of rules can be controlled and not all rules need to be run.
In ExAWK, you may use identifiers like ${-1}, ${2+}, @{3+} etc. Using any positional identifier automatically adds a condition that tests if the line has the minimum number of fields required.
You can use regular Perl variables inside rules, or you may use stashed variables.
You can sub-class Text::Parser to make your own parser class. And then you can sub-class that further to re-use your code and create multiple variants.
Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
Balaji Ramasubramanian <balajiram@cpan.org>
This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.