=head1 NAME

Lexing Without Grammars:  When Regular Expressions Suck

=head1 ABSTRACT

Perl is famed for its text processing capabilities.  However, sometimes the
data you want to process is too complicated for regular expressions and you
reach for a parser for your HTML, RTF, or other common format.  This article
discusses what to do when you don't have a pre-defined parser but the text
you need to work with is too complicated for regular expressions.

=head1 ORIGINAL PUBLICATION

This article was originally published by O'Reilly in January, 2006 at
L<http://www.perl.com/pub/a/2006/01/05/parsing.html>.  Reproduced with
permission.

=head1 C<s/(?<!SHOOTING YOURSELF IN THE )FOOT/HEAD/g>

Most of us have tried at one time or another to use regular expressions to do
things we shouldn't.  Parse HTML, obfuscate code, wash dishes, etc.  This is
referred to by the technical term "showing off".  I've done it too:

 $html =~ s{
              (<a\s(?:[^>](?!href))*href\s*)
              (&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
              ([^>]+>)
           }
           {$1 . decode_entities($2) .  $4}gsexi;

I was strutting like a peacock when I wrote that, followed quickly by eating
crow when I ran it.  I never did get that working right.  I'm still not sure
what I was trying to do.  That was the regular expression which forced me to
learn how to use C<HTML::TokeParser>.  More importantly, that was the regular
expression which taught me how difficult regular expressions can be.

=head2 The Problem with Regular Expressions

Let's look at that regex again:

 /(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)/

Do you know what that matches?  Exactly?  Are you I<sure>?  Even if it works, how
easily can you modify it?

If you don't know what it was trying to do (and to be fair, don't forget it's
broken), how long did you spend trying to figure it out?  When's the last time
a single line of code gave you such fits?

The problem, of course, is that this regular expression is trying to do far
more work than a single line of code is likely to do.  When faced with a
regular expression like that, there are a few things I like to do.

=over 4

=item * Document them carefully.

=item * Use the /x modifier so I can expand them over several lines.

=item * Possibly encapsulate them in a subroutine.

=back

Sometimes, though, there's a fourth option:  lexing.

=head2 Lexing

When developing code, we typically take a problem and break it down into a
series of smaller problems which are easier to solve.  Regular expressions are
code and they too can be broken down into a series of smaller problems which
are easier to solve.  One technique is to use I<lexing> to facilitate this.

Lexing is the act of taking data, breaking it down into discrete tokens, and
assigning meaning to those tokens.  There's a bit of fudging in that
statement, but it pretty much covers the basics.

Lexing is typically followed by parsing, whereby the tokens are converted
into something more useful.  Parsing is frequently handled by a tool which
applies a well-defined grammar to the lexed tokens.

Sometimes well-defined grammars are not practical for extracting and reporting
information.  There might not be a grammar available for a company's ad-hoc
log file format.  Other times we might find it easier to process the tokens
manually than to spend the time writing a grammar.  And still other times we
are only interested in part of the data we've lexed, not all of it.  All
three of these reasons apply to the following problem.

=head2 Parsing SQL

Recently on PerlMonks (L<http://perlmonks.org/index.pl?node_id=472684>), someone
had the following SQL:

  select the_date as "date",
  round(months_between(first_date,second_date),0) months_old
  ,product,extract(year from the_date) year
  ,case
    when a=b then 'c'
    else 'd'
    end tough_one
  from ...
  where ...

What they needed from that SQL was the alias for each column.  In this case,
those would be "date", "months_old", "product", "year", and "tough_one".  Of
course, they mentioned that this was only one example.  There's actually
plenty of generated SQL, all with subtle variations on how the columns are
aliased so this is not a trivial task.  What's interesting about this, though,
is that we don't give a fig about anything except the column aliases.  The
rest of the text is merely there to help us find those aliases.

Our first thought might be to try to parse this with C<SQL::Statement>.  As
it turns out, that module does not handle C<CASE> statements.  Thus, we're
left trying to figure out how to patch C<SQL::Statement>, submitting said
patch, and hoping it gets accepted and released in a timely fashion.  (Note
that C<SQL::Statement> uses C<SQL::Parser>, so the latter is also not an
option.)

Second, a number of us have worked in environments where problems have to be
solved in production I<now> but we still have to wait three weeks for the
necessary modules to be installed, if they'll be approved at all.

The most important reason, though, is even if C<SQL::Statement> could handle
this problem, this would be an awfully short article if you used it instead of
a lexer.

=head2 Lexing Basics

As mentioned earlier, lexing is essentially the task of analyzing data and
breaking it down into a series of easy-to-use tokens.  While the data may be
in other forms, usually this means analyzing strings.  To give a trivial
example, consider the following:

 x = (3 + 2) / y

When lexed, we might get a series of tokens like the following:

 my @tokens = (
   [ VAR => 'x' ],
   [ OP  => '=' ],
   [ OP  => '(' ],
   [ INT => '3' ],
   [ OP  => '+' ],
   [ INT => '2' ],
   [ OP  => ')' ],
   [ OP  => '/' ],
   [ VAR => 'y' ],
 );

With a proper grammar, we could then read this series of tokens and take
actions based upon their values, such as building a simple language
interpreter or translating the code into another programming language.  Even
without a grammar, we can find these tokens useful, as we'll see with the SQL
example.
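
Even this flat list is easy to work with.  As a minimal sketch, here's one
way to pull every variable name out of the token stream above:

  # Walk the token stream and collect the value of each VAR token (sketch).
  my @variables = map  { $_->[1] }              # keep the value
                  grep { 'VAR' eq $_->[0] }     # keep only VAR tokens
                  @tokens;
  print "@variables\n";    # prints "x y"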

=head2 Identifying tokens

The first step in building a lexer is identifying the tokens you wish to
parse.  Let's take another look at the SQL.

  select the_date as "date",
  round(months_between(first_date,second_date),0) months_old
  ,product,extract(year from the_date) year
  ,case
    when a=b then 'c'
      else 'd'
    end tough_one
  from ...
  where ...

We really don't care about anything after the C<from> keyword.  Looking at
this more closely, we see that everything we do care about appears immediately
prior to a comma or the C<from> keyword.  However, splitting on commas isn't
enough, as we have some commas embedded in function parentheses.
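
To see why, consider what a naive split would do (a sketch; C<$sql> holds the
statement above):

  # Splitting on every comma breaks function calls apart (sketch).
  my @columns = split /,/, $sql;
  # One "column" comes out as: round(months_between(first_date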

The first thing we need to do is to identify the various things we can match
with simple regular expressions.

These "things" appear to be parentheses, commas, operators, keywords and
random text.  A first pass at it might look something like this:

  my $lparen  = qr/\(/;
  my $rparen  = qr/\)/;
  my $keyword = qr/(?i:select|from|as)/; # this is all this problem needs
  my $comma   = qr/,/;
  my $text    = qr/(?:\w+|'\w+'|"\w+")/;
  my $op      = qr{[-=+*/<>]};

The text matching is somewhat naive and we might want C<Regexp::Common> for
some of the regular expressions, but for now we'll keep this simple.
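
To see just how naive, here's a quick check (a sketch; the quoted alias
containing a space is hypothetical data):

  my $text = qr/(?:\w+|'\w+'|"\w+")/;
  print "ok\n"   if q{"date"}       =~ /\A$text\z/;   # matches
  print "oops\n" if q{"order date"} !~ /\A$text\z/;   # no match: \w+ has no room for spaces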

The operator rule is a bit more involved, as we'll assume that some SQL might
have math expressions embedded in it.

Then we create the actual lexer.  One way to do this is to roll our own.  It
might look something like this:

  sub lexer {
      my $sql = shift;
      return sub {
          LEXER: {
              # \G anchors each match where the previous one ended; the
              # /c modifier preserves that position when a match fails,
              # so the next alternative can try from the same spot.
              return ['KEYWORD', $1] if $sql =~ /\G ($keyword) /gcx;
              return ['COMMA',   ''] if $sql =~ /\G ($comma)   /gcx;
              return ['OP',      $1] if $sql =~ /\G ($op)      /gcx;
              return ['PAREN',    1] if $sql =~ /\G $lparen    /gcx;
              return ['PAREN',   -1] if $sql =~ /\G $rparen    /gcx;
              return ['TEXT',    $1] if $sql =~ /\G ($text)    /gcx;
              redo LEXER             if $sql =~ /\G \s+        /gcx;
          }
          return;    # nothing matched: end of input (or unlexable text)
      };
  }
  my $lexer = lexer($sql);
  while (defined (my $token = $lexer->())) {
      # do something with the token
  }

Without going into the details of how that works, it's fair to say that this
is not the best solution.  Looking at the original post this came from
(L<http://perlmonks.org/index.pl?node_id=472701>), we find that we need to
make two passes through the data to extract what we want.  Why this is the
case is left as an exercise for the reader.

To make this simpler, we're going to use the C<HOP::Lexer> module from the
CPAN.  This module, described by Mark Jason Dominus in his book "Higher-Order
Perl", makes creating lexers a rather trivial task and makes them a bit more
powerful than what we have above.  Here's our code:

  use HOP::Lexer 'make_lexer';
  my @sql = $sql;
  my $lexer = make_lexer(
      sub { shift @sql },
      [ 'KEYWORD', qr/(?i:select|from|as)/          ],
      [ 'COMMA',   qr/,/                            ],
      [ 'OP',      qr{[-=+*/<>]}                    ],
      [ 'PAREN',   qr/\(/,      sub { [shift,  1] } ],
      [ 'PAREN',   qr/\)/,      sub { [shift, -1] } ],
      [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
      [ 'SPACE',   qr/\s+/,     sub {}              ],
  );

  sub text {
      my ($label, $value) = @_;
      $value =~ s/^["']//;
      $value =~ s/["']$//;
      return [ $label, $value ];
  }

This certainly doesn't look any easier to read, but bear with me.

The C<make_lexer> subroutine takes as its first argument an iterator which
returns the text to match every time it's called.  In our case, we only have
one snippet of text to match, so we merely shift it off of an array.  If we
were reading lines from a log file, the iterator would be quite handy.
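
For instance, if the SQL lived in a log file, the iterator could hand
C<make_lexer> one line at a time (a sketch; the filename and the rules here
are hypothetical):

  open my $fh, '<', 'queries.log' or die "Can't open queries.log: $!";
  my $lexer = make_lexer(
      sub { scalar <$fh> },    # returns undef at EOF, ending the token stream
      [ 'WORD',  qr/\w+/          ],
      [ 'SPACE', qr/\s+/, sub {}  ],
  );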

After the first argument, we have a series of array references.  Each
reference takes two mandatory arguments and one optional argument:

  [ $label, $pattern, $optional_subroutine ]

The C<$label> will be used as the name of the token.  The pattern should match
whatever the label identifies.  The third argument, a subroutine reference,
takes as arguments the label and the I<text> the pattern matched, and returns
whatever you wish for a token.  We'll get to that in a moment.  First, let's
consider how we typically use the C<make_lexer> subroutine.

  [ 'KEYWORD', qr/(?i:select|from|as)/ ],

An example of how we might transform the data before making the token is as
follows:

  [ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text  ],

As mentioned previously, our regular expression might be naive, but we'll leave
that for now and focus on the C<&text> subroutine.

  sub text {
      my ($label, $value) = @_;
      $value =~ s/^["']//;
      $value =~ s/["']$//;
      return [ $label, $value ];
  }

This says "take the label and the value, strip leading and trailing quotes
from the value and return them in an array reference".
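
For instance, fed the quoted alias from our SQL, it behaves like this (a
sketch):

  my $token = text( 'TEXT', '"date"' );   # yields [ 'TEXT', 'date' ]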

To strip the whitespace, something we don't care about, we simply return
nothing:

 [ 'SPACE', qr/\s+/, sub {} ],

Now that we have our lexer, let's put it to work.  Remember that we had
decided that column aliases were the C<TEXT> not in parentheses but
immediately prior to commas or the C<from> keyword.  But how do we know if
we're inside of parentheses?  We're going to cheat a little bit:

  [ 'PAREN', qr/\(/, sub { [shift,  1] } ],
  [ 'PAREN', qr/\)/, sub { [shift, -1] } ],

With that, we can add one whenever we reach an opening parenthesis and
subtract one when we reach a closing one.  Whenever the running total is zero,
we know that we're outside of parentheses.  For
C<round(months_between(first_date,second_date),0)>, the running total goes
1, 2, 1, 0, so everything up to the final closing parenthesis is seen at a
non-zero depth.

We can get the tokens by repeatedly calling the C<$lexer> iterator.

  while ( defined( my $token = $lexer->() ) ) { ... }

And the tokens would look like this:

  [  'KEYWORD',      'select' ]
  [  'TEXT',       'the_date' ]
  [  'KEYWORD',          'as' ]
  [  'TEXT',           'date' ]
  [  'COMMA',             ',' ]
  [  'TEXT',          'round' ]
  [  'PAREN',               1 ]
  [  'TEXT', 'months_between' ]
  [  'PAREN',               1 ]

And so on ...

Here's how we process the tokens:

   1:  my $inside_parens = 0;
   2:  while ( defined (my $token = $lexer->()) ) {
   3:      my ($label, $value) = @$token;
   4:      $inside_parens += $value if 'PAREN' eq $label;
   5:      next if $inside_parens || 'TEXT' ne $label;
   6:      if (defined (my $next = $lexer->('peek'))) {
   7:          my ($next_label, $next_value) = @$next;
   8:          if ('COMMA' eq $next_label) {
   9:              print "$value\n";
  10:          }
  11:          elsif ('KEYWORD' eq $next_label && 'from' eq $next_value) {
  12:              print "$value\n";
  13:              last; # we're done
  14:          }
  15:      }
  16:  }

This is pretty straightforward, but there are some tricky bits.  Each token is
a two-element array reference, so line three makes the label and value fairly
explicit.  Lines four and five use the "cheat" we mentioned for handling
parentheses.  Line five also skips anything which isn't text and therefore
cannot be a column alias.

Line six is a bit odd.  With C<HOP::Lexer>, passing the string C<peek> to
the lexer returns the next token without actually advancing the C<$lexer>
iterator.  From there, it's straightforward logic to find out whether the
value we have is a column alias which matches our criteria.
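
In other words (a short sketch of the behavior we're relying on):

  my $peeked = $lexer->('peek');   # look ahead without consuming
  my $token  = $lexer->();         # the next real call returns that same token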

Putting all of this together we have:

  #!/usr/bin/perl

  use strict;
  use warnings;
  use HOP::Lexer 'make_lexer';

  my $sql = <<END_SQL;
  select the_date as "date",
  round(months_between(first_date,second_date),0) months_old
  ,product,extract(year from the_date) year
  ,case
    when a=b then 'c'
    else 'd'
    end tough_one
  from XXX
  END_SQL

  my @sql = $sql;
  my $lexer = make_lexer(
      sub { shift @sql },
      [ 'KEYWORD', qr/(?i:select|from|as)/          ],
      [ 'COMMA',   qr/,/                            ],
      [ 'OP',      qr{[-=+*/<>]}                    ],
      [ 'PAREN',   qr/\(/,      sub { [shift,  1] } ],
      [ 'PAREN',   qr/\)/,      sub { [shift, -1] } ],
      [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
      [ 'SPACE',   qr/\s+/,     sub {}              ],
  );

  sub text {
      my ( $label, $value ) = @_;
      $value =~ s/^["']//;
      $value =~ s/["']$//;
      return [ $label, $value ];
  }

  my $inside_parens = 0;
  while ( defined ( my $token = $lexer->() ) ) {
      my ( $label, $value ) = @$token;
      $inside_parens += $value if 'PAREN' eq $label;
      next if $inside_parens || 'TEXT' ne $label;
      if ( defined ( my $next = $lexer->('peek') ) ) {
          my ( $next_label, $next_value ) = @$next;
          if ( 'COMMA' eq $next_label ) {
              print "$value\n";
          }
          elsif ( 'KEYWORD' eq $next_label && 'from' eq $next_value ) {
              print "$value\n";
              last; # we're done
          }
      }
  }

And that prints out the column aliases.

  date
  months_old
  product
  year
  tough_one

So are we done?  No, probably not.  What we really need now are many more
examples of the generated SQL from the original problem statement.  Maybe the
C<&text> subroutine is naive.  Maybe there are other operators we forgot.
Maybe there are floating point numbers embedded in the SQL.  When we are
forced to lex data by hand, fine-tuning the lexer to match our actual data
can take a few tries.
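
For example, if floating point numbers do turn up, one plausible fix (a
sketch) is an extra rule listed before 'TEXT', so that 'TEXT''s C<\w+>
doesn't swallow the integer part and leave the decimal point unlexed:

  my $lexer = make_lexer(
      sub { shift @sql },
      [ 'NUMBER',  qr/\d+(?:\.\d+)?/       ],   # hypothetical: unsigned decimals like 3.14
      [ 'KEYWORD', qr/(?i:select|from|as)/ ],
      # ... remaining rules as before
  );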

It's also important to note that the order of the rules matters.  Each
array reference passed to C<&make_lexer> is evaluated in the order it's
passed.  If we listed the 'TEXT' array reference before the 'KEYWORD' array
reference, the 'TEXT' regular expression would match keywords before
'KEYWORD' ever saw them, generating spurious results.
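
Here's the failure mode in miniature (a sketch):

  # Wrong order: 'TEXT''s \w+ consumes "select" before 'KEYWORD' ever
  # sees it, so every keyword comes out labeled TEXT.
  my $lexer = make_lexer(
      sub { shift @sql },
      [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text ],
      [ 'KEYWORD', qr/(?i:select|from|as)/         ],   # unreachable for bare words
      # ... remaining rules as before
  );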

Happy lexing!