NAME

Decl::Semantics::Parse - implements a parser specification.

VERSION

Version 0.01

SYNOPSIS

A parser, by nature, converts a text stream into a tree structure. You can get it to do things other than building a tree structure, but that is its inherent nature, because we human beings parse our incoming text (in the form of an audio stream) into abstract syntax trees in our heads while understanding things (well, depending on who you listen to, but it's a useful model). And of course, our computers work the same way. So somewhere in the process of getting from text - that is, code - into actions taken, every computer program or data structure goes through a phase of being an abstract tree, even if only in a potential sense.

Well, Decl happens to be built of trees, so naturally that's the default output of a parser built into this language. Now note: you can also define a parser that, given some text, outputs a callable code object. The Perl parser is used in just such a manner, and is the default parser for code in Decl. But because Decl tries to be as flexible as possible, you can override that, either at the global level or in any particular code block, as I'll illustrate below. So if you build a parser that returns a callable object in some way (whether by building code in Perl, or by doing something fancy with Inline::Java for all I know), then you can use it to define code objects on the fly.

The parsers in Decl are based on those in Mark Jason Dominus' marvelous, marvelous book Higher-Order Perl. The standard setup in HOP is to define a lexer (to break the text up into tokens), then to pass the token stream through the parser itself. This makes the entire process a lot easier to organize, and since tokenization is already a useful tool, I decided to go with it.
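To make the lexer half of that split concrete, here is a minimal sketch in plain Perl, with no Decl involved. The token table, the names `tokenize` and `@token_table`, and the arithmetic input are all assumptions made for illustration; this is not the lexer from Higher-Order Perl or from Decl::Parser, just the same idea of trying an ordered list of patterns at the current position.

```perl
use strict;
use warnings;

# Hypothetical token table: [ NAME, pattern ] pairs, tried in order.
my @token_table = (
    [ NUMBER => qr/\d+/     ],
    [ OP     => qr/[-+*\/]/ ],
    [ PAREN  => qr/[()]/    ],
    [ SPACE  => qr/\s+/     ],
);

# Break $text into [ NAME, value ] pairs; whitespace tokens are dropped.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    TOKEN: while (pos($text) < length $text) {
        for my $entry (@token_table) {
            my ($name, $re) = @$entry;
            # \G anchors at the current position; /c keeps it on failure.
            if ($text =~ /\G($re)/gc) {
                push @tokens, [ $name, $1 ] unless $name eq 'SPACE';
                next TOKEN;
            }
        }
        die "Cannot tokenize at offset " . pos($text) . "\n";
    }
    return @tokens;
}

my @tokens = tokenize('1 + 2 * (4 - 5)');
print join(' ', map { "$_->[0]($_->[1])" } @tokens), "\n";
```

Because the table is ordered, an earlier entry wins when two patterns could match at the same position, which is why the regex example that follows can list a specific ATOM pattern first and a catch-all ATOM last.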

BASIC EXAMPLE: REGEX

Let's start with one of his examples, shall we? He provides a regexp parser on page 436 in Chapter 8. In Decl, it looks like this:

   parse regex
      tokens
         ATOM   "\\x[0-9a-fA-F]{0,2}|\\\d+|\\."
         PAREN  "[()]"
         QUANT  "[*+?]"
         BAR    "|"
         ATOM   "."
         
      rules
         regex
            alternative BAR regex
            alternative
         alternative
            qatom alternative
            (nothing)
         qatom
            atom QUANT
            atom
         atom
            ATOM
            "(" regex ")"
            

Given the input (a|b)+(c|d*) - see below for how to pass input to a parser - this returns the nodal structure

   regex
      alternative
         qatom
            atom
               regex
                  atom
                     ATOM "a"
                  BAR "|"
                  atom
                     ATOM "b"
            QUANT "+"
         atom
            PAREN "("
            regex
               atom
                  ATOM "c"
               BAR "|"
               qatom
                  atom
                     ATOM "d"
                  QUANT "*"
            PAREN ")"
         

That's a pretty lanky structure, but it does serve the purpose of getting text into a data structure you can do stuff with (like searching it or walking it or passing it off to a template for some other purpose).
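For comparison, the same grammar can be written as a hand-rolled recursive-descent parser in plain Perl. This is only a sketch under my own assumptions: the sub names, the `[ name, children... ]` node layout and the `render` helper are hypothetical, and the BAR pattern is written `[|]` so that it matches a literal bar. It is not how Decl::Parser builds its trees, and its output spells out every rule level, so it is more verbose than the listing above.

```perl
use strict;
use warnings;

# Token table transcribed from the grammar above (my transcription).
my @token_table = (
    [ ATOM  => qr/\\x[0-9a-fA-F]{0,2}|\\\d+|\\./ ],
    [ PAREN => qr/[()]/  ],
    [ QUANT => qr/[*+?]/ ],
    [ BAR   => qr/[|]/   ],
    [ ATOM  => qr/./s    ],
);

sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    TOKEN: while (pos($text) < length $text) {
        for my $entry (@token_table) {
            my ($name, $re) = @$entry;
            if ($text =~ /\G($re)/gc) { push @tokens, [ $name, $1 ]; next TOKEN; }
        }
        die "Cannot tokenize at offset " . pos($text) . "\n";
    }
    return \@tokens;
}

my ($toks, $i);    # token list and cursor shared by the rule subs

# True if the next token has the given type (and value, if one is given).
sub peek {
    my ($type, $value) = @_;
    my $t = $toks->[$i] or return 0;
    return $t->[0] eq $type && (!defined $value || $t->[1] eq $value);
}
sub take { return $toks->[$i++] }

# regex: alternative BAR regex | alternative
sub parse_regex {
    my @kids = (parse_alternative());
    if (peek('BAR')) {
        push @kids, take();
        push @kids, parse_regex();
    }
    return [ regex => @kids ];
}

# alternative: qatom alternative | (nothing)
sub parse_alternative {
    return ['alternative'] unless peek('ATOM') || peek('PAREN', '(');
    my $qatom = parse_qatom();
    return [ alternative => $qatom, parse_alternative() ];
}

# qatom: atom QUANT | atom
sub parse_qatom {
    my @kids = (parse_atom());
    push @kids, take() if peek('QUANT');
    return [ qatom => @kids ];
}

# atom: ATOM | "(" regex ")"
sub parse_atom {
    return [ atom => take() ] if peek('ATOM');
    die "Expected an atom\n" unless peek('PAREN', '(');
    my $open  = take();
    my $inner = parse_regex();
    die "Expected )\n" unless peek('PAREN', ')');
    return [ atom => $open, $inner, take() ];
}

sub parse {
    my ($text) = @_;
    ($toks, $i) = (tokenize($text), 0);
    my $tree = parse_regex();
    die "Trailing input\n" if $i < @$toks;
    return $tree;
}

# Render a node as indented text, one node per line.
sub render {
    my ($node, $depth) = @_;
    my ($name, @rest) = @$node;
    return ('   ' x $depth) . qq{$name "$rest[0]"\n} if @rest == 1 && !ref $rest[0];
    return ('   ' x $depth) . "$name\n" . join '', map { render($_, $depth + 1) } @rest;
}

print render(parse('(a|b)+(c|d*)'), 0);
```

Each sub corresponds to one rule, and each alternative within a rule becomes one branch, which is the correspondence the declarative form lets you skip writing by hand.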

We can tweak it a little. If you tack an asterisk onto any tag in the grammar, the parser will omit that level from the output tree:

   parse regex
      tokens
         ATOM   "\\x[0-9a-fA-F]{0,2}|\\\d+|\\."
         PAREN  "[()]"
         QUANT  "[*+?]"
         BAR    "|"
         ATOM   "."
         
      rules
         regex
            alternative* BAR regex
            alternative*
         alternative
            qatom alternative*
            (nothing)
         qatom
            atom QUANT
            atom
         atom
            ATOM*
            "("* regex* ")"*
            

Given the input (a|b)+(c|d*), this returns the nodal structure

   regex
      qatom
         atom
            atom "a"
            BAR "|"
            atom "b"
         QUANT "+"
      atom
         atom "c"
         BAR "|"
         qatom
            atom "d"
            QUANT "*"
            

That arguably preserves the semantics of the original regex, without keeping the syntactic overhead, and will probably be more useful.
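The effect of omitting a level can be imitated after the fact on a tree held as nested arrayrefs. The sketch below is a simplification under my own assumptions: `collapse` is a hypothetical helper that drops every interior node with a given name and splices its children into the parent, whereas the asterisk in a Decl grammar applies to individual tags in individual rules and can also lift a token's text into its parent.

```perl
use strict;
use warnings;

# Nodes are [ name, children... ]; leaves are [ NAME, "text" ].
# Returns a list: the rebuilt node, or just its children when the
# node's name appears in %$omit.
sub collapse {
    my ($node, $omit) = @_;
    my ($name, @kids) = @$node;
    return $node if @kids == 1 && !ref $kids[0];    # leaf: keep as is
    my @new = map { collapse($_, $omit) } @kids;
    return $omit->{$name} ? @new : [ $name, @new ];
}

my $tree = [ regex =>
    [ alternative =>
        [ qatom => [ atom => [ ATOM => 'a' ] ] ],
        [ 'alternative' ],                          # the (nothing) case
    ],
];

my ($flat) = collapse($tree, { alternative => 1 });
print "$flat->[0] now has ", scalar(@$flat) - 1, " child: $flat->[1][0]\n";
```

An empty node that is omitted simply disappears, which is why the (nothing) branches leave no trace in the shorter tree.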

Once our parser is defined, it becomes a new tag, so

   regex "(a|b)+(c|d*)"
   

is now shorthand for the tree structure shown above. To insert at build time, we use

   <= (regex) "(a|b)+(c|d*)"
   

For a longer non-callable macro insertion, we'll want a better example, but let's assume something like this:

   <= (regex)
      lkjlkjsdf
      lkjljlksjdf
      lkjljsdf
      

USING A TOKENIZER ALONE

A parser can also be run as a tokenizer alone, returning a stream of tokens. This is used for the text streams in PDF::Declarative, where commands can be interspersed into the text (that's still a work in progress, of course). If you override that parser, you can build PDFs using whatever text stream formalism you find useful.

   example here after it's written
   

To iterate over that stream, we treat it as a filter on a given text stream, like this:

   do {
      ^foreach token in my_text|pdf_tokenizer {
         if (ref $token eq 'ARRAY') {
            # handle a command token
         } else {
            # we have a word
         }
      }
   }

A token stream is a special type of stream, actually - the iterator returns strings for words, and arrayrefs for identified tokens, which are generally equivalent to commands. This distinguishes it from normal data iterators, which return an arrayref for each row. I mention this because it affects the way you build your ^foreach specification; a data iterator returning arrayrefs would allow you to provide two local variables, but a token stream can't, because some of the tokens aren't arrayrefs.
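The mixed return type is easy to see in a standalone sketch. Everything here is assumed for illustration: `make_token_stream` is a hypothetical helper, and the `\name` command syntax is invented, since the real PDF::Declarative tokenizer isn't shown in this document. The point is only that the consumer has to test `ref` on every token.

```perl
use strict;
use warnings;

# Returns a closure; each call yields the next token, or undef at the end.
# Words come back as plain strings, commands (hypothetically written as
# \name in the text) as [ 'COMMAND', name ] arrayrefs.
sub make_token_stream {
    my ($text) = @_;
    my @pieces = split ' ', $text;
    return sub {
        return unless @pieces;
        my $piece = shift @pieces;
        return $piece =~ /^\\(\w+)$/ ? [ COMMAND => $1 ] : $piece;
    };
}

my $next = make_token_stream('Some \bold text \normal here');
my (@words, @commands);
while (defined(my $token = $next->())) {
    if (ref $token eq 'ARRAY') {
        push @commands, $token->[1];    # a command token
    } else {
        push @words, $token;            # a plain word
    }
}
print "words: @words\ncommands: @commands\n";
```

A row iterator could unpack each arrayref into two variables up front; here that would fail on the plain-string words, which is the restriction described above.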

To call a tokenizer from outside Decl, you'd do something like this:

   use Decl qw(-nofilter PDF::Declarative);
   
   my $tree = Decl->new;
   $tree->load (<<EOF);
      text my_text
         ...
   EOF
   
   # Run my_text through the pdf_tokenizer parser as a filter.
   my $iterator = $tree->iterate ("my_text|pdf_tokenizer");
   while (my $token = $iterator->next) {
      if (ref $token eq 'ARRAY') {
         # handle a command token
      } else {
         # we have a word
      }
   }

CALLABLE PARSED OBJECTS - EXAMPLE: CALCULATE

A parser can also skip right past the nodal structure stage, transforming your language directly into callable code. The Higher-Order Perl example that best fits that model is the calculator; Dominus actually uses the calculator as his first example, but I thought the regexp was a simpler initial example.

First, let's translate the HOP calculator grammar into Decl style, allowing it to generate a nodal structure. Even if you define a parser to be able to build a callable object, its parse tree is still available if you ask for it explicitly, so even the decorated parser below, if used in a non-callable context, will generate a parse tree for you. It's just easier to illustrate without the extra syntax.

   grammar here
   

Now let's go ahead and add the specifications necessary to generate a callable object. These are mostly making use of the "actions" feature.

   grammar here
   

Now we have a number of different ways to use this parser. First is simply as a parser to extract the parse tree of whatever we defined; I'll skip that, because it was covered in the previous section.

Second, we can call it just like any other code-generating object, say as an event handler. The default parser for code snippets is "perl", of course, but you can direct Decl to use any other code-generating parser like this:

   on my_event calculate < {
      something
   }
   

That's pretty boring in this case, because the grammar we've defined doesn't permit us to use parameters, so we will always calculate the same thing. Eventually, I'll need and use this feature in some actual application, and I'll try to remember to link to it here.

Finally, we can just call the parser from Perl, like this:

   parse calculate
      ...
      
   do {
      print ^calculate ("1 + 2 * (4 - 5)") . "\n";
   }

For simple parsers, this last case will probably be the most useful.

CALLING A PARSER FROM OUTSIDE DECL

Of course, we can also call the parser from outside Decl, like this:

   use Decl (-nofilter);
   
   my $tree = Decl->new;
   $tree->load (<<EOF);
      parse calculate
         ...
   EOF
   
   my $result = $tree->parser('calculate')->parse('1 + 2 * (4 - 5)');
   

Here, $result gets the value of -1. If you call a non-code-generating parser like this, you'll get a Decl::Node structure back.
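To check that arithmetic without Decl installed, here is a small standalone evaluator in plain Perl. The grammar in the comment, and the sub names `calculate`, `expression`, `term` and `factor`, are my own assumptions for this sketch; the actual calculate grammar is elided above, and this is not the parser from Higher-Order Perl.

```perl
use strict;
use warnings;

# Grammar assumed for this sketch:
#   expression: term   (('+' | '-') term)*
#   term:       factor (('*' | '/') factor)*
#   factor:     NUMBER | '(' expression ')'
sub calculate {
    my ($text) = @_;
    die "Unexpected character in input\n" if $text =~ /[^\d\s.+\-*\/()]/;
    my @tokens = $text =~ /\d+(?:\.\d+)?|[-+*\/()]/g;
    my $value  = expression(\@tokens);
    die "Unexpected token '$tokens[0]'\n" if @tokens;
    return $value;
}

sub expression {
    my ($tokens) = @_;
    my $value = term($tokens);
    while (@$tokens && $tokens->[0] =~ /^[-+]$/) {
        my $op  = shift @$tokens;
        my $rhs = term($tokens);
        $value = $op eq '+' ? $value + $rhs : $value - $rhs;
    }
    return $value;
}

sub term {
    my ($tokens) = @_;
    my $value = factor($tokens);
    while (@$tokens && $tokens->[0] =~ /^[*\/]$/) {
        my $op  = shift @$tokens;
        my $rhs = factor($tokens);
        $value = $op eq '*' ? $value * $rhs : $value / $rhs;
    }
    return $value;
}

sub factor {
    my ($tokens) = @_;
    my $token = shift @$tokens;
    die "Unexpected end of input\n" unless defined $token;
    return $token if $token =~ /^\d/;
    die "Unexpected token '$token'\n" unless $token eq '(';
    my $value = expression($tokens);
    my $close = shift @$tokens;
    die "Missing ')'\n" unless defined $close && $close eq ')';
    return $value;
}

print calculate('1 + 2 * (4 - 5)'), "\n";    # prints -1
```

Here the "actions" are folded straight into the rule subs, so no tree is ever built; that is the same shortcut a code-generating parser takes when it skips the nodal structure stage.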

EXAMPLE: DECL'S OWN PARSER

The standard parser for a Decl line is this:

   parse Dline
      tokens
         WORD
         LPAREN    "\("
         RPAREN    "\)"
         LBRACK    "\["
         RBRACK    "\]"
         COMMA     ","
         EQUALS    "="
         STRING    
         PARSEFLAG "<"
      actions
      rules

That's the actual parser used by default in Decl. You can override the line parser for a given tag; we use this for the 'select' tag, for instance. The indentation structure and bracketing are currently handled by Parse::Indented, and that probably won't change (but you never know).

EXAMPLE: SELECT PARSER

The select tag uses SQL to retrieve information from data iterators, and since SQL is, well, a standard query language (kind of), it's supported natively in Decl, mostly because we already have this fancy parser just sitting around ready to do that kind of thing. The nice thing, of course, is that means you don't have to write an SQL parser, because I've already done it for you.

   parse SQLselect

IMPLEMENTATION

This particular class implements the parse node in the specification structure; the class Decl::Parser implements the parser itself. In other words, here we are concerned with building a Decl::Parser object that will then be asked to do actual parsing. The tags claimed by user-defined parsers are also registered in this phase, constituting macros.

defines(), tags_defined()

Called by Decl::Semantics during import, to find out what xmlapi tags this plugin claims to implement.

build_payload ()

The build function is then called when this object's payload is built (i.e. in the stage when we're adding semantics to our parsed syntax). It builds the parser and registers its tag with the application. Instances are handled by Decl::Semantics::Macro.

AUTHOR

Michael Roberts, <michael at vivtek.com>

BUGS

Please report any bugs or feature requests to bug-decl at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Decl. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

LICENSE AND COPYRIGHT

Copyright 2010 Michael Roberts.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.