
NAME

yagg::Tutorial - Tutorial for yagg

SYNOPSIS

  # To use the generator
  ./yagg -m nonterminals.yg terminals.lg
  ./output/progs/generate 5

OVERVIEW

This tutorial shows you how to use yagg by way of two examples. In the first example, we create a simple logical expression generator from scratch. In the second example, we create a more sophisticated logical expression generator from existing parser/lexer input files, such as those used by YACC/Bison and LEX/FLEX. These examples, plus a more sophisticated fault tree generator, are included with the distribution in the examples/ directory.

It is assumed that the reader knows a little about formal grammars. Ideally, the reader would have some experience writing grammars for input to parser generators like YACC and Bison.

EXAMPLE: GENERATOR IMPLEMENTED FROM SCRATCH

Let's say that we want to create a logical expression generator that generates expressions containing binary operators, parentheses, negation, and atomic propositions. Examples might be -(p or q), p => -q, and p and (q or r).

Here is a simple grammar that would create such expressions:

  wfe : wfe binary_operator wfe |
        "(" wfe ")" |
        unary_operator wfe |
        atomic ;
  binary_operator : "<=>" | "and" | "or" | "=>" ;
  unary_operator : "-" ;
  atomic : "p" | "q" | "r" ;

You can place the above grammar verbatim into a "language grammar" input file, adding a line containing %% at the top of the file. This is enough for the generator to do its job. Running the generator as:

  $ yagg -r2 foo.yg
  

will generate the generator, compile it, and run it to create expressions of length 2.
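
For example, with the grammar above, the complete foo.yg file is just the %% separator followed by the grammar verbatim:

```
  %%

  wfe : wfe binary_operator wfe |
        "(" wfe ")" |
        unary_operator wfe |
        atomic ;
  binary_operator : "<=>" | "and" | "or" | "=>" ;
  unary_operator : "-" ;
  atomic : "p" | "q" | "r" ;
```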

Input Files

Another way to write the input files is to put the productions for terminals such as "(" and atomic into a "terminal generator" file. This approach is similar to the approach taken by tools such as LEX and YACC, which separate the lexical analysis of tokens from the grammar-based parsing.

Here's the grammar file, logical_expression.yg:

  %%

  wfe : wfe BINARY_OPERATOR wfe |
        LEFT_PAREN wfe RIGHT_PAREN |
        UNARY_OPERATOR wfe |
        ATOMIC ;

Here we're using upper-case as a convention to indicate the terminals. Notice that the strings associated with BINARY_OPERATOR, UNARY_OPERATOR, and ATOMIC have been removed. They are located in the terminal grammar file, logical_expression.lg:

  %%

  ( "<=>" | "and" | "or" | "=>" ) return BINARY_OPERATOR;

  "-" return UNARY_OPERATOR;

  "(" return LEFT_PAREN;

  ")" return RIGHT_PAREN;

  ( "p" | "q" | "r" ) return ATOMIC;

This file is similar to a LEX input file, but instead of regular expressions you must provide one of several terminal formats. (We'll describe these shortly.) For example, the last rule says that an atomic proposition is either p, q, or r.

Generating and Running the Generator

Next, use these input files to generate the generator:

  $ yagg -m logical_expression.yg logical_expression.lg

And then run the generator:

  $ ./output/progs/generate 3

You can use the -r flag to run the generator for a given length. You can also use the -o flag to place the output in a different directory. -o can specify a directory on a remote machine using any of the rsync formats. In this case, -m and -r will execute on the remote machine instead of the local one.
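
For example, something like the following would generate, compile, and run the generator for length 3 in one step, placing the output in a different directory (the directory name here is just an illustration):

```
  $ yagg -m -r 3 -o /tmp/yagg_output logical_expression.yg logical_expression.lg
```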

Terminal Generation Formats

You'll notice that the generator prints out a number of very similar expressions, such as p and q versus q and p. You can control how terminals are generated by using one of several terminal specification formats:

Simple String ("p")

A simple string is used wherever the terminal appears.

Simple Alternation (( "p" | "q" ))

When such a terminal appears, multiple versions will be generated, one for each alternative.

Equivalence Alternation ([ "p" | "q" ])

Terminals of this type are treated as belonging to an equivalence class. That is, a "q" is only generated if the current string already has a "p". This is useful for terminals that represent things like variable names.

It's easier to explain equivalence with an example. Suppose you use the simple alternation ( "p" | "q" ) for a terminal ATOMIC. The output for ATOMIC ATOMIC will be:

  p p
  p q
  q p
  q q

In this case, the first and last generated strings are essentially the same, because each consists of two identical elements. Similarly, the second and third generated strings are the same, because each consists of two different elements. For our logical expression example, we only really care about the first two cases. We can use the equivalence alternation form to get what we want: [ "p" | "q" ]. In this case, the output will be:

  p p
  p q

Equivalence Generator ([ "p#" ])

This form is the same as the equivalence alternation, except that the names are generated as they are needed by substituting "#" in the string with a number starting at 1, and the number of generated strings is unbounded. For example, if you use [ "p#" ] for ATOMIC, the output for ATOMIC ATOMIC will be:

  p1 p1
  p1 p2

If we want to remove the generation of strings with semantically redundant atomic propositions, we would need to modify the ATOMIC terminal specification like so:

  [ "p" | "q" | "r" ] return ATOMIC;

If you now rerun yagg and the string generator:

  $ yagg -r 3 logical_expression.yg logical_expression.lg

you will see that the strings have been constrained. Also note that the generation is faster this time, because the compilation process only needs to recompile the implementation of the ATOMIC generator component.

More Information

This example is in the examples/logical_expression_simple/ directory of the distribution. See the README file.

EXAMPLE: GENERATOR IMPLEMENTED FROM AN EXISTING PARSER

In this second example, let's say that you've already created a parser for your format using LEX and YACC, or FLEX and Bison. If you have, you probably already have the following:

  • A lex.l input file for LEX/FLEX

  • A yacc.y input file for YACC/Bison

  • An include file for headers needed by the parser and lexer

  • Extra files for custom data types and whatnot needed for parsing

The following instructions tell you how to convert them into input files for yagg. In this example, I will use the example parser located in the examples/logical_expression_constrained/logical_expression_parser/ directory of the distribution. (The .y and .l files are in the src/logical_expression_parser subdirectory.) The resulting files are in the examples/logical_expression_constrained directory, along with a README file.

(By the way, if you have problems or have to alter the steps while working with your own files, please let me know so that I can update this tutorial.)

Prepare your terminal lex.lg file

(1)

Copy lex.l to lex.lg

(2)

Delete everything in the definitions section, except for any %{ ... %} block and any %option prefix=... line, if they are there.

(3)

In the rules section:

(1)

Delete everything in the action blocks except any return TOKEN statements

(2)

Delete any start conditions from the rules

(3)

Delete any rules with empty action blocks

(4)

Replace any regular expressions in the rules with appropriate generators. (See above.)

(4)

In the code header block, comment out the #include for the parser-generated header file. (This would normally be y.tab.h, but you may have renamed it.)

(5)

If you used the -Pfoo flag to change the symbol prefix from yy, you'll need to add an %option prefix=foo.

Prepare your nonterminal yacc.yg file

(6)

Copy yacc.y to yacc.yg

(7)

For any rule that has a side effect, add an "unaction" block immediately after the action block that reverses the effect. You don't need to worry about any changes to $$; they are handled automatically. If you have an unaction block, and if any of the positional parameters such as $1 are pointers, you only need to delete them if (1) they would normally be deleted in the action block, and (2) you actually use them in the unaction block.

If you compare the .yg file to the original .y file, you'll see that I added only one undo block, because all the other action blocks only manipulated $$.
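
As a purely hypothetical illustration (the rule, the global node_count counter, and the exact block layout are all assumptions for this sketch; see the .yg files in the distribution for real examples), a side-effecting action and its unaction might look like:

```
  node_list :
    node_list node
    {
      node_count++;       // side effect: count the nodes we've seen
      $$.push_back($2);   // changes to $$ need no undo code
    }
    {
      node_count--;       // unaction: reverse the side effect
    } ;
```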

(8)

Any functions that you call should not have side effects. If they do, you'll need to reverse them in the unaction block(s).

Prepare your include file and additional files

(9)

Create a directory like input_user_code_myproject/src/model.

(10)

Put all your source code in that directory, being careful to create the proper subdirectories if you need them. For example, if you have #include "module/header.h", you'd want to put header.h in input_user_code_myproject/src/model/module. (You don't, however, need to change the #include directive at all.) The default C++ extension is .cc, so you may need to rename your files. (Alternatively, you can copy the GNUmakefile and edit the value of CCFILEEXT.)

Generate the generator

(11)

Run the following command to create the C++ generator code.

  ./yagg -f -u input_user_code_myproject yacc.yg lex.lg

Build the generator

(12)

By default, the output is placed in the output/ directory. Change to that directory and run GNU make to compile it.

  cd output
  make

You can do this automatically by giving yagg the -m flag.

Run the generator

(13)

Generate the strings by running the generate program in the progs/ directory, giving it a positive integer for the length of the strings to generate.

  ./progs/generate 9

Specify any length for the generated strings you want. An error message will be displayed if you choose too small a number.

More Information

This example is in the examples/logical_expression_constrained/ directory of the distribution. See the README file.

CONTROLLING GENERATED STRINGS

The most common problem you will encounter is exponential explosion of strings. If yagg produces strings that you think it should not, first check that the grammar actually says what you think it says.

For example, consider this grammar, which generates strings of "a" and "b" in any order (see ab.yg in examples/ab):

  node_list : node_list node | node ;
  node : "a" | "b" ;
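
At length 3, for example, this grammar yields all eight combinations of "a" and "b" (the generator may emit them in a different order):

```
  a a a
  a a b
  a b a
  a b b
  b a a
  b a b
  b b a
  b b b
```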

Obviously we could move node to the terminal file and use equivalence classes to limit the generated strings. But for now, let's explore ways to limit the strings using the grammar only.

One way is to force all "a"'s to precede all "b"'s. We can do this by modifying the grammar (see ab_grammar_constrained.yg in examples/ab):

  node_list : a_list b_list | a_list | b_list ;
  a_list : a_list "a" | "a" ;
  b_list : b_list "b" | "b" ;

to change the unordered list of "a"'s and "b"'s into one in which all "b"'s follow all "a"'s.

Another approach might be to keep the same grammar but add context-sensitive checks (see ab_check_constrained.yg in examples/ab). In order to do this, we need to pass the names of the nodes back up the grammar productions, which means that we need to define some return types and add action blocks:

  // Include some declarations that we need
  %{
  #include <list>
  #include <string>

  using namespace std;
  %}

  // Define the return types
  %union {
    list<string> text_list;
    string       text;
  }

  // Define the grammar production return types
  %type <text_list> node_list
  %type <text>      node

  %%

  node_list :
    node_list node
    {
      if ($$.size() > 0 && $$.back() == "b" && $2 == "a")
        yyerror("\"a\" can't follow \"b\"!");
      else
      {
        $$.push_back($2);
      }
    } |
    node
    {
      $$.push_back($1);
    };

  node :
    "a" { $$ = $1; } |
    "b" { $$ = $1; } ;

In this case, we check the node_list as we build it to make sure that we never add an "a" after a "b". The call to yyerror is how the generator knows that the string is invalid. Note that I didn't need to write any unaction blocks because the actions did not change any state other than $$.

Which method is better? I would use the latter only for constraints that truly are context-sensitive. Restructuring the grammar also results in faster string generation, because no time is wasted generating strings that will only be invalidated later.

NOTE: this grammar takes advantage of a feature of the generator--that the %union isn't really used as a union, so you can define types of different (and dynamic) sizes. Bad things would happen if you tried to use this grammar with YACC or Bison. To fix it, you would need to change the %union types to pointers and add new and delete to the action blocks, and change "a" and "b" to tokens.

See ab_check_constrained_pointers.yg in examples/ab for these changes. You can also compare this to the ab_parser.y file for the parser in the ab_parser subdirectory.
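
A sketch of those changes, based on the grammar above, might look like the following (the actual ab_check_constrained_pointers.yg may differ in details such as the token names A and B, which are assumptions here; the %{ ... %} header and the lexer rules that return the tokens are omitted):

```
  %union {
    list<string>* text_list;
    string*       text;
  }

  %token <text> A B

  %type <text_list> node_list
  %type <text>      node

  %%

  node_list :
    node_list node
    {
      if (!$1->empty() && $1->back() == "b" && *$2 == "a")
        yyerror("\"a\" can't follow \"b\"!");
      else
      {
        $$ = $1;
        $$->push_back(*$2);
      }
      delete $2;
    } |
    node
    {
      $$ = new list<string>;
      $$->push_back(*$1);
      delete $1;
    };

  node :
    A { $$ = $1; } |
    B { $$ = $1; } ;
```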

For a more realistic example of limiting the generated strings, compare the fault_trees and fault_trees_constrained examples in the examples directory of the distribution.

CONTROLLING OUTPUT FORMATTING

To control output, first create a user_code/src/progs/ directory. Run yagg once so that it creates the output/ directory. Copy output/src/progs/generate.cc to user_code/src/progs/. Modify the Print_Strings function, which inserts a space by default between strings and adds a newline at the end.

Now, when you run yagg, add the -u user_code option so that your modified generate.cc will override the normal generated one. If you want more control, you can remove the whitespace insertion in Print_Strings, then make whitespace an explicit part of your grammar.

AUTHOR

David Coppit <david@coppit.org>