NAME

Parse::FSM - Deterministic top-down parser based on a Finite State Machine

SYNOPSIS

  use Parse::FSM;
  $fsm = Parse::FSM->new;
  
  $fsm->prolog($text);
  $fsm->epilog($text);
  $fsm->add_rule($name, @elems, $action);
  $fsm->start_rule($name);
  
  $fsm->parse_grammar($text);

  $fsm->write_module($module);
  $fsm->write_module($module, $file);
  
  $parser = $fsm->parser; # isa Parse::FSM::Driver
  $parser->input(\&lexer);
  $result = $parser->parse;
  
  # script
  perl -MParse::FSM - Grammar.yp Parser::Module
  perl -MParse::FSM - Grammar.yp Parser::Module lib\Parser\Module.pm

DESCRIPTION

This module compiles the Finite State Machine used by the Parse::FSM::Driver parser module.

It can be used by a sequence of add_rule calls, or by parsing a yacc-like grammar in one go with parse_grammar.

It can be used as a script to generate a module from a grammar file.

The result of compiling the parser can be used immediately by retrieving the parser object, or a pre-compiled module can be written to disk by write_module. This module can then be used by the client code of the parser.

As usual in top-down parsers, left recursion is not supported and generates an infinite loop. This parser is deterministic and does not implement backtracking.

METHODS - SETUP

new

Creates a new object.

METHODS - BUILD GRAMMAR

start_rule

Name of the grammar start rule. It defaults to the first rule added by add_rule.

prolog, epilog

Perl code to include in the generated module near the start of the generated module and near the end of it.

add_rule

Adds one rule to the parser.

  $fsm->add_rule($name, @elems, $action);

$name is the name of the rule, i.e. the syntactic object recognized by the rule.

@elems is the list of elements in sequence needed to recognize this rule. Each element can be one of:

  • A string that will match with that token type from the lexer.

    The empty string is used to match the end of input and should be present in the grammar to force the parser to accept all the input;

  • An array reference of a list of all possible tokens to accept at this position.

  • A subrule name inside square brackets, optionally followed by a repetition character that asks the parser to recursively descend to match that subrule at the current input location.

    The accepted forms are:

    [term] - recurse to the term rule;

    [term]? - term is optional;

    [term]* - accept zero or more terms;

    [term]+ - accept one or more terms;

    [term]<+,> - accept one or more terms separated by commas, any token type can be used instead of the comma;

$action is the Perl text of the action executed when the rule is recognized, i.e. all elements were found in sequence.

It has to be enclosed in brackets {}, and can use the following lexical variables that are declared by the generated code:

  • $self : object pointer;

  • @item : values of all the tokens or rules identified in this rule. The subrule call with repetitions return an array reference containing all the found items in the subrule;

parse_grammar

Parses the given grammar text and adds to the parser. Example grammar follows:

  {
    # prolog
    use MyLibrary;
  }
  
  main   : (number | name)+ <eof> ;
  number : 'NUMBER' { $item[0][1] } ; # comment
  name   : 'NAME'   { $item[0][1] } ; # comment
  
  expr   : <list:    number '+' number > ;
  
  <start: main >
  
  {
    # epilog
    sub util_method {...}
  }
prolog

If the text contains a code block surrounded by braces before the first rule definition, the text is copied without the external braces to the prolog of generated module.

epilog

If the text contains a code block surrounded by braces after the last rule definition, the text is copied without the external braces to the epilog of generated module.

statements

Statements are either rule definitions of directives and end with a semi-colon ;. Comments are as in Perl, from a hash # sign to the end of the line.

rule

A rule defines one sentence to match in the grammar. The first rule defined is the default start rule, i.e. the rule parsed by default on the input. A rule name must start with a letter and contain only letters, digits and the underscore character.

The rule definition follows after a colon and is composed of a sequence of tokens (quoted strings) and sub-rules, to match in sequence. The rule matches when all the tokens and sub-rules in the definition match in sequence.

The top level rule should end with <eof> to make sure all input is parsed.

The rule can define several alternative definitions separated by '|'.

The rule definition finishes with a semi-colon ';'.

A rule can call an anonymous sub-rule enclosed in parentheses.

action

The last item in the rule definition is a text delimited by {} with the code to execute when the rule is matched. The code can use $self to refer to the Parser object, and @item to refer to the values of each of the tokens and sub-rules matched. The return value from the code defines the value of the rule, passed to the upper level rule, or returned as the parse result.

If no action is supplied, a default action returns an array reference with the result of all tokens and sub-rules of the matched sentence.

quantifiers

Every token or sub-rule can be followed by a repetition specification: '?' (zero or one), '*' (zero or more), '+' (one or more), or '<+,>' (comma-separated list, comma can be replaced by any token).

directives

Directives are written with angle brackets.

<eof>

Can be used in a rule instead of the empty string to represent the end of input.

<list: RULE TOKEN RULE >

Shortcut for creating lists of operators separated by tokens, returns the list of rule and token values.

<start: START_RULE >

Defines the start rule of the grammar. By default the first defined rule is the start rule; use <start:> to override that.

METHODS - USE PARSER

parser

Computes the Finite State Machine to execute the parser and returns a Parse::FSM::Driver object that implements the parser.

Useful to build the parser and execute it in the same program, but with the run-time penalty of the time to setup the state tables.

write_module

Receives as input the module name and the output file name and writes the parser module.

The file name is optional; if not supplied is computed from the module name by replacing :: by / and appending .pm, e.g. Parse/Module.pm.

The generated code includes parse_XXX functions for every rule XXX found in the grammar, as a short-cut for calling parse('XXX').

PRE-COMPILING THE GRAMMAR

The setup of the parsing tables and creating the parsing module may take up considerable time. Therefore it is useful to separate the parser generation phase from the parsing phase.

precompile

A parser module can be created from a yacc-like grammar file by the following command. The generated file (last parameter) is optional; if not supplied is computed from the module name by replacing :: by / and appending .pm, e.g. Parse/Module.pm:

  perl -MParse::FSM - Grammar.yp Parser::Module
  perl -MParse::FSM - Grammar.yp Parser::Module lib\Parser\Module.pm

This is equivalent to the following Perl program:

  #!perl
  use Parse::FSM;
  Parse::FSM->precompile(@ARGV);

The class method precompile receives as arguments the grammar file, the generated module name and an optional file name, and creates the parsing module.

AUTHOR

Paulo Custodio, <pscust at cpan.org>

ACKNOWLEDGEMENTS

Calling pre-compiler on import borrowed from Parse::RecDescent.

BUGS and FEEDBACK

Please report any bugs or feature requests through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Parse-FSM.

LICENSE and COPYRIGHT

Copyright (C) 2010-2011 Paulo Custodio.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.