NAME

Language::FormulaEngine::Parser - Create parse tree from an input string

VERSION

version 0.02

SYNOPSIS

  my $parse_tree= Language::FormulaEngine::Parser->new->parse($string);

DESCRIPTION

This class scans tokens from an input string and builds a parse tree. In compiler terminology, it is both a Scanner and a Parser. It performs a top-down recursive descent parse, because this is easy to implement and gives good error messages. It only parses strings, but leaves room for subclasses to implement streaming. By default, the parser simply applies a Grammar to the input without checking whether the functions or variables exist, but it can be subclassed to do more detailed analysis during the parse.

The generated parse tree is made up of Function nodes (each infix operator is converted to a named function) and each Function node may contain Symbols, Strings, Numbers, and other Function nodes. The parse tree can be passed to the Evaluator for instant execution, or passed to the Compiler to generate an optimized perl coderef. The parse tree is lightweight, and does not include token/context information; this could also be added by a subclass.

PUBLIC API

parse

Parse a new input text, updating all derived attributes with the result of the operation. It returns the value of "parse_tree" (which is undef if the parse failed). On failure, the exception is stored in "error" and other attributes like "token_pos" may contain useful diagnostic information.

parse_tree

This holds the generated parse tree, or undef if the parse failed. See "PARSE NODES".

error

This is undef if the parse succeeded, else an error message describing the syntax that ended the parse.

functions

A set (hashref) of all function names encountered during the parse.

symbols

A set (hashref) of all non-function symbols encountered (variables, constants, etc.).

reset

Clear the results of the previous parse, to re-use the object. Returns $self for chaining.

deparse

  my $formula_text= $parser->deparse($tree);

Return a canonical formula text for the current "parse_tree", or for a parse tree that you supply.

EXTENSIBLE API

These methods and attributes are documented for purposes of subclassing the parser.

input

The input string being scanned. Code within the parser should access this as $self->{input} for efficiency.

input_pos

Shortcut for pos($self->{input}).

token_type

Type of current token scanned from input. Code within the parser should access this as $self->{token_type} for efficiency.

token_value

Value of current token scanned from input, with escape sequences etc. resolved to a sensible perl value. Code within the parser should access this as $self->{token_value} for efficiency.

token_pos

An offset within input where this token started. Code within the parser should access this as $self->{token_pos} for efficiency.

next_token

Advance to the next token, replacing the values of the token_ variables and updating input_pos. Returns the token_type. All token types are true except EOF, which has a type of 0, so the method returns true if it parsed a token and false if it reached EOF. It dies if no token could be parsed, and it throws an exception if you call next_token again after the EOF token.

This method is a wrapper around "scan_token". Override that method to add new token types.

scan_token

Pattern-match the next token, and either return $type => $value or an empty list if the syntax is invalid. This is intended to be overridden by subclasses.
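The next_token / scan_token contract described above can be illustrated with a small standalone scanner. This is only a sketch under stated assumptions, not the module's code: the package name MiniScanner, its reduced set of token patterns, and its error messages are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical standalone scanner, NOT Language::FormulaEngine::Parser itself.
package MiniScanner;

sub new {
    my ($class, $input) = @_;
    my $self = bless { input => $input }, $class;
    pos($self->{input}) = 0;
    return $self;
}

# Returns ($type => $value), or an empty list when nothing matches.
sub scan_token {
    my $self = shift;
    return (0 => '')          if $self->{input} =~ /\G\z/gc;   # EOF has type 0
    return (Number => $1 + 0) if $self->{input} =~ /\G(\d+(?:\.\d+)?)/gc;
    return (Identifier => $1) if $self->{input} =~ /\G([A-Za-z_][\w.]*)/gc;
    return ($1 => $1)         if $self->{input} =~ /\G([-+*\/(),])/gc;
    return;
}

# Wrapper: skips whitespace, records token_pos, dies on unknown syntax.
sub next_token {
    my $self = shift;
    die "Can't call next_token after end of input\n"
        if defined $self->{token_type} && $self->{token_type} eq '0';
    $self->{input} =~ /\G\s+/gc;
    $self->{token_pos} = pos($self->{input}) || 0;
    my ($type, $value) = $self->scan_token
        or die "Unknown syntax at offset $self->{token_pos}\n";
    @{$self}{qw(token_type token_value)} = ($type, $value);
    return $type;
}

package main;

my $scanner = MiniScanner->new('rate * (hours + 2.5)');
my @types;
while ($scanner->next_token) {        # false only when EOF is reached
    push @types, $scanner->{token_type};
}
print join(' ', @types), "\n";        # prints "Identifier * ( Identifier + Number )"
```

Because only EOF has a false type, the scan loop needs no separate end-of-input test. A subclass in this style would add a token type by overriding scan_token alone.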

consume_token

  return $self->consume_token if $self->{token_type} eq $desired_type;

This is a shorthand for returning the current token_value while also calling next_token.

token_context

  my $text= $self->token_context(%options);

Default behavior generates a string like:

  "'blah blah' on line 15, char 12"

Passing token_context(multiline => 1) generates a string like:

  "Expected something else at line 15, char 16\n" .
  "blah blah blah token blah blah\n" .
  "               ^^^^^\n"

Multiline additionally takes arguments as described in "format_context_multiline" in Language::FormulaEngine::Parser::ContextUtil.
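To show how a message of this shape can be derived from an offset such as token_pos, here is a rough standalone sketch. The function name context_sketch, its argument order, and the 1-based line and column numbering are assumptions made for illustration; the module's own formatting lives in ContextUtil and may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation: describe the token
# that starts at offset $pos and is $len characters long within $input.
sub context_sketch {
    my ($input, $pos, $len, %opts) = @_;
    my $before     = substr($input, 0, $pos);
    my $line_no    = 1 + ($before =~ tr/\n//);      # count preceding newlines
    my $line_start = rindex($before, "\n") + 1;     # 0 when on the first line
    my $col        = $pos - $line_start + 1;        # 1-based column (assumed)
    my $token      = substr($input, $pos, $len);
    return "'$token' on line $line_no, char $col" unless $opts{multiline};

    # Multiline form: the whole source line, then carets under the token.
    my $line_end = index($input, "\n", $pos);
    $line_end = length $input if $line_end < 0;
    my $line = substr($input, $line_start, $line_end - $line_start);
    return "at line $line_no, char $col\n"
         . "$line\n"
         . (' ' x ($col - 1)) . ('^' x $len) . "\n";
}

my $text = "a + b\nfoo bar baz";
print context_sketch($text, 10, 3), "\n";
print context_sketch($text, 10, 3, multiline => 1);
```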

GRAMMAR

Parse Rules

The default grammar implements the following rules:

  expr      ::= or_expr
  or_expr   ::= and_expr ( 'or' and_expr )*
  and_expr  ::= not_expr ( 'and' not_expr )*
  not_expr  ::= ( 'not' | '!' ) cmp_expr | cmp_expr
  cmp_expr  ::= sum_expr ( ( '=' | '==' | '<>' | '\u2260' | '<' | '<=' | '>' | '>=' ) sum_expr )*
  sum_expr  ::= prod_expr ( ('+' | '-') prod_expr )*
  prod_expr ::= ( unit_expr ('*' | '/') )* unit_expr
  unit_expr ::= '-' unit_expr | Identifier '(' list ')' | '(' (expr|list) ')' | Identifier | Number | String
  list      ::= expr ( ',' expr )* ','?

Identifier, Number, String, and all the punctuation symbols are tokens.

The parser uses a Recursive Descent algorithm implemented as the following method calls. Each method consumes tokens from $self and returns a parse node (see "PARSE NODES"):

parse_expr
parse_or_expr
parse_and_expr
parse_not_expr
parse_cmp_expr
parse_sum_expr
parse_prod_expr
parse_unit_expr
parse_list
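The way each grammar rule maps onto one method can be sketched for three of the rules above. This is a simplified standalone illustration, not the module's code: tokens are pre-split into an array, numbers and identifiers are kept as plain strings, and the function names add, sub, mul, and div are assumptions (only 'negative' is named in this document).

```perl
use strict;
use warnings;

my @tokens;   # remaining input tokens for the sketch

sub peek    { @tokens ? $tokens[0] : '' }
sub consume { shift @tokens }

# sum_expr ::= prod_expr ( ('+' | '-') prod_expr )*
sub parse_sum_expr {
    my $node = parse_prod_expr();
    while (peek() eq '+' || peek() eq '-') {
        my $fn = consume() eq '+' ? 'add' : 'sub';
        $node = [ $fn, $node, parse_prod_expr() ];
    }
    return $node;
}

# prod_expr ::= ( unit_expr ('*' | '/') )* unit_expr
sub parse_prod_expr {
    my $node = parse_unit_expr();
    while (peek() eq '*' || peek() eq '/') {
        my $fn = consume() eq '*' ? 'mul' : 'div';
        $node = [ $fn, $node, parse_unit_expr() ];
    }
    return $node;
}

# unit_expr ::= '-' unit_expr | '(' expr ')' | Identifier | Number  (subset)
sub parse_unit_expr {
    my $tok = consume();
    die "Unexpected end of input\n" unless defined $tok;
    return [ 'negative', parse_unit_expr() ] if $tok eq '-';
    if ($tok eq '(') {
        my $node  = parse_sum_expr();
        my $close = consume();
        die "Expected ')'\n" unless defined $close && $close eq ')';
        return $node;
    }
    die "Unexpected '$tok'\n"
        unless $tok =~ /^(?:\d+(?:\.\d+)?|[A-Za-z_][\w.]*)$/;
    return $tok;
}

sub parse_formula {
    @tokens = $_[0] =~ /\d+(?:\.\d+)?|[A-Za-z_][\w.]*|\S/g;
    my $tree = parse_sum_expr();
    die "Unexpected '$tokens[0]'\n" if @tokens;
    return $tree;
}

# Render the Lisp-style arrayrefs as text for inspection.
sub to_sexpr {
    my $n = shift;
    return ref $n ? '(' . join(' ', map { to_sexpr($_) } @$n) . ')' : $n;
}

print to_sexpr(parse_formula('1 + 2 * x')), "\n";   # prints "(add 1 (mul 2 x))"
```

Operator precedence falls out of the call order: parse_sum_expr only ever sees complete products, so no precedence table is needed.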

Token Types

'Number'

All the common decimal representations of integers and floating point numbers which perl can parse: optional digits and a decimal point, followed by digits and an optional exponent, ending at either the end of the input or a non-alphanumeric character.

'String'

A single-quoted or double-quoted string, treating a double occurrence of the quote character to mean a literal quote character. ("Pascal style")
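The doubled-quote convention can be matched with a single pattern per quote character. The following is a minimal sketch, assuming a hypothetical helper named scan_pascal_string that works on the current pos() of a string reference; it is not taken from the module.

```perl
use strict;
use warnings;

# Sketch only: scan one Pascal-style quoted string at the current pos() of
# $$input_ref and return its decoded value, or nothing if none starts here.
sub scan_pascal_string {
    my ($input_ref) = @_;
    if ($$input_ref =~ /\G"((?:[^"]|"")*)"/gc) {
        (my $value = $1) =~ s/""/"/g;     # "" inside the string means "
        return $value;
    }
    if ($$input_ref =~ /\G'((?:[^']|'')*)'/gc) {
        (my $value = $1) =~ s/''/'/g;     # '' inside the string means '
        return $value;
    }
    return;
}

my $input = q{'it''s' rest};
pos($input) = 0;
print scan_pascal_string(\$input), "\n";   # prints "it's"
```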

Keywords...

Keywords include the "word" tokens like 'OR', but also every text literal seen in a parse rule such as operators and punctuation. The token_type of the keyword is the canonical version of the keyword, and the token_value is the actual text that was captured. The pattern matches the longest keyword possible.
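One common way to get longest-match behavior from a Perl regex is to sort the alternatives by length before joining them. The sketch below assumes a hand-written keyword list, a hypothetical scan_keyword helper, and lowercase as the canonical form; how the module derives its keyword table and canonical names is not shown in this document.

```perl
use strict;
use warnings;

# Assumed keyword list for illustration only.
my @keywords = ('=', '==', '<>', '<', '<=', '>', '>=', '+', '-', '*', '/',
                '(', ')', ',', '!', 'and', 'or', 'not');

# Longest first, so the alternation tries '<=' before '<'. Word-like keywords
# get a trailing \b so that 'or' does not match the start of 'order'.
my $alternation = join '|',
    map  { my $q = quotemeta($_); /\w\z/ ? "$q\\b" : $q }
    sort { length($b) <=> length($a) } @keywords;
my $keyword_re = qr/\G($alternation)/i;

# Returns (token_type, token_value), or an empty list if no keyword is here.
sub scan_keyword {
    my ($input_ref) = @_;
    return unless $$input_ref =~ /$keyword_re/gc;
    return (lc($1), $1);    # canonical (lowercased) type, captured text
}

my $formula = 'a <= b';
pos($formula) = 2;
my ($kw_type, $kw_value) = scan_keyword(\$formula);
print "$kw_type\n";         # prints "<=" rather than "<"
```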

'Identifier'

Any alpha (or underscore) followed by any run of alphanumerics (including underscore and period).

Parse Nodes

The parse tree takes a minimalist approach to node classification. In this default implementation, numbers are represented as plain perl scalars, strings and symbolic references are represented as blessed scalar refs, and function calls are represented as blessed Lisp-style arrayrefs.

A blessed node only needs to support one method: ->evaluate($namespace).

The class name of the blessed nodes should be ignored. A function is anything which can("function_name"), a string is anything which can("string_value"), and a symbolic reference is anything which can("symbol_name"). Also, the blessed strings automatically stringify to their value, behaving almost like plain perl scalars.
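Classifying nodes by capability rather than by class name might look like the sketch below. The My::* classes and describe_node are hypothetical stand-ins so the example runs on its own, and the symbol accessor is assumed to be named symbol_name, as in the description of new_symbol.

```perl
use strict;
use warnings;
use Scalar::Util 'blessed';

# Hypothetical node classes, only to make the sketch runnable. Real code would
# receive nodes from the parser and should not depend on their class names.
package My::Call   { sub function_name { $_[0][0] } }
package My::Symbol { sub symbol_name   { ${ $_[0] } } }
package My::String { sub string_value  { ${ $_[0] } } }

# Duck-typed dispatch: unblessed values are numbers, everything else is
# identified by the methods it supports.
sub describe_node {
    my $node = shift;
    return "number $node" unless blessed $node;
    return 'call to ' . $node->function_name      if $node->can('function_name');
    return 'string "' . $node->string_value . '"' if $node->can('string_value');
    return 'symbol ' . $node->symbol_name         if $node->can('symbol_name');
    return 'unknown node';
}

my $call = bless([ 'sum', 1, 2 ], 'My::Call');
print describe_node($call), "\n";   # prints "call to sum"
```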

Subclasses of Parser should implement new node types as needed. You probably also need to update "deparse".

The parser rules create nodes by methods on the Parser class, for easy subclassing.

new_call

  $node= $parser->new_call( $function_name, $parameters );

Generate a node for a function call. The returned node has attributes function_name and parameters.

new_symbol

  $node= $parser->new_symbol($symbol_name);

A reference to a symbolic value (i.e. variable or constant). It has one attribute symbol_name.

new_string

  $node= $parser->new_string($string_value);

A string literal. It has an attribute string_value holding the raw value.

new_number

  $plain_scalar= $parser->new_number($value);

A numeric constant. It has an attribute number_value holding the raw value.

get_negative

  $negative_node= $parser->get_negative( $node );

Utility method to get the "opposite of" a parse node. By default, this wraps the node in a call to the function 'negative', unless the node is already a call to that function, in which case it unwraps and returns the parameter. Plain numbers are simply negated.
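The three cases described above can be sketched on unblessed Lisp-style arrayrefs of the form [ $function_name, @parameters ]. The function negative_sketch is hypothetical, and because the module's real nodes are blessed, this only approximates the documented behavior.

```perl
use strict;
use warnings;
use Scalar::Util 'looks_like_number';

# Sketch only: negate a plain number, unwrap an existing 'negative' call,
# or wrap anything else in a new 'negative' call.
sub negative_sketch {
    my $node = shift;
    return -$node
        if !ref($node) && looks_like_number($node);               # plain number
    return $node->[1]
        if ref($node) eq 'ARRAY' && $node->[0] eq 'negative';     # --x becomes x
    return [ 'negative', $node ];                                 # wrap
}

my $wrapped = negative_sketch('x');          # [ 'negative', 'x' ]
my $again   = negative_sketch($wrapped);     # back to 'x'
print "$wrapped->[0] $again\n";              # prints "negative x"
```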

AUTHOR

Michael Conrad <mconrad@intellitree.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2019 by Michael Conrad, IntelliTree Solutions llc.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.