Regexp::ERE - extended regular expressions and finite automata
use Regexp::ERE qw( &ere_to_nfa &nfa_inter &nfa_to_regex &nfa_to_input_constraints &nfa_to_dfa &dfa_to_min_dfa ); # condition 1: begins with abc or def my $nfa1 = ere_to_nfa('^(abc|def)'); # condition 2: ends with 123 or 456 my $nfa2 = ere_to_nfa('(123|456)$'); # condition 1 and condition 2 my $inter_nfa = nfa_inter($nfa1, $nfa2); # compute extended regular expression (string) my $ere = nfa_to_regex($inter_nfa); # compute perl regular expression my $perlre = nfa_to_regex($inter_nfa, 1); # compute weaker input constraints suitable for widgets my ($input_constraints, $split_perlre) = nfa_to_input_constraints($inter_nfa); # minimal dfa (simpler regular expression happens to result) my $nfa3 = ere_to_nfa('^(a|ab|b)*$'); my $dfa3 = nfa_to_dfa($nfa3); my $min_dfa3 = dfa_to_min_dfa($dfa3); my $ere3 = nfa_to_regex($min_dfa3);
Pure-perl module for:
Parsing POSIX Extended Regular Expressions ($ere) into Non-Deterministic Finite Automata ($nfa)
$ere
$nfa
Manipulating $nfas (concatenating, or-ing, and-ing)
Computing Deterministic Finite Automata ($dfas) from $nfas (powerset construction)
$dfa
Computing minimal $dfas from $dfas (Hopcroft's algorithm)
Computing $eres or Perl Regular Expressions from $nfa or $dfa (Warshall algorithm)
Heuristically deriving (possibly weaker) constraints from a $nfa or $dfa suitable for display in a graphical user interface, i.e. a sequence of widgets of type 'free text' and 'drop down';
Example: '^(abc|def)' => $nfa => [['abc', 'def'], 'free text']
'^(abc|def)'
[['abc', 'def'], 'free text']
$ere -> $nfa -> $tree -> $regex ($ere or $perlre) -> $input_constraints The second argument of -> $regex conversions is an optional boolean, true : conversion to a compiled perl regular expression false: conversion to an ere string The -> $input_constraints conversions return a pair ( $input_constraints: aref as described at tree_to_input_constraints() $split_perlre : a compiled perl regular expression )
A set of unicode characters.
Extended regular expression (string). See ere_to_nfa($ere) for the exact syntax.
ere_to_nfa($ere)
Perl regular expression
Non-deterministic finite automaton
Deterministic finite automaton (special case of $nfa)
Intermediate hierarchical representation of a regular expression (which still can be manipulated before stringification), similar to a parse tree (but used for generating, not for parsing).
Ad-hoc data structure representing a list of gui-widgets (free text fields and drop-down lists), a helper for entering inputs conforming to a given $nfa.
Each of the documented subroutines can be imported, for instance use ERE qw(&ere_to_nfa &nfa_match);.
use ERE qw(&ere_to_nfa &nfa_match);
WARNING: $char_classes must be created exclusively by char_to_cc() or interval_list_to_cc() for equivalent character classes to be always the same array reference. For the same reason, $char_classes must never be mutated.
$char_class
char_to_cc()
interval_list_to_cc()
In this implementation, the state transitions of a $nfa are based upon character classes (not single characters). A character class is an ordered list of disjoint, non-mergeable intervals (over unicode code points, i.e. positive integers).
$char_class = [ [ $low_0, $high_0 ] # $interval_0 , [ $low_1, $high_1 ] # $interval_1 , ... ]
Constraints:
1: 0 <= $$char_class[$i][0] (0 <= low) 2: $$char_class[$i][1] <= MAX_CHAR (high <= MAX_CHAR) 3: $$char_class[$i][0] <= $$char_class[$i][1] (low <= high) 4: $$char_class[$i][1] + 1 < $$char_class[$i+1][0] (non mergeable)
Exceptions (anchors used only in the parsing phase only):
begin : [ -2, -1 ] end : [ -3, -2 ] begin or end : [ -3, -1 ]
Immediately after parsing, such pseudo-character classes are removed by nfa_resolve_anchors() (internal subroutine).
nfa_resolve_anchors()
Returns the unique $char_class equivalent to [[ord($c), ord($c)]].
[[ord($c), ord($c)]]
$interval_list is an arbitrary list of intervals. Returns the unique $char_class whose reunion of intervals is the same set as the reunion of the intervals of $interval_list.
$interval_list
Example:
interval_list_to_cc([[102, 112], [65, 90], [97, 102], [113, 122]]) returns [[65, 90], [97, 122]] (i.e [f-p]|[A-Z]|[a-f]|[q-z] => [A-Z]|[a-z])
Note that both $interval_list and $char_class are lists of intervals, but only $char_class obeys the constraints above, while $interval_list does not.
Remark also that interval_list_to_cc($char_class) is the identity (returns the same reference as given) on $char_classes returned by either interval_list_to_cc() or char_to_cc().
interval_list_to_cc($char_class)
Returns the unique $char_class containing all characters of all given @char_classes.
@char_classes
WARNING: nfa_xxx() routines are destructive, the $nfa references given as arguments will not be valid $nfa any more. Furthermore, the same $nfa reference must be used only once as argument. For instance, for concatenating a $nfa with itself, nfa_concat(nfa, nfa) does not work; instead, nfa_concat($nfa, nfa_clone($nfa)) must be used; or even nfa_concat(nfa_clone($nfa), nfa_clone($nfa) if the original $nfa is to be used further.
nfa_xxx()
nfa_concat(nfa, nfa)
nfa_concat($nfa, nfa_clone($nfa))
nfa_concat(nfa_clone($nfa), nfa_clone($nfa)
$nfa = [ $state_0, $state_1, ... ] $state = [ $accepting , $transitions ] $transitions = [ [ $char_class_0 => $state_ind_0 ] , [ $char_class_1 => $state_ind_1 ] , ... ]
In the same $transition, $state_ind_i are pairwise different and are valid indexes of @$nfa. There is exactly one initial state at index 0.
$transition
$state_ind_i
@$nfa
nfa_clone(@nfas)
Maps each of the given @nfas to a clone.
@nfas
nfa_quant($in_nfa, $min, $max, $prev_has_suffix, $next_has_prefix)
Precondition: 0 <= $min && ( $max eq '' || $min <= $max)
0 <= $min && ( $max eq '' || $min <= $max)
Returns $out_nfa, a $nfa computed from $in_nfa.
$out_nfa
$in_nfa
Let L be the language accepted by $in_nfa and M the language accepted by $out_nfa. Then a word m belongs to M if and only if and ordered list (l_1, ..., l_r) of words belonging to L exists such that:
$min <= r and ($max eq '' or r <= $max) and m is the concatenation of (l_1, ..., l_r)
Examples with $in_nfa being a $nfa accepting '^a$':
'^a$'
nfa_quant($in_nfa, 2, 4 ) accepts '^a{2,4}$' nfa_quant($in_nfa, 0, '') accepts '^a{0,}$' (i.e. '^a*$')
$pref_has_prefix and $next_has_prefix are hints for dispatching $min, for example:
$pref_has_prefix
$next_has_prefix
$min
'a+' => 'a*a' (!$prev_has_suffix && $next_has_prefix) 'a+' => 'aa*' ( $prev_has_suffix && !$next_has_prefix) 'a{2,}' => 'aa*a' ( $prev_has_suffix && $next_has_prefix)
nfa_concat(@in_nfas)
Returns $out_nfa, a $nfa computed from @in_nfas.
@in_nfas
Let r be the number of given @in_nfas, L_i the language accepted by $in_nfas[$i] and M the language accepted by $out_nfa. Then a word m belongs to M if and only if an ordered list (l_1, ..., l_r) of words exists, l_i belonging to L_i, such that m is the concatenation of (l_1, ..., l_r).
$in_nfas[$i]
nfa_union(@in_nfas)
$out_nfa accepts a word w if and only if at least one of @in_nfas accepts w.
nfa_inter(@in_nfas)
Returns $out_nfa, a $$nfa computed from @in_nfas.
$out_nfa accepts a word w if and only if each of @in_nfas accepts w.
nfa_match($in_nfa, $str)
Returns true if and only if $in_nfa accepts $str.
$str
nfa_isomorph($nfa1, $nfa2)
Returns true if and only if the labeled graphs represented by $nfa1 and $nfa2 are isomorph. While isomorph $nfas accept the same language, the converse is not true.
$nfa1
$nfa2
nfa_to_dfa($in_nfa)
Compute a deterministic finite automaton from $in_nfa (powerset construction).
The data structure of a deterministic finite automaton (dfa) is the same as that of a non-deterministic one, but it is further constrained: For each state and each unicode character there exist exactly one transition (i.e. a pair ($char_class, $state_index)) matching this character.
($char_class, $state_index)
Note that the following constraint hold for both a $dfa and a $nfa: For each pair of state p1 and p2, there exists at most one transition from p1 to p2 (artefact of this implementation).
dfa_to_min_dfa($in_dfa)
Computes a minimal deterministic $dfa from the given $in_dfa (Hopcroft's algorithm).
$in_dfa
Note that the given $in_dfa must be a $dfa, as returned from nfa_to_dfa(), and not a mere $nfa.
nfa_to_dfa()
Myhill-Nerode theorem: two minimal dfa accepting the same language are isomorph (i.e. nfa_isomorph() returns true).
nfa_isomorph()
$tree = [ $star, [ $alt_0, $alt_1, ... ] ] or $char_class # ref($char_class) eq CHAR_CLASS or undef # accepting nothing $alt = [ $tree_0, $tree_1, ... ]
A $tree is a hierarchical data structure used as intermediate form for regular expression generation routines.
$tree
Similar to a parse tree, except that the $trees described here are not the direct result of the parsing routines ere_to_xxx(); indeed, the parsing routines generate a $nfa, which then can be converted to a $tree.
ere_to_xxx()
A string is spanned by $tree = [$star, [ $alt_0, $alt_1, ... ] ] if it is spanned by one of the $alt_i (if $star is false) of a repetition thereof (if $star is true).
$tree = [$star, [ $alt_0, $alt_1, ... ] ]
$alt_i
$star
A string is spanned by $alt = [ $tree_0, $tree_1, ...] if it is the concatenation of @substrings, each $substrings[$i] being spanned by $$alt[$i].
$alt = [ $tree_0, $tree_1, ...]
@substrings
$substrings[$i]
$$alt[$i]
nfa_to_tree($nfa)
Converts a $nfa to a $tree. Returns undef if the $nfa accepts nothing (not even the empty string).
undef
tree_to_regex($tree, $to_perlre)
Converts a $tree to an $ere (if $to_perlre is false) or to a $perlre (if $to_perlre is true).
$to_perlre
$perlre
$input_constraints = [ $input_constraint_0, $input_constraint_1, ... ] $input_constraint = [ 'word_0', 'word_1', ... ] (drop down) or 'free_text' (free text)
tree_to_input_constraints($tree)
Converts a $tree to a pair ($input_constraints, $split_str).
($input_constraints, $split_str)
$split_perlre is a compiled perl regular expression splitting a string accordingly to $input_constraints. This $perlre matches if and only if each drop down can be assigned a value; then $str =~ $perlre in list context returns as many values as @$input_constraints.
$split_perlre
$input_constraints
$str =~ $perlre
@$input_constraints
An $ere is a perl string.
The syntax an $ere is assumed to follow is based on POSIX ERE (else the ere_to_xxx() routines will die()).
die()
Unsupported POSIX features: back-references, equivalence classes [[=a=]], character class [[:digit:]], collating symbols [[.ch.]].
[[=a=]]
[[:digit:]]
[[.ch.]]
) is always a special character. POSIX says that ) is a normal character if there is no matching (.
)
(
There is no escape sequences such as \t for tab or \n for line feed. POSIX does not specify such escape sequences neither.
\t
\n
\ before a non-special character is ignored (except in bracket expressions). POSIX does not allow it.
\
The empty string is legal in alternations ((|a) is equivalent to (a?)). POSIX does not allow it. The (|a) form is generated by the xxx_to_ere() routines (avoiding quantifiers other than *).
(|a)
(a?)
xxx_to_ere()
*
[a-l-z] is interpreted as ([a-l] | - | z) (but it is discouraged to rely upon this implementation artefact). POSIX says that the interpretation of this construct is undefined.
[a-l-z]
([a-l] | - | z)
In bracket expressions, \ is a normal character, thus ] as character must occur first, or second after a ^ (POSIX compliant, but possibly surprising for perl programmers).
]
^
All unicode characters supported by perl are allowed as literal characters.
Parses an $ere to a $nfa.
WARNING: the parsing routines, in particular ere_to_xxx(), die() on syntax errors; thus the caller may want to eval-trap such errors.
Returns $string with escaped special characters.
ere_to_tree($ere)
nfa_to_tree(ere_to_nfa($ere))
ere_to_regex($ere, $to_perlre)
tree_to_regex(ere_to_tree($ere), $to_perlre)
nfa_to_regex($nfa, $to_perlre)
tree_to_regex(nfa_to_tree($nfa), $to_perlre)
ere_to_input_constraints($ere)
tree_to_input_constraints(ere_to_tree($ere))
nfa_to_input_constraints($nfa)
tree_to_input_constraints(nfa_to_tree($nfa))
nfa_to_min_dfa($nfa)
dfa_to_min_dfa(nfa_to_dfa($nfa))
Loïc Jonas Etienne <loic.etienne@tech.swisssign.com>
Artistic License 2.0 http://www.perlfoundation.org/artistic_license_2_0
To install Regexp::ERE, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Regexp::ERE
CPAN shell
perl -MCPAN -e shell install Regexp::ERE
For more information on module installation, please visit the detailed CPAN module installation guide.