NAME

Perl6::Rules - Implements (most of) the Perl 6 regex syntax

SYNOPSIS

# Perl 5 code...

use Perl6::Rules;

grammar HTML {
    rule doc  :iw { \Q[<HTML>]  <?head>  <?body>  \Q[</HTML>] }
    rule head :iw { \Q[<HEAD>]  <?head_tag>+  \Q[</HEAD>] }
    # etc.
}

$text =~ s:globally:2nd/ <?HTML.doc> /$0{doc}{head}/;

rule subj  { <noun> }
rule obj   { <noun> }
rule noun  { time | flies | arrow }
rule verb  { flies | like | time }
rule adj   { time }
rule art   { an? }
rule prep  { like }

"time flies like an arrow" =~
    m:words:exhaustive/^ [ <?adj>  <?subj> <?verb> <?art> <?obj>
                         | <?subj> <?verb> <?prep> <?art> <?noun> 
                         | <?verb> <?obj>  <?prep> <?art> <?noun>
                         ]
                      /;

print "Found interpretation:\n", $_->dump
    for @$0;


$dna_seq =~ m:overlap{ A <[CT]> <[AG]><3,7> <before: C> };

print "Found sequence: $_ starting at ", $_->pos
    for @$0;

# etc.

DESCRIPTION

This module implements a close simulation of the Perl 6 rule and grammar constructs, translating them back to Perl 5 regexes via a source filter. It therefore suffers from all the usual limitations of a source filter, including the ability to translate complex code spectacularly wrongly.

See LIMITATIONS for a summary of those features that are not currently supported.

When it is use'd, the module expects that any subsequent match (m/.../) or substitution (s/.../.../) in the rest of the source file will be in Perl 6 syntax. It then translates every such pattern back to the equivalent Perl 5 syntax (where possible).

When one of these translated matches/substitutions is executed, it generates a "match object", which is available as $0 (and so, if you use Perl6::Rules, the program name is no longer available as $0). This match object can be treated as a boolean (in which case it returns true if the match succeeded, and false if it did not), or as a string (in which case it returns the complete substring that the match matched), or as an array (in which case it contains all of the numbered captures -- $1, $2, etc. -- from the successful match), or as a hash (in which case it contains all of the internal variables created during the match).

Atoms

Except for the special characters:

#  $  @  %  ^  &  *  +  ?  (  )  {  }  [  ]  <  >  .  |  \

whitespace, and certain special character sequences (see below), any character in a rule matches itself.

Special characters can be made to match themselves by backslashing them:

\#  \$  \@  \%  \^  \&  \*  \+  \?  \(  \)  \{  \}  \[  \]  \<  \>  \.  \|  \\

or by using one of the Perl 6 quoting constructs.

Quantifiers

Quantifiers control how often a particular atom matches. Without a quantifier an atom must match exactly once. The Perl 6 quantifiers are:

atom?           Match the atom zero or one times
                preferring to match once, if possible

atom??          Match the atom zero or one times 
                preferring to match zero times, if possible

atom*           Match the atom zero or more times
                preferring to match as many times as possible

atom*?          Match the atom zero or more times
                preferring to match as few times as possible

atom+           Match the atom one or more times
                preferring to match as many times as possible

atom+?          Match the atom one or more times
                preferring to match as few times as possible

atom<7>         Match the atom exactly 7 times
                (Any positive integer can be used)

atom<7,11>      Match the atom between 7 and 11 times
                preferring to match as many times as possible.
                (Any positive integers can be used)

atom<7,11>?     Match the atom between 7 and 11 times
                preferring to match as few times as possible.
                (Any positive integers can be used)

atom<4,>        Match the atom 4 or more times
                preferring to match as many times as possible.
                (Any positive integers can be used)

atom<4,>?       Match the atom 4 or more times
                preferring to match as few times as possible.
                (Any positive integers can be used)

Note: Perl 6 also allows the numbers in these ranges to be specified as interpolated variables, but due to limitations of the Perl 5 regex engine, the Perl6::Rules module does not currently support this feature.
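For readers coming from Perl 5, the range quantifiers above correspond closely to Perl 5's brace quantifiers. The following is a minimal sketch in plain Perl 5 (it does not load Perl6::Rules, and the sample string is made up for illustration) showing the greedy and non-greedy preferences described above:

```perl
use strict;
use warnings;

my $str = 'a' x 9;    # hypothetical sample: nine 'a' characters

# Perl 6  a<7,11>   ~  Perl 5  a{7,11}    (prefers as many as possible)
my ($greedy) = $str =~ /(a{7,11})/;

# Perl 6  a<7,11>?  ~  Perl 5  a{7,11}?   (prefers as few as possible)
my ($lazy) = $str =~ /(a{7,11}?)/;

# Perl 6  a<4,>     ~  Perl 5  a{4,}      (open-ended upper bound)
my ($open) = $str =~ /(a{4,})/;

printf "greedy=%d lazy=%d open=%d\n",
    length($greedy), length($lazy), length($open);
```

Only the preference changes between the greedy and non-greedy forms; both accept the same set of repetition counts.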

Alternatives

The | operator separates two alternative subpatterns. The resulting pattern matches if either of the alternatives matches:

$animal =~ m/ cat | dog | fish | bird /;

Note: Perl 6 also provides an & operator, but this is not yet supported by Perl6::Rules.

Special metasequences

A dot (.) matches any character at all (including a newline).

There are numerous backslashed metasequences that match a particular single character, usually belonging to a particular class of characters:

\d   Match a single digit
\D   Match any single character except a digit
\e   Match a single escape character
\E   Match any single character except an escape character
\f   Match a single formfeed
\F   Match any single character except a formfeed
\h   Match a single horizontal whitespace
\H   Match any single character except a horizontal whitespace
\n   Match a single newline
\N   Match any single character except a newline
\r   Match a single carriage return
\R   Match any single character except a carriage return
\s   Match a single whitespace character
\S   Match any single character except a whitespace
\t   Match a single tab character
\T   Match any single character except a tab character
\v   Match a single vertical whitespace
\V   Match any single character except a vertical whitespace
\w   Match a single "word" character (alpha, digit, or underscore)
\W   Match any single character except a "word" character

Specifying characters by name or code

Any character can be specified by (Unicode) name, using the \c escape. For example:

\c[LF]
\c[ESC]
\c[CARRIAGE RETURN]
\c[ARABIC LIGATURE TEH WITH MEEM WITH JEEM INITIAL FORM]
\c[HEBREW POINT HIRIQ]
\c[LOWER HALF INVERSE WHITE CIRCLE]

Two or more such named characters can be specified in the same set of square brackets, separated by semicolons:

\c[CR;LF]
\c[ESC;LATIN CAPITAL LETTER Q]

The \C escape produces the complement of the character:

\C[LF]                  Any character except LINE FEED
\C[ESC]                 Any character except ESCAPE
\C[CARRIAGE RETURN]     Any character except CARRIAGE RETURN

The square brackets are always required for named characters.

Characters and character sequences can also be specified by hexadecimal or octal Unicode code:

\x[A]           LINE FEED
\0[12]          LINE FEED
\x[1EA2]        LATIN CAPITAL LETTER A WITH HOOK ABOVE
\0[17242]       LATIN CAPITAL LETTER A WITH HOOK ABOVE
\x[1EA2;A]      LATIN CAPITAL LETTER A WITH HOOK ABOVE; LINE FEED
\0[17242;12]    LATIN CAPITAL LETTER A WITH HOOK ABOVE; LINE FEED

Hexadecimal codes may also be complemented:

\X[A]           Any character except LINE FEED
\X[1EA2]        Any character except LATIN CAPITAL LETTER A WITH HOOK ABOVE

For single coded characters, the square brackets are not required (except to avoid ambiguity):

\xA             LINE FEED
\012            LINE FEED
\x1EA2          LATIN CAPITAL LETTER A WITH HOOK ABOVE
\017242         LATIN CAPITAL LETTER A WITH HOOK ABOVE
\XA             Any character except LINE FEED
\X1EA2          Any character except LATIN CAPITAL LETTER A WITH HOOK ABOVE

Anchors and assertions

Anchors and assertions do not match any characters in the string, but instead test whether a particular condition is true, and cause the match to fail if it is not.

Perl6::Rules supports the following Perl 6 rule assertions:

 ^   Currently matching at the start of the entire string
^^   Currently matching at the start of a line within the string
 $   Currently matching at the end of the entire string
$$   Currently matching at the end of a line within the string

Note that neither $ nor $$ allows for an optional newline before the "end" in question. Use \n?$ and \n?$$ if you require those more forgiving semantics.

<before: subpat>    The current match position is immediately
                    before the specified subpattern
<!before: subpat>   The current match position is not immediately
                    before the specified subpattern

<after: subpat>     The current match position is immediately
                    after the specified subpattern
<!after: subpat>    The current match position is not immediately
                    after the specified subpattern

\b   The current match position is in the middle of a \w\W or \W\w
     sequence (i.e. <after:\w><before:\W> | <after:\W><before:\w> )
\B   The current match position is in the middle of a \w\w or \W\W
     sequence (i.e. <after:\w><before:\w> | <after:\W><before:\W> )

Note: Due to limitations in the Perl 5 regex engine, the <after:...> assertion requires that the subpattern always match a substring of fixed length.
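As a rough guide, the Perl 6 anchors behave like Perl 5's \A, ^ under /m, and \z. The sketch below is plain Perl 5 (no Perl6::Rules involved; the sample text is hypothetical) and illustrates why the strict Perl 6 $ needs an explicit \n? to tolerate a trailing newline:

```perl
use strict;
use warnings;

my $text = "first line\nlast line\n";    # hypothetical sample text

# Perl 6 ^^  ~  Perl 5 ^ under /m : start of any line
my @line_starts = $text =~ /^(\w+)/mg;    # first word of every line

# Perl 6 ^   ~  Perl 5 \A : start of the entire string only
my @str_starts = $text =~ /\A(\w+)/g;     # first word of the string

# Perl 6 $   ~  Perl 5 \z : the very end, no allowance for a final newline
my $strict_end  = $text =~ /line\z/    ? 1 : 0;
# Perl 6 \n?$ ~ Perl 5 \n?\z : tolerate one optional trailing newline
my $lenient_end = $text =~ /line\n?\z/ ? 1 : 0;

print "line starts: @line_starts\n";
print "string starts: @str_starts\n";
print "strict=$strict_end lenient=$lenient_end\n";
```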

Grouping

To group a sequence of characters and have them treated as an atom, use square brackets:

$status =~ m/ [in]?valid /;

Square brackets group, but do not capture.

Capturing

To group a sequence of characters and have the matching substring captured as well, use parentheses instead of square brackets. Each parenthesis captures into a successive "numeric" variable:

$name =~ m/ (Mr|Mrs|Ms|Dr|Prof|Rev) (.+)  /;

print "Title: $1\n";
print "Name:  $2\n";

Whitespace indifference

Whitespace is not significant in a rule and is usually ignored when matching a pattern. For example, this:

m/ <ident> = \N+ /;

matches exactly the same set of strings as:

m/<ident>=\N+/;

To match actual whitespace in a string, use the appropriate backslash escape:

m/ <ident> \h* = \s* \N+/;

or the predefined named rules:

m/ <ident> <ws> = <sp>+ \N+/;

Making whitespace meaningful

Just because whitespace is not significant in a rule doesn't mean it's not significant in the string that a rule is matching. For example:

$str = "module_name = Perl6::Rules";

$str =~ m/ <ident> = \N+ /;

will not match, because there is nothing in the rule to match the whitespace in the string between "module_name" and "=".

However, you can tell a rule to ignore whitespace in the string, by specifying the :w or :words modifier:

$str = "module_name = Perl6::Rules";

$str =~ m:words/ <ident> = \N+ /;

This modifier causes each whitespace sequence in the rule to be automagically replaced by a \s* or \s+ subpattern. That is:

m:words/ next cmd  = \h* <condition>/

Is the same as:

m/ \s* next \s+ cmd \s* = \h* <condition>/

If the whitespace is between two "word" atoms -- as it is between next and cmd in the above example -- then a \s+ (mandatory whitespace) is inserted. If the whitespace is between a "word" and a "non-word" atom -- as it is between cmd and = above -- then a \s* (optional whitespace) is inserted. If the atom on either side of the whitespace would itself match whitespace -- as for = and \h*, and \h* and <condition> -- then no extra whitespace matching is inserted.

The overall effect is that, under :words, any whitespace in the rule matches any whitespace in the string, in the most reasonable way possible.
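To make the expansion concrete, here is a plain Perl 5 rendering of the expanded rule shown above, written with /x so the layout can be kept. It is only a sketch: \w+ stands in for the <condition> subrule (which is not defined in this document), and the input strings are invented for illustration.

```perl
use strict;
use warnings;

# Perl 5 version of:  m/ \s* next \s+ cmd \s* = \h* <condition>/
# (\w+) is a hypothetical stand-in for <condition>.
my $next_cmd = qr/ \s* next \s+ cmd \s* = \h* (\w+) /x;

for my $input ('next cmd = ready', "next\tcmd=ready", 'nextcmd = ready') {
    my $verdict = $input =~ $next_cmd ? 'matches' : 'fails';
    print "'$input' $verdict\n";
}
```

The first two inputs match because whitespace is mandatory only between the two words; the third fails because nothing separates next from cmd.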

Comments

Any unbackslashed # character in a pattern starts a comment which runs to the end of the current line.

m/ <ident>  # name of environment variable
   \h*      # optional whitespace, but stay on the same line
   =        # indicates that the variable is being set
   \s*      # optional whitespace, can be on separate lines
   \N+      # everything else up to the end-of-line is the value
 /;

Evaluated substitutions

When performing a substitution it is possible to interpolate code into the replacement string using the Perl 6 $(...) or @(...) interpolators:

s/ (<sentence>) /On a dit: $( traduisez($1) )/

Note: Perl6::Rules currently only allows substitutions to have a single $(...) or @(...) in the replacement string.

Repeated matches and substitutions

To cause a match or substitution to match or substitute as many times as possible, specify the :g or :globally modifier before the pattern:

$str =~ s:g{foo}{bar};          # s/foo/bar/ as many times as possible
$str =~ s:globally{foo}{bar};   # Ditto

To cause a match or substitution to match or substitute a particular number of times, specify the :x(...) modifier:

$str =~ s:x(2){foo}{bar};       # s/foo/bar/ only the first two times 
                                # "foo" is found

$str =~ s:x(7){foo}{bar};       # s/foo/bar/ only the first seven times 
                                # "foo" is found

The repetition count can be a variable:

for my $n (2..7) {
    $str[$n] =~ s:x($n){foo}{bar};  # s/foo/bar/ only the first $n times 
                                    # "foo" is found
}

If the repetition count is a constant, the :x(...) modifier can also be written as a suffix:

$str =~ s:2x{foo}{bar};     # s/foo/bar/ only the first two times 
                            # "foo" is found

$str =~ s:7x{foo}{bar};     # s/foo/bar/ only the first seven times 
                            # "foo" is found

If you only want the 2nd (or 7th, or $n-th, etc.) occurrence changed, you can use the :nth(...) modifier instead:

$str =~ s:nth(2){foo}{bar};     # s/foo/bar/ only for the second occurrence
                                # of "foo" in the string

$str =~ s:nth(7){foo}{bar};     # s/foo/bar/ only for the seventh occurrence
                                # of "foo" in the string

$str =~ s:nth($ord){foo}{bar};  # s/foo/bar/ only for the $ord-th occurrence
                                # of "foo" in the string

If the ordinal number is a constant, the :nth(...) modifier can also be written as a suffix:

$str =~ s:2nd{foo}{bar};        # s/foo/bar/ only for the second occurrence
                                # of "foo" in the string

$str =~ s:7th{foo}{bar};        # s/foo/bar/ only for the seventh occurrence
                                # of "foo" in the string

You can also combine :globally with an ordinal modifier. For example, to replace every third occurrence of "foo" with "bar":

$str =~ s:globally:3rd{foo}{bar}
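The difference between the counting modifiers can be sketched in plain Perl 5 with a counter inside an s///ge replacement. The helper names below (subst_first_n, subst_nth, subst_every_nth) are hypothetical and exist only for this illustration; this is not how Perl6::Rules itself implements the modifiers.

```perl
use strict;
use warnings;

# Roughly s:x($limit): replace only the first $limit occurrences.
sub subst_first_n {
    my ($str, $limit, $from, $to) = @_;
    my $seen = 0;
    $str =~ s/(\Q$from\E)/ ++$seen <= $limit ? $to : $1 /ge;
    return $str;
}

# Roughly s:nth($nth): replace only the $nth occurrence.
sub subst_nth {
    my ($str, $nth, $from, $to) = @_;
    my $seen = 0;
    $str =~ s/(\Q$from\E)/ ++$seen == $nth ? $to : $1 /ge;
    return $str;
}

# Roughly s:globally:3rd (with $nth = 3): replace every $nth occurrence.
sub subst_every_nth {
    my ($str, $nth, $from, $to) = @_;
    my $seen = 0;
    $str =~ s/(\Q$from\E)/ ++$seen % $nth == 0 ? $to : $1 /ge;
    return $str;
}

print subst_first_n('foo foo foo foo', 2, 'foo', 'bar'), "\n";
print subst_nth('foo foo foo foo', 2, 'foo', 'bar'), "\n";
print subst_every_nth('foo foo foo foo foo foo', 3, 'foo', 'bar'), "\n";
```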

Variations on global matching

Rules that match :globally do so by matching once, then restarting their search at the first character after the end of the previous match. But there are (at least) two other alternative restart strategies for global matching, both of which Perl 6 (and Perl6::Rules) supports.

Matching :globally will never find overlapping matches. For example:

$dna = "ACGTAGTCATGACGTACCA";

$dna =~ m:globally{ A <[ACGT]>* T };

will only match:

"ACGTAGTCATGACGT"

after which it will try again on the remainder of the string ("ACCA") and fail.

But if you actually wanted overlapping matches from every possible start position:

"ACGTAGTCATGACGT"
    "AGTCATGACGT"
        "ATGACGT"
           "ACGT"

then you need to specify :o or :overlap, instead of :globally:

$dna =~ m:overlap{ A <[ACGT]>* T };

This works just like :globally, except that, instead of restarting the search from the first character after the end of the previous match, it restarts the search from the first character after the start of the previous match. Hence it will only ever find one match from any given starting position in the string, but it will find matches from every possible starting position, including those matches that overlap.

Even that may not be enough. Rather than one match at every starting position, you may require every possible match at every starting position:

"ACGTAGTCATGACGT"
"ACGTAGTCAT"     
"ACGTAGT"
"ACGT"
    "AGTCATGACGT"
    "AGTCAT"     
    "AGT"        
        "ATGACGT"
        "AT"     
           "ACGT"

To match in this way, use the :e or :exhaustive modifier:

$dna =~ m:exhaustive{ A <[ACGT]>* T };

Note that, when either :overlap or :exhaustive is specified, the match result returned in $0 changes in structure. For a non-overlapping match $0 consists of:

 $0     # Complete substring matched
@$0     # Unnamed captures: ($0, $1, $2, ...)
%$0     # Named captures

For an overlapping/exhaustive match, $0 consists of:

 $0     # undef
@$0     # The complete $0 of each successive overlapping match
%$0     # Empty hash
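The two restart strategies can be imitated in plain Perl 5, which may help clarify what each modifier collects. This is a sketch under stated assumptions: it uses an ordinary Perl 5 character class for the DNA letters, a capturing lookahead for the overlapping case, and a brute-force substring test for the exhaustive case. It is not the translation that Perl6::Rules generates.

```perl
use strict;
use warnings;

my $dna = 'ACGTAGTCATGACGTACCA';

# Like :overlap, one (greedy) match per starting position. The lookahead
# is zero-width, so each search resumes one character further along.
my @overlap;
while ($dna =~ /(?=(A[ACGT]*T))/g) {
    push @overlap, $1;
}

# Like :exhaustive, every match at every starting position. This
# brute-force version tests each substring, longest first.
my @exhaustive;
for my $start (0 .. length($dna) - 1) {
    for my $len (reverse(1 .. length($dna) - $start)) {
        my $candidate = substr($dna, $start, $len);
        push @exhaustive, $candidate if $candidate =~ /\AA[ACGT]*T\z/;
    }
}

print scalar(@overlap), " overlapping, ", scalar(@exhaustive), " exhaustive\n";
print "$_\n" for @exhaustive;
```

The brute-force loop is quadratic in the string length, which is fine for illustration but would be a poor choice for real sequence data.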

Ignoring case

If you use the :i or :ignorecase modifier, the match ignores upper and lower case distinctions:

$str =~ m:i/perl/;      # Match "Perl" or "perl" or "pErL", etc.

The :i marker can also be placed inside a rule, to turn off case sensitivity in only part of the rule:

$title =~ m/The <sp> [:i journal <sp> of <sp> the ] <sp> ACM /;
#
#   match: The Journal Of The ACM
#      or: The journal of the ACM
# but not: The journal of the acm

Backtracking control

In Perl 6 a single colon is ignored when matching (or, in other words, it matches zero characters).

However, should the pattern subsequently fail to match and backtrack over the single colon, it will not retry the preceding atom. So if you write:

$str =~ m:words/ \( <expr>  [ , <expr> ]* :  \) /

and the match fails to find the closing parenthesis (and hence starts backtracking), it will not attempt to rematch [ , <expr> ]* with one fewer repetition, but will continue backtracking and ultimately fail. This is a useful optimization, since a match with one fewer comma'd expression still wouldn't have a parenthesis after it, so trying it would be a waste of time.
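The closest Perl 5 analogue to the single colon is probably the atomic group (?>...), which likewise refuses to give back what it has matched. That analogy is an assumption made for illustration, not a statement about how Perl6::Rules translates the colon. A minimal plain Perl 5 sketch:

```perl
use strict;
use warnings;

my $count = '1000';

# Ordinary greedy quantifier: \d+ first grabs "1000", then gives back one
# digit so that the trailing 0 can match.
my ($backtracked) = $count =~ /(\d+)0/;

# Atomic group: \d+ keeps everything it matched, so the trailing 0 can
# never match and the whole pattern fails.
my $atomic_ok = $count =~ /((?>\d+))0/ ? 1 : 0;

print "with backtracking, captured: $backtracked\n";
print "atomic group matched: $atomic_ok\n";
```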

Note: Due to the opaque nature of backtracking in the Perl 5 regex engine, Perl6::Rules cannot efficiently implement the "higher level" backtracking control features: ::, :::, commit, and cut. So these constructs are not currently supported.

Starting position

Normally a rule attempts to match from the start of a string. But you can tell the rule to match from the current pos of the string by specifying the :c (or :cont) modifier:

$str =~ m:c/ pattern /  # start where the previous match on $str finished

Code blocks

You can place a Perl code block inside a rule. It will be executed when the rule reaches that point in its matching. Code execution does not usually affect the match; it is typically only used for side-effects:

m/ (\S+) { warn "string not blank..."; $text=$1; }
    \s+  { warn "...but does contain whitespace" }
 /

Note that variables accessed within a code block (or indeed anywhere else inside a Perl 6 rule) must be accessed in Perl 6 syntax. So, this:

m:g/ (\S+) { $::found{$1}++ } /;

is equivalent to the Perl 5:

/ (\S+) (?{ $::found->{$^N}++ }) /g;

and to increment an entry in %::found we'd need the correct Perl 6 syntax:

m:g/ (\S+) { %::found{$1}++ } /;

A code block can be made to cause a match to fail, if it calls the fail function (which is automatically exported from Perl6::Rules):

$count =~ / (\d+): {$1<256 or fail} /

By the way, that "no backtracking" colon is critical there. If $count contained 1000, then $1 would be "1000", the code would execute fail and the rule would backtrack. The colon prevents the \d+ pattern from then rematching just "100" instead of the full "1000", which would erroneously allow the pattern to match.

Code assertions

Blocks of the form { sometest() or fail } are so common that Perl 6 rules (and hence Perl6::Rules) provide a shorthand. Any expression in a <(...)> is treated as a code assertion, which causes a match to fail and backtrack if it is not true at that point in the match. For example, you could rewrite:

$count =~ m/ (\d+): {$1<256 or fail} /

more simply as:

$count =~ m/ (\d+): <($1<256)> /;

Literal variable interpolation

Variables that appear in a Perl 6 rule interpolate differently to variables that appear in a Perl 5 regex. Specifically, in Perl 5:

$dir = "lost+found";
$str =~ /$dir/;

is the same as:

$str =~ /lost+found/;

which would match:

"lostfound"
"losttfound"
"lostttfound"
"losttttfound"
etc.

In Perl 6, an interpolated scalar variable matches its contents literally (as if by eq) against the string. So:

use Perl6::Rules;
$dir = "lost+found";
$str =~ m/$::dir/;

would treat the contents of $dir as a literal sequence of characters to match, and hence (only) match:

"lost+found"

An interpolated array:

use Perl6::Rules;
@cmds = ('get','put','save','load','dump','quit');
$str =~ m/ @::cmds /;

matches if any of its elements eq matches the string at that point. So the above example is equivalent to:

$str =~ /get|put|save|load|dump|quit/;
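In plain Perl 5, the same literal behaviour has to be requested explicitly with \Q...\E or quotemeta. The sketch below (ordinary Perl 5, without Perl6::Rules; the test strings are invented) shows the equivalent of interpolating the scalar and the array literally:

```perl
use strict;
use warnings;

my $dir  = 'lost+found';
my @cmds = ('get', 'put', 'save', 'load', 'dump', 'quit');

# Scalar: \Q...\E quotes metacharacters, so '+' matches a literal plus sign.
my $literal_dir = qr/\Q$dir\E/;

# Array: quote each element, then join the results into one alternation.
my $any_cmd = join '|', map { quotemeta } @cmds;
my $cmd_re  = qr/(?:$any_cmd)/;

my $exact   = 'lost+found'     =~ $literal_dir ? 'yes' : 'no';
my $doubled = 'losttfound'     =~ $literal_dir ? 'yes' : 'no';
my $command = 'please save it' =~ $cmd_re      ? 'yes' : 'no';

print "lost+found: $exact, losttfound: $doubled, command found: $command\n";
```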

An interpolated hash matches a /\w+/ sequence and then requires that that sequence is a valid key of the hash. So:

use Perl6::Rules;

my %cmds = ( get=>'Shorty', put=>'down', quit=>'griping' );

$str =~ m/ %::cmds /;

is a shorthand for:

/ (\w+) { fail unless exists %::cmds{$1} } /

Note that the actual values in the hash are ignored.

However, if the hash being interpolated has a keymatch trait:

use Perl6::Rules;

my %cmds is keymatch(rx/<alpha>+:/)
    = ( get=>'Shorty', put=>'down', quit=>'griping' );

then the rule into which it's interpolated uses that trait's value instead of \w+ as the required subpattern. In which case:

$str =~ m/ %::cmds /;

would become a shorthand for:

/ (<alpha>+:) { fail unless exists %::cmds{$1} } /

instead.

Furthermore, if the interpolated hash also has a valuematch trait:

use Perl6::Rules;

my %cmds is keymatch(rx/<alpha>+:/)
         is valuematch(rx/\s+ <alpha>+:/)
    = ( get=>'Shorty', put=>'down', quit=>'griping' );

then, after the key has been successfully matched, the rule attempts to match the valuematch pattern, and requires that this secondary match be equal to the value for the previously matched key. That is, with a valuematch trait as well, this:

$str =~ m/ %::cmds /;

would become a shorthand for:

/ (<alpha>+:)     { fail unless exists %::cmds{$1} }
  (\s+ <alpha>+:) { fail unless $2 eq %::cmds{$1}  }
/

In other words, when both traits are specified, an interpolated hash has to match one of its keys, followed by that key's value.

Non-literal variable interpolation

Sometimes it would be more useful to interpolate a variable not as a literal sequence of characters to be matched, but rather as a subpattern to be matched (i.e. the way Perl 5 does).

To interpolate a variable in that way in a Perl 6 rule, place the variable in angle brackets. That is:

use Perl6::Rules;
$exclamation = rx/Shee+sh/;
$str =~ m/ <$::exclamation> /;

would treat the contents of $::exclamation as a subpattern (rather than as a literal sequence of characters to match) and hence match:

"Sheesh"
"Sheeesh"
"Sheeeesh"
etc.

but not:

"Shee+sh"

An angle-bracketed interpolated array:

use Perl6::Rules;
@cmds = ( rx/<[gs]>et/, rx/put/, rx/save?/, rx/q[uit]?/ );
$str =~ m/ <@::cmds> /;

treats each of its elements as a subpattern, and matches if any of them matches at that point. So the above example is equivalent to:

$str =~ m/ <[gs]>et | put | save? | q[uit]?/;

(i.e. with the metasequences left intact).

An angle-bracketed interpolated hash first matches a /\w+/ sequence and requires that that sequence is a valid key of the hash. It then treats the corresponding hash value as a subpattern and requires that that subpattern match too. So:

use Perl6::Rules;

my %cmds =
    ( get=>rx/\s+ <ident>/, put=>rx:i/\s+down/, quit=>rx/[\s+ griping]?/);

$str =~ m/ <%::cmds> /;

is a shorthand for:

$str =~ m/ (\w+) { fail unless exists %::cmds{$1} }
           <%::cmds{$1}>
         /

Once again, if the hash being interpolated has a keymatch trait that trait's value is used instead of \w+ to match the key. However, any valuematch trait on an angle-bracketed hash is ignored.

Note: due to limitations of nesting pattern matches, Perl6::Rules requires that any value in an angle-bracketed hash or array must be a precompiled pattern (i.e. either a Perl5-ish qr/.../ or a Perl6-ish rx/.../), not a string.

Predefined named rules

Certain named rules are predefined by Perl 6 (and hence by the Perl6::Rules module). They are:

<ws>        Match any sequence of whitespace
<ident>     Match an identifier (alpha or underscore, followed by \w*)
<prior>     Match using the most recent successful rule
<self>      Match this entire pattern (recursively)
<sp>        Match a single space char
<null>      Match zero characters (i.e. unconditionally)
<alpha>     Match a single alphabetic character
<space>     Match a single whitespace character
<digit>     Match a single digit
<alnum>     Match a single alphabetic or digit
<ascii>     Match a single ASCII character
<blank>     Match a single space or tab 
<cntrl>     Match a single control character
<ctrl>      Match a single control character
<graph>     Match a single non-control character
<lower>     Match a single lower-case character
<print>     Match a single printable character
<punct>     Match a single punctuation character
<upper>     Match a single upper-case character
<word>      Same as \w
<xdigit>    Match a single hexadecimal digit

In addition, every long- or short-form Unicode property name is a valid predefined subrule. For example:

<L> or <Letter>             Match any letter
<Lu> or <UppercaseLetter>   Match any upper-case letter

<Sm> or <MathSymbol>        Match any mathematical symbol

<BidiWS>                    Match any bidirectional whitespace

<Greek>                     Match any Greek character
<Mongolian>                 Match any Mongolian character
<Ogham>                     Match any Ogham character

<Any>                       Match any character

<InArrows>                  Match any character in the "Arrows" block
<InCurrencySymbols>         Match any character in the "CurrencySymbols" block

etc.

In addition, Perl6::Rules supports the Perl-specific <Lr> property, which replaces the non-standard Perl5-specific <L&> property, which matches any upper-, lower-, or title-case letter.

Note that any such named subrule that matches exactly one character may also be used inside a character class.

Code interpolations

Normally code blocks don't actually match against anything. To make them do so, put the code block in angle brackets. For example:

/ (@::cmds)  <{ get_body_for_cmd($1) }> /

This first matches one of the elements of @cmds (as a literal substring). It then calls the get_body_for_cmd subroutine, passing it that substring. The value returned by that call is then used as a subpattern, which must match at that point.

Note: due to limitations of nesting pattern matches, Perl6::Rules requires that any <{...}> block must return a precompiled pattern (i.e. either a Perl5-ish qr/.../ or a Perl6-ish rx/.../), not a string.

Character classes

A character class is an enumerated set of characters and/or properties. In Perl 6, character classes are specified by square brackets inside angle brackets:

$str =~ m/ <[A-Za-z_]> <[A-Za-z0-9_]>* /    # Match an ASCII identifier

A normal character class can also be indicated by a leading plus sign, whilst a complemented character class (i.e. "any character except...") is indicated by a leading minus sign:

$str =~ m/ <[aeiou]> /      # Match a vowel
$str =~ m/ <+[aeiou]> /     # Match a vowel
$str =~ m/ <-[aeiou]> /     # Match a character that isn't a vowel

Two or more square-bracketed sets (including their optional signs) can be placed in the same angle brackets:

$str =~ m/ <[aeiou][tlc]> /     # Match a vowel or 't' or 'l' or 'c'
$str =~ m/ <[aeiou]+[tlc]> /    # Match a vowel or 't' or 'l' or 'c'
$str =~ m/ <[a-x]-[aeiou]> /    # Match a letter between 'a' and 'x'
                                # but not a vowel

Named properties, subrules and backslashed escapes that match a single character can also be placed in the character set:

$str =~ m/ <<alpha>-[aeiou]> /  # Match a non-vowel alphabetic
$str =~ m/ <[\w]-<digit>> /     # Match first letter of an identifier
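For comparison, the set operations above can be approximated in plain Perl 5: union is just a larger bracketed class, and subtraction can be written with a negative lookahead. The sketch below is ordinary Perl 5 (no Perl6::Rules), and the lookahead form is one possible rendering chosen for illustration, not necessarily what the module emits.

```perl
use strict;
use warnings;

# <[aeiou]>  and  <+[aeiou]>   ~  [aeiou]
# <-[aeiou]>                   ~  [^aeiou]
# <[aeiou]+[tlc]>              ~  [aeioutlc]
# <[a-x]-[aeiou]>              ~  (?![aeiou])[a-x]   (lookahead subtracts a set)
my $consonant_a_to_x = qr/\A(?![aeiou])[a-x]\z/;

my @hits = grep { $_ =~ $consonant_a_to_x } qw(b e x y);
print "@hits\n";
```

Here 'e' is rejected because it is a vowel and 'y' because it lies outside the a-x range.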

Interpolated literal strings

Any single-quoted string in angle brackets is treated as a literal sequence of characters to be matched at that point. Whitespace and other metacharacters within the string must match literally.

For example:

$text =~ m/ .*? <'# # # # #'> /;    # Match to first '# # # # #'

Another way to get the same effect is to use a "quotemeta" block:

$text =~ m/ .*? \Q[# # # # #] /;    # Match to first '# # # # #'

The subpattern inside the square brackets following the \Q is treated as a literal string, to be eq matched.

Backreferences

Because variables are interpolated at match-time in Perl 6 rules, backreferences to earlier captures are written as variables, not as backslashed numbers. So, to remove doubled words:

$text =~ s:words:globally{( <alpha>+) $1}{$1};
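For contrast, the traditional Perl 5 spelling of the same idea uses a backslashed number inside the pattern. A minimal plain Perl 5 sketch (the sample sentence is made up, and the \b anchors are an addition to keep the example from matching inside longer words):

```perl
use strict;
use warnings;

my $text = 'the the cat sat sat on the mat';

# \1 refers back to whatever the first parentheses captured.
(my $fixed = $text) =~ s/\b([[:alpha:]]+)\s+\1\b/$1/g;

print "$fixed\n";
```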

Anonymous rule constructors

Under Perl6::Rules, if you use qr to create an anonymous rule you get the Perl 5 interpretation of the pattern:

use Perl6::Rules;

my $pat = qr/[a-z+]:\0[123]/;

#  [a-z+]   Match one lower-case alpha or a '+',
#  :        Match a literal colon,
#  \0       Match a null byte,
#  [123]    Match a '1', a '2', or a '3'

To get the Perl 6 interpretation, use the Perl 6 anonymous rule constructor (rx) instead:

use Perl6::Rules;

my $pat = rx/[a-z+]:\0[123]/;

#  [        Without capturing...
#    a-     Match 'a-',
#    z+     Match 'z' one or more times
#  ]        End of group
#  :        Don't backtrack into previous group on failure
# \0[123]   Match an 'S' (specified via octal code)

You can also use the keyword rule there:

my $pat = rule {[a-z+]:\0[123]};

Note: The rx keyword allows {...}, [...], <...>, or /.../ as pattern delimiters. The rule keyword allows only {...}.

If either needs modifiers, they go before the opening delimiter, as for matches and substitutions:

my $pat = rule :wi { my name is (.*) };
my $pat = rx:wi/ my name is (.*) /;

Named Rules

The rule keyword can also be used to create new named rules, by adding the rule name immediately after the keyword:

rule alpha_ident { <alpha> \w* }

# and later...

@ids = grep m/<alpha_ident>/, @strings;

In the Perl6::Rules implementation such a rule declaration actually creates a subroutine of the same name within the current Perl 5 namespace.

Note: Due to bugs in the current Perl 5 regex engine, captures that occur in named rules that are called as subrules from other rules may not work correctly under Perl6::Rules, and will frequently lead to segfaults and bus errors.

Named captures to external variables

Any set of capturing parentheses can be prefixed with the name of a variable followed by :=. The variable is then used as the destination of the captured substring, instead of assigning it to the next numbered variable.

For example, after:

$input =~ / [ $::num  := (\d+)
            | $::alpha:= (<alpha>+)
            | $::other:=(.)
            ]
          /

then one of $::num, $::alpha, or $::other will have been assigned the captured substring from whichever subpattern actually matched. But none of $1, $2, $3 will have been set (since the named capture overrides the normal numbered capture mechanism).

You can, however, explicitly assign to a numeric variable (for example, to reorder them in some fiendish way):

$pair =~  m:words{ $1:=(\w+) =\> $2:=(.*)
                 | $2:=(.*?) \<= $1:=(\w+)
                 };

Note: due to unreliable interactions between Perl 5 regexes and lexical variables in the current Perl 5 regex engine, under this version of Perl6::Rules only explicitly-qualified package variables and unqualified numeric variables may be used in rules.

Repeated captures can be bound to arrays:

$list =~ m/ @::values:=[ (.*?) , ]* /;

in which case each captured substring will be pushed onto @::values.

Pairs of repeated captures can be bound to hashes:

$opts =~ m:words/ %::options:=[ (<ident>) = (\N+) ]* /;

in which case the first capture in each repetition becomes the key and the second capture becomes the value. If there are more than two captures, the value for that key becomes an array reference, and the second and subsequent captures are stored in that array.

If a single repeated capture is bound to a hash, each captured substring becomes a key of the hash (and the corresponding values are undef):

$opts =~ m:words/ %::options:=[ (<ident>) = \N+ ]* /
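The shape of the resulting hash can be illustrated in plain Perl 5, where a global match in list context returns the captures pairwise. This is only a sketch of the data structure that results: the option text is hypothetical, \w+ stands in for <ident>, and [^\n]+ stands in for \N+.

```perl
use strict;
use warnings;

my $opts = "width = 80\nheight=24\n";    # hypothetical option text

# Two captures per repetition: first becomes the key, second the value.
my %options = $opts =~ /^\s*(\w+)\s*=\s*([^\n]+)/mg;

# One capture per repetition: each capture becomes a key with an undef value.
my %seen = map { $_ => undef } $opts =~ /^\s*(\w+)\s*=/mg;

print "$_ => $options{$_}\n" for sort keys %options;
print "keys only: ", join(', ', sort keys %seen), "\n";
```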

Named captures to internal variables

Perl 6 rules also have their own internal namespace, with their own internal variables. Those variables are marked by a secondary '?' sigil. For example:

$input =~ / [ $?num  := (\d+)
            | $?alpha:= (<alpha>+)
            | $?other:=(.)
            ]
          /

After this match succeeds, one of the three internal variables will have been set. To access these variables, treat $0 as a hash reference:

   if (exists $0->{num})   { print "Got number: $0->{num}\n" }
elsif (exists $0->{alpha}) { print "Got alpha:  $0->{alpha}\n" }
elsif (exists $0->{other}) { print "Got other:  $0->{other}\n" }

Scalar internal variables are stored under a key that is the name of the variable stripped of its leading $?. Array and hash internal variables are stored under their full variable name. For example:

$list =~ m/ @?values:=[ (.*?) , ]* /;

for (@{ $0->{'@?values'} }) {
    print "Another value was: $_\n";
}

Named subrules can also capture their result into an internal scalar variable of the same name. To do so, prefix the rule name inside the angle brackets with a question mark:

$pair =~ m:words/ <?key> =\> <?value> /;

print "Key was: $0->{key}\n";
print "Val was: $0->{value}\n";

Naturally enough, internal variables can also be accessed within the rule itself. For example:

$pair =~ m:words/ <?key> =\> <?value> { $?first = substr($?key,0,1) } /;
print "Key starts:  $0->{first}\n";
print "Key was:     $0->{key}\n";
print "Val was:     $0->{value}\n";

Return values from matches

In Perl 6, a match always returns a "match object", which is also available as (lexical) $0. This match object evaluates differently in different contexts:

  • In a boolean context it evaluates true or false (i.e. did the match succeed?)

    m/<ident>/;
    if ($0) {
        print "Success!\n";
    }

  • In a string context it evaluates to the captured substring:

    do {
        $text =~ m:cont/,? (<ident>)/ and print $hash{$0};
    } while $0;

  • When used as an array reference, $0 provides a reference to an array containing the numbered captures:

    $text =~ m:words/ (<ident>) \: (\N+)/;
    
    print "Option was:   $0->[0]\n";    # $0->[0] same as "$0"
    print "Option name:  $0->[1]\n";    # $0->[1] same as  $1
    print "Option value: $0->[2]\n";    # $0->[2] same as  $2
                                        # etc.

  • When used as a hash reference, $0 provides a reference to a hash containing its internal named variables:

    $text =~ m:words/ <?ident> \: @?vals:=[\s* (\S+)]+ /;
    
    print "Option name: ", $0->{ident}, "\n";
    print "Option vals: ", @{ $0->{'@?vals'} }, "\n";

Since it is not feasible to intercept the return value of a Perl 5 regex match, under Perl6::Rules, the return value is still the Perl 5 return value. However, $0 is set to the polymorphic match object shown above.
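An object that behaves differently in boolean, string, array, and hash contexts can be built in plain Perl 5 with the overload pragma. The following is a minimal sketch of that general technique, not the module's actual match class; the package name Sketch::Match and its constructor arguments are hypothetical.

```perl
package Sketch::Match;
use strict;
use warnings;

# The object is a blessed reference to a scalar that holds the data hash,
# so @{} and %{} can be overloaded without touching the object's own storage.
use overload
    'bool'   => sub { ${ $_[0] }->{ok} },
    '""'     => sub { ${ $_[0] }->{matched} },
    '@{}'    => sub { ${ $_[0] }->{numbered} },
    '%{}'    => sub { ${ $_[0] }->{named} },
    fallback => 1;

sub new {
    my ($class, %args) = @_;
    my $data = {
        ok       => $args{ok} ? 1 : 0,
        matched  => defined $args{matched} ? $args{matched} : '',
        numbered => $args{numbered} || [],
        named    => $args{named}    || {},
    };
    return bless \$data, $class;
}

package main;

my $m = Sketch::Match->new(
    ok       => 1,
    matched  => 'size: 10',
    numbered => ['size: 10', 'size', '10'],
    named    => { ident => 'size' },
);

print "Success!\n"            if $m;    # boolean context
print "Whole match: $m\n";              # string context
print "Capture 1: $m->[1]\n";           # array reference
print "Ident: $m->{ident}\n";           # hash reference
```

Blessing a scalar reference, rather than a hash, is a design choice that keeps the %{} overload from recursing into itself.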

Note that within a regex, $0 acts like an internal variable, so you can capture or assign to it to control the overall substring that is returned. For example:

use Perl6::Rules;

$quoted_str =~ m{ (<["'`]>) ([\\?.]*?) $1 }
#
# default behaviour: "$0" includes delimiters


$quoted_str =~ m{ (<["'`]>) $0:=([\\.|<!$1>]*) $1 }
#
# "$0" now excludes delimiters because it was
# explicitly bound only to contents of quoted string
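In plain Perl 5 the same distinction is usually made by choosing between the whole match and an inner capture group. A rough sketch, in which quoted_parts is a hypothetical helper and the pattern is a Perl 5 approximation of the rules above, not a translation produced by the module:

```perl
use strict;
use warnings;

# Return (whole match including delimiters, contents without delimiters).
sub quoted_parts {
    my ($str) = @_;
    return unless $str =~ /(["'`])((?:\\.|(?!\1).)*)\1/s;
    my $whole = substr($str, $-[0], $+[0] - $-[0]);    # text of the entire match
    return ($whole, $2);
}

my ($with, $without) = quoted_parts(q{say "hi there" now});
print "With delimiters:    $with\n";
print "Without delimiters: $without\n";
```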

Grammars

Named rules can be placed in a particular namespace, called a "grammar". For example:

grammar Identity {
    rule name :words { Name \: (\N+) }
    rule age  :words { Age  \: (\d+) }
    rule addr :words { Addr \: (\N+) }
    rule desc :words { <name> <age> <addr> }

    # etc.
}

Then, to access these named rules, call them as if they were (Perl 6) methods:

$id =~ m/ <Identity.desc> /;

Note: Perl6::Rules uses a regular package for each grammar you specify, adding each rule as a subroutine of that package. Be careful not to clobber your existing packages and classes when defining new grammars.

Like classes, grammars can inherit:

grammar Letter {
    rule text     { <greet> <body> <close> }

    rule greet :w { [Hi|Hey|Yo] $to:=(\S+?) , $$}

    rule body     { <line>+ }

    rule close :w { Later dude, $from:=(.+) }

    # etc.
}

grammar FormalLetter is Letter {

    rule greet :w { Dear $to:=(\S+?) , $$}

    rule close :w { Yours sincerely, $from:=(.+) }

}

This syntax is fully supported by Perl6::Rules.

Note: Due to bugs in the Perl 5 regex engine, captures that occur in rules or subrules called in from other grammatical namespaces may not work correctly under Perl6::Rules, and will frequently lead to segfaults and bus errors.
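Because each grammar is an ordinary package, rule inheritance can be pictured with ordinary Perl 5 method resolution. The sketch below is only an illustration of that idea: the Sketch:: package names, the hand-written Perl 5 patterns, and the recipient helper are all assumptions, and the module's generated code may look quite different.

```perl
use strict;
use warnings;

package Sketch::Letter;
sub greet { return qr/(?:Hi|Hey|Yo)\s+(\S+?)\s*,/ }
sub body  { return qr/(?:[^\n]+\n?)+/ }

package Sketch::FormalLetter;
our @ISA = ('Sketch::Letter');                    # "is Letter"
sub greet { return qr/Dear\s+(\S+?)\s*,/ }        # overrides the inherited rule

package main;

# Look up the greet rule through the given grammar and return the addressee.
sub recipient {
    my ($grammar, $text) = @_;
    my $greet = $grammar->greet;
    return $text =~ $greet ? $1 : undef;
}

print recipient('Sketch::Letter',       "Hi Bob,"),     "\n";
print recipient('Sketch::FormalLetter', "Dear Alice,"), "\n";
```

Here Sketch::FormalLetter supplies its own greet but still finds body through its parent package.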

DEBUGGING

If the module is loaded with the -translate flag:

use Perl6::Rules -translate;

it translates any subsequent Perl 6 rules back to Perl 5 syntax, prints the translated source file, and exits before attempting to compile it.

If the module is loaded with the -debug flag:

use Perl6::Rules -debug;

it adds a considerable number of debugging statements into each translated rule, producing extensive tracking of the construction and matching of each rule.

The match object ($0) also provides a dump method that shows the various values that were retrieved from the match.

LIMITATIONS

This module implements most, but not all, of the proposed Perl 6 semantics. Generally speaking, a Perl 6 feature has been omitted only where there is no way (or no efficient way) to implement it within the constraints of the Perl 5 regex engine.

  • Only one $(...) or @(...) is allowed in the replacement text of a substitution, and its closing paren must be the last closing paren of the string. That is:

    s/ <?ident> <?rnum> /marker for $(lookup($?ident).' '.from_roman($?rnum)) here/

    is fine, but:

    s/ <?ident> <?rnum> /marker for $(lookup $?ident) $(from_roman $?rnum) here/

    is not.

  • The :first (i.e. match once only between resets) modifier is not implemented.

  • The :u0, :u1, :u2, :u3 modifiers are not implemented.

  • The :perl5 modifier is not supported. If you want a Perl 5 pattern under use Perl6::Rules, just use qr/.../ or a raw /.../ (i.e. no m before the delimiters).

  • "Bare" Perl 6 patterns are not supported. Every Perl 6 pattern must be specified with an explicit rx, m, s, or rule keyword. Bare /.../ patterns and qr/.../ patterns are treated as Perl 5 patterns.

  • The match string's pos is only set correctly when the :cont modifier is specified.

  • You cannot use arbitrary delimiters when specifying a rule. Only m{...}, m[...], m<...>, and m/.../ are supported. Likewise for rx, rule, and s.

  • Lookbehinds (<after...> and <!after...>) are restricted to fixed length patterns.

  • Repetitions must be statically defined (i.e. a variable can't be used in an <n,m> qualifier).

  • The & operator is not yet implemented.

  • Variables used anywhere in a rule/rx pattern must be specified in Perl 6 syntax (i.e. $a[0] always means $a->[0])

  • Any subpattern interpolated by a <$scalar>, <@array>, <%hash>, or <{block}> construct must be a precompiled regular expression, not a raw string.

  • <.> does not always work correctly (esp. for combining characters) due to bugs in Perl 5.8.3.

  • Due to bugs in the handling of match-time interpolations in the Perl 5.8.3 regex engine, subrules that capture may produce segfaults during or immediately after the match.

  • Due to problems in Perl 5.8.3's handling of lexical variables in patterns (and especially in code blocks inside patterns), the module does not allow lexical variables to be used in Perl 6 rules. To enforce this, all variables used in a Perl 6 rule must include at least one explicit :: in their name. That is:

    our ($keyword, %valid);
    
    # and later...
    
    m/ $::keyword:=(<ident>) <( %::valid{$::keyword} )> /

    but not:

    my ($keyword, %valid);
    
    # and later...
    
    m/ $keyword:=(<ident>) <( %valid{$keyword} )> /

  • The Perl 5 nonstandard L& property (which is equivalent to Lu + Ll + Lt) has been renamed to Lr (mnemonic: Letter-regular).

  • The various "cut" operators (except for :) are not implemented. That is, ::, :::, <commit>, and <cut> are not supported.

  • Rules cannot be specified with parameter lists. Consequently subrules cannot be called with arguments.
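The package-variable requirement in the list above relies on a standard Perl 5 distinction: an our variable lives in the symbol table and can be reached by a name containing ::, whereas a my variable cannot. A small plain Perl 5 sketch (names_seen is a hypothetical helper used only for illustration):

```perl
use strict;
use warnings;

our $keyword = 'package';         # also reachable as $::keyword or $main::keyword

sub names_seen {
    my $keyword = 'lexical';      # hides the package variable's short name here
    return ($keyword, $::keyword);
}

my ($short, $qualified) = names_seen();
print "Short name sees:     $short\n";
print "Qualified name sees: $qualified\n";
```

The explicitly qualified name always refers to the package variable, which is why the module asks for at least one :: in every variable used in a rule.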

WARNING

The syntax and semantics of Perl 6 are still being finalized and consequently are subject to change at any time. The same caveat applies to this module.

DEPENDENCIES

Filter::Simple, Attribute::Handlers

AUTHOR

Damian Conway (DCONWAY@cpan.org)

BUGS AND IRRITATIONS

No doubt there are many. You are strongly advised not to use this module in production code yet.

Comments, suggestions, and patches are welcome, but due to the volume of email I now receive from Nigerian widows and dispossessed heirs to mining fortunes, I have some very tight mail filters deployed. If you'd like me to actually see your message regarding this module, please include the marker:

[P6R]

somewhere in your subject line.

Also please be patient if I am not able to respond immediately (i.e. within a few months) to your bug report.

SPONSORSHIP

This module was developed under a grant from The Perl Foundation. Hence it was made possible by the generosity of people like yourself. Thank you.

If you'd like to help the Foundation continue to work for the betterment of the entire Perl community you can find out how at:

http://www.perlfoundation.org/index.cgi?page=contrib

COPYRIGHT

Copyright (c) 2004, The Perl Foundation. All Rights Reserved.
This module is free software. It may be used, redistributed
   and/or modified under the same terms as Perl itself.
