TITLE

Synopsis 5: Rules

AUTHOR

Damian Conway <damian@conway.org> and Allison Randal <al@shadowed.net>

VERSION

  Maintainer: Larry Wall <larry@wall.org>
  Date: 24 Jun 2002
  Last Modified: 9 Dec 2004
  Number: 5
  Version: 7

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them "rules" because they haven't been regular expressions for a long time. (The term "regex" is still acceptable.)

New match state variable

The underlying match state object is now available as the $/ variable, which is implicitly lexically scoped. All access to the current (or most recent) match is through this variable, even when it doesn't look like it. $1, $2, etc. are just elements of $/, as is the entire matched string, which is available as $0.

Unchanged features

  • Capturing: (...)

  • Repetition quantifiers: *, +, and ?

  • Alternatives: |

  • Backslash escape: \

  • Minimal matching suffix: ??, *?, +?

Modifiers

  • The extended syntax (/x) is no longer required...it's the default.

  • There are no /s or /m modifiers (changes to the meta-characters replace them - see below).

  • There is no /e evaluation modifier on substitutions; instead use:

        s/pattern/{ code() }/
  • Modifiers are now placed as adverbs at the start of a match/substitution:

        @matches = m:g:i/\s* (\w*) \s* ,?/;

    Every modifier must start with its own colon. The delimiter must be separated from the final modifier by a colon or whitespace if it would be taken as an argument to the preceding modifier.

  • The single-character modifiers also have longer versions:

            :i        :ignorecase
            :g        :global
  • The :c (or :continue) modifier causes the pattern to continue scanning from the string's current .pos:

        m:c/ pattern /        # start at end of
                              # previous match on $_

    Note that this does not automatically anchor the pattern to the starting location. (Use :p for that.) The pattern you supply to split has an implicit :c modifier.

  • The :p (or :pos) modifier causes the pattern to try to match only at the string's current .pos:

        m:p/ pattern /        # match at end of
                              # previous match on $_

    Since this is implicitly anchored to the position, it's suitable for building parsers and lexers. The pattern you supply to a Perl macro's "is parsed" trait has an implicit :p modifier.

    Note that

        m:c/pattern/

    is roughly equivalent to

        m:p/.*? pattern/
  • The new :once modifier replaces the Perl 5 ?...? syntax:

        m:once/ pattern /    # only matches first time
  • The new :w (:words) modifier causes whitespace sequences to be replaced by a \s* or \s+ subpattern, as defined by the <?ws> rule.

        m:w/ next cmd =   <condition>/

    Same as:

        m/ <?ws> next <?ws> cmd <?ws> = <?ws> <condition>/

    which is effectively the same as:

        m/ \s* next \s+ cmd \s* = \s* <condition>/

    But in the case of

        m:w { (a|\*) (b|\+) }

    or equivalently,

        m { (a|\*) <?ws> (b|\+) }

    <?ws> can't decide what to do until it sees the data. It still does the right thing. If not, define your own <?ws> and :w will use that.

  • New modifiers specify Unicode level:

        m:bytes / .**{2} /       # match two bytes
        m:codes / .**{2} /       # match two codepoints
        m:graphs/ .**{2} /       # match two graphemes
        m:langs / .**{2} /       # match two language dependent chars

    There are corresponding pragmas to default to these levels.

  • The new :perl5 modifier allows Perl 5 regex syntax to be used instead:

        m:perl5/(?mi)^[a-z]{1,2}(?=\s)/

    (It does not go so far as to allow you to put your modifiers at the end.)

  • Any integer modifier specifies a count. What kind of count is determined by the character that follows.

  • If followed by an x, it means repetition. Use :x(4) for the general form. So

        s:4x { (<?ident>) = (\N+) $$}{$1 => $2};

    is the same as:

        s:x(4) { (<?ident>) = (\N+) $$}{$1 => $2};

    which is almost the same as:

        $_.pos = 0;
        s:c{ (<?ident>) = (\N+) $$}{$1 => $2} for 1..4;

    except that the string is unchanged unless all four matches are found. However, ranges are allowed, so you can say :x(1..4) to change anywhere from one to four matches.

  • If the number is followed by an st, nd, rd, or th, it means find the Nth occurrence. Use :nth(3) for the general form. So

        s:3rd/(\d+)/@data[$1]/;

    is the same as

        s:nth(3)/(\d+)/@data[$1]/;

    which is the same as:

        m/(\d+)/ && m:c/(\d+)/ && s:c/(\d+)/@data[$1]/;

    Lists and junctions are allowed: :nth(1|2|3|5|8|13|21|34|55|89).

    So are closures: :nth{.is_fibonacci}

  • With the new :ov (:overlap) modifier, the current rule will match at all possible character positions (including overlapping) and return all matches in a list context, or a disjunction of matches in a scalar context. The first match at any position is returned.

        $str = "abracadabra";
    
        @substrings = $str ~~ m:overlap/ a (.*) a /;
    
        # bracadabr cadabr dabr br
  • With the new :ex (:exhaustive) modifier, the current rule will match every possible way (including overlapping) and return all matches in a list context, or a disjunction of matches in a scalar context.

        $str = "abracadabra";
    
        @substrings = $str ~~ m:exhaustive/ a (.*) a /;
    
        # br brac bracad bracadabr c cad cadabr d dabr br
  • The new :rw modifier causes this rule to "claim" the current string for modification rather than assuming copy-on-write semantics. All the bindings in $/ become lvalues into the string, such that if you modify, say, $1, the original string is modified in that location, and the positions of all the other fields modified accordingly (whatever that means). In the absence of this modifier (especially if it isn't implemented yet, or is never implemented), all pieces of $/ are considered copy-on-write, if not read-only.

  • The new :keepall modifier causes this rule and all invoked subrules to remember everything, even if the rules themselves don't ask for their subrules to be remembered. This is for forcing a grammar that throws away whitespace and comments to keep them instead.

  • The :i, :w, :perl5, and Unicode-level modifiers can be placed inside the rule (and are lexically scoped):

        m/:w alignment = [:i left|right|cent[er|re]] / 
  • User-defined modifiers will be possible:

            m:fuzzy/pattern/;
  • User-defined modifiers can also take arguments:

            m:fuzzy('bare')/pattern/;
  • To use parens or brackets for your delimiters you have to separate:

            m:fuzzy (pattern);
            m:fuzzy:(pattern);

    or you'll end up with:

            m:fuzzy(fuzzyargs); pattern ;
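
The difference between :overlap and :exhaustive can be approximated in Python for experimentation. This is only a sketch under assumptions: the helper names overlap_matches and exhaustive_matches are invented for illustration, Python's re syntax stands in for rule syntax, and "every possible way" is simplified to every distinct start and end position.

```python
import re

def overlap_matches(pattern, text):
    """First (greedy) match at every start position, as :overlap is described."""
    rx = re.compile(pattern)
    results = []
    for start in range(len(text)):
        m = rx.match(text, start)          # anchored at `start`
        if m:
            results.append(m.group(1))
    return results

def exhaustive_matches(pattern, text):
    """One match per (start, end) pair at which the whole pattern fits."""
    rx = re.compile(pattern)
    results = []
    for start in range(len(text)):
        for end in range(start, len(text) + 1):
            m = rx.fullmatch(text, start, end)
            if m:
                results.append(m.group(1))
    return results

text = "abracadabra"
overlap = overlap_matches(r"a(.*)a", text)
exhaustive = exhaustive_matches(r"a(.*)a", text)
print(" ".join(overlap))
print(" ".join(exhaustive))
```

Run on "abracadabra", the two helpers print the same substrings as the comments in the :overlap and :exhaustive examples above.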

Changed metacharacters

  • A dot . now matches any character including newline. (The /s modifier is gone.)

  • ^ and $ now always match the start/end of a string, like the old \A and \z. (The /m modifier is gone.)

  • A $ no longer matches an optional preceding \n so it's necessary to say \n?$ if that's what you mean.

  • \n now matches a logical (platform independent) newline not just \012.

  • The \A, \Z, and \z metacharacters are gone.
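
Python's re module happens to keep the older conventions for . and $, which makes it a convenient place to see what changed. The following is a Python analogy rather than Perl 6 code: re.DOTALL and \Z are Python spellings that behave roughly like the new . and $, and the variable names are illustrative only.

```python
import re

text = "foo\nbar\n"

# Old style: "." stops at a newline unless a flag says otherwise.
old_dot = re.match(r"foo.bar", text)
# New style: "." matches any character, newline included (re.DOTALL here).
new_dot = re.match(r"foo.bar", text, re.DOTALL)

# Old style: "$" tolerates one trailing newline.
old_dollar = re.search(r"bar$", text)
# New style: "$" is an absolute end of string (spelled \Z in Python).
new_dollar = re.search(r"bar\Z", text)
# Saying what you mean: an optional newline, then the true end.
explicit = re.search(r"bar\n?\Z", text)

print(old_dot is None, new_dot is not None)
print(old_dollar is not None, new_dollar is None, repr(explicit.group(0)))
```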

New metacharacters

  • Because /x is default:

    • # now always introduces a comment.

    • Whitespace is now always metasyntactic, i.e. used only for layout and not matched literally (but see the :w modifier described above).

  • ^^ and $$ match line beginnings and endings. (The /m modifier is gone.) They are both zero-width assertions. $$ matches before any \n (logical newline), and also at the end of the string if the final character was not a \n. ^^ always matches the beginning of the string and after any \n that is not the final character in the string.

  • . matches an "anything", while \N matches an "anything except newline". (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.

  • The new & metacharacter separates conjunctive terms. The patterns on either side must match with the same beginning and end point. The operator is list associative like |, and backtracking makes the right argument vary faster than the left.
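
The wording for ^^ and $$ is easy to misread, so here is a small Python sketch that computes the offsets at which each would match. It is written directly from the description above and makes two assumptions: only "\n" is treated as a newline (the text speaks of a logical newline), and the empty string is left out because the description does not spell that case out. The function names are hypothetical.

```python
def line_starts(text):
    """Offsets where ^^ would match, per the description above."""
    starts = [0]                       # the beginning of the string always qualifies
    for i, ch in enumerate(text):
        # after any "\n" that is not the final character
        if ch == "\n" and i + 1 < len(text):
            starts.append(i + 1)
    return starts

def line_ends(text):
    """Offsets where $$ would match, per the description above."""
    ends = [i for i, ch in enumerate(text) if ch == "\n"]   # before any "\n"
    if text and not text.endswith("\n"):
        ends.append(len(text))         # end of string, if it does not end in "\n"
    return ends

print(line_starts("ab\ncd\n"), line_ends("ab\ncd\n"))
print(line_starts("ab\ncd"), line_ends("ab\ncd"))
```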

Bracket rationalization

  • (...) still delimits a capturing group.

  • [...] is no longer a character class. It now delimits a non-capturing group.

  • {...} is no longer a repetition quantifier. It now delimits an embedded closure.

  • You can call Perl code as part of a rule match by using a closure. Embedded code does not usually affect the match--it is only used for side-effects:

        / (\S+) { print "string not blank\n"; $text = $1; }
           \s+  { print "but does contain whitespace\n" }
        /
  • It can affect the match if it calls fail:

        / (\d+) { $1 < 256 or fail } /

    Closures are guaranteed to be called at the canonical time even if the optimizer could prove that something after them can't match. (Anything before is fair game, however.)

  • The repetition specifier is now **{...} for maximal matching, with a corresponding **{...}? for minimal matching. Space is allowed on either side of the asterisks. The curlies are taken to be a closure returning a number or a range.

        / value was (\d ** {1..6}?) with ([\w]**{$m..$n}) /

    It is illegal to return a list, so this easy mistake fails:

        / [foo]**{1,3} /

    (At least, it fails in the absence of "use rx :listquantifier", which is likely to be unimplemented in Perl 6.0.0 anyway).

    The optimizer will likely optimize away things like **{1...} so that the closure is never actually run in that case. But it's a closure that must be run in the general case, so you can use it to generate a range on the fly based on the earlier matching. (Of course, bear in mind the closure is run before attempting to match whatever it quantifies.)

  • <...> are now extensible metasyntax delimiters or "assertions" (i.e. they replace Perl 5's crufty (?...) syntax).
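
One consequence of a closure that calls fail is worth seeing concretely. The Python sketch below imitates / (\d+) { $1 < 256 or fail } / under the assumption that fail triggers ordinary backtracking into the preceding quantifier. The function match_small_number and its limit parameter are invented for illustration; this is not how a rule engine is implemented.

```python
import re

DIGITS = re.compile(r"\d+")

def match_small_number(text, limit=256):
    """Find digits whose value is below `limit`, backtracking on failure."""
    for start in range(len(text)):
        greedy = DIGITS.match(text, start)
        if not greedy:
            continue
        # Try the longest run first, then give back one digit at a time,
        # as a backtracking engine would after the closure fails.
        for end in range(greedy.end(), start, -1):
            candidate = text[start:end]
            if int(candidate) < limit:
                return candidate
    return None

print(match_small_number("port 80"))
print(match_small_number("999"))
```

Under that assumption "999" still matches, as "99", because the quantifier simply gives back a digit. A pattern meant to reject such input outright would need an anchor or a backtracking control after the digits.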

Variable (non-)interpolation

  • In Perl 6 rules, variables don't interpolate.

  • Instead they're passed "raw" to the rule engine, which can then decide how to handle them (more on that below).

  • The default way in which the engine handles a scalar is to match it as a <'...'> literal (i.e. it does not treat the interpolated string as a subpattern). In other words, a Perl 6:

        / $var /

    is like a Perl 5:

        / \Q$var\E /

    (To get rule interpolation use an assertion - see below)

  • An interpolated array:

        / @cmds /

    is matched as if it were an alternation of its elements:

        / [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /

    As with a scalar variable, each one is matched as a literal.

  • An interpolated hash matches the longest possible key of the hash as a literal, or fails if no key matches. (A "" key will match anywhere, provided no longer key matches.)

    • If the corresponding value of the hash element is a closure, it is executed.

    • If it is a string or rule object, it is executed as a subrule.

    • If it has the value 1, nothing special happens beyond the match.

    • Any other value causes the match to fail.
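
The three default behaviours can be imitated in Python as a rough illustration. The helper names literal, alternation and longest_key are hypothetical, and the sketch covers only the matching of keys; what happens with the hash value afterwards (closure, subrule, 1, or failure) is not modelled.

```python
import re

def literal(var):
    """A scalar matches as a literal: metacharacters lose their meaning."""
    return re.escape(var)

def alternation(items):
    """An array matches as an alternation of its elements, each a literal."""
    return "(?:" + "|".join(re.escape(item) for item in items) + ")"

def longest_key(table, text, pos=0):
    """Return the longest key of `table` found at `pos`, or None if none matches."""
    for key in sorted(table, key=len, reverse=True):
        if text.startswith(key, pos):
            return key
    return None

print(re.search(literal("a.c"), "abc"))            # None: the dot is literal
print(bool(re.fullmatch(alternation(["ls", "cd", "rm -r"]), "rm -r")))
print(longest_key({"in": 1, "int": 1, "": 1}, "integer"))
```

Sorting the keys by length before testing them is one simple way to get "longest key wins"; the empty key then acts as the fallback described above.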

Extensible metasyntax (<...>)

  • The first character after < determines the behaviour of the assertion.

  • A leading alphabetic character means it's a grammatical assertion (i.e. a subpattern or a named character class - see below):

        / <sign>? <mantissa> <exponent>? /
  • The special named assertions include:

        / <before pattern> /    # was /(?=pattern)/
        / <after pattern> /     # was /(?<=pattern)/
    
        / <?ws> /                # match whitespace by :w rules
    
        / <?sp> /                # match a space char

    The after assertion implements lookbehind by reversing the syntax tree and looking for things in the opposite order going to the left. It is illegal to do lookbehind on a pattern that cannot be reversed.

  • A leading $ indicates an indirect rule. The variable must contain either a hard reference to a rule, or a string containing the rule.

  • A leading :: indicates a symbolic indirect rule:

        / <::($somename)> /

    The variable must contain the name of a rule.

  • A leading @ matches like a bare array except that each element is treated as a rule (string or hard ref) rather than as a literal.

  • A leading % matches like a bare hash except that each key is treated as a rule (string or hard ref) rather than as a literal.

  • A leading { indicates code that produces a rule to be interpolated into the pattern at that point:

        / (<?ident>)  <{ %cache{$1} //= get_body($1) }> /

    The closure is guaranteed to be run at the canonical time.

  • A leading & interpolates the return value of a subroutine call as a rule. Hence

        <&foo()>

    is short for

        <{ foo() }>
  • In any case of rule interpolation, if the value already happens to be a rule object, it is not recompiled. If it is a string, the compiled form is cached with the string so that it is not recompiled next time you use it unless the string changes. (Any external lexical variable names must be rebound each time though.) Rules may not be interpolated with unbalanced bracketing. An interpolated subrule keeps its own inner $/, so its parentheses never count toward the outer rule's groupings. (In other words, parenthesis numbering is always lexically scoped.)

  • A leading ( indicates a code assertion:

        / (\d**{1..3}) <( $1 < 256 )> /

    Similar to:

        / (\d**{1..3}) { $1 < 256 or fail } /

    Unlike closures, code assertions are not guaranteed to be run at the canonical time if the optimizer can prove something later can't match. So you can sneak in a call to a non-canonical closure that way:

        /^foo .* <( do { say "Got here!" } or 1 )> .* bar$/

    The do block is unlikely to run unless the string ends with "bar".

  • A leading [ or + indicates an enumerated character class:

        / <[a-z_]>* /
        / <+[a-z_]>* /

    The + is required if starting with an angled character class:

        / <+<alpha>> /

    Without the plus it would be interpreted as

        / <?alpha> /

    which means something else (see below).

  • A leading - indicates a complemented character class:

        / <-[a-z_]> <-<alpha>> /
  • A leading ' indicates a literal match (including whitespace):

        / <'match this exactly (whitespace matters)'> /
  • A leading " indicates a literal match after interpolation:

        / <"match $THIS exactly (whitespace still matters)"> /
  • The special assertion <.> matches any logical grapheme (including a Unicode combining character sequence):

        / seekto = <.> /  # Maybe a combined char

    Same as:

        / seekto = [:graphs .] /
  • A leading ! indicates a negated meaning (always a zero-width assertion):

        / <!before _ > /    # We aren't before an _
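
For readers more familiar with the older (?...) notation, the zero-width assertions above can be tried out in Python, whose re module still uses that notation. This is an analogy only: the patterns and variable names are made up, and Python's syntax stands in for before, after and !before.

```python
import re

# Like "before": succeed only if "bar" follows, without consuming it.
before = re.search(r"foo(?=bar)", "foobar")

# Like "after": succeed only if "foo" precedes.
after = re.search(r"(?<=foo)bar", "foobar")

# Like "!before _": a letter that is not followed by an underscore.
not_before = re.findall(r"[a-z](?!_)", "a_b c")

print(before.group(0), after.group(0), not_before)
```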

Backslash reform

  • The \p and \P properties become intrinsic grammar rules (<prop ...> and <!prop ...>).

  • The \L...\E, \U...\E, and \Q...\E sequences are gone. In the rare cases that need them you can use <{ lc $rule }> etc.

  • The \G sequence is gone. Use :p instead. (Note, however, that it makes no sense to use :p within a pattern, since every internal pattern is implicitly anchored to the current position. You'll have to explicitly compare <( .pos == $oldpos )> in that case.)

  • Backreferences (e.g. \1) are gone; $1 can be used instead, because variables are no longer interpolated.

  • New backslash sequences, \h and \v, match horizontal and vertical whitespace respectively, including Unicode.

  • \s now matches any Unicode whitespace character.

  • The new backslash sequence \N matches anything except a logical newline; it is the negation of \n.

  • A series of other new capital backslash sequences are also the negation of their lower-case counterparts:

    • \H matches anything but horizontal whitespace.

    • \V matches anything but vertical whitespace.

    • \T matches anything but a tab.

    • \R matches anything but a return.

    • \F matches anything but a formfeed.

    • \E matches anything but an escape.

    • \X... matches anything but the specified hex character.
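
The replacement of \G by :p is easiest to appreciate in a small lexer. The Python sketch below is an analogy, not Perl 6: Pattern.match with a start offset plays the role of :p (match only at the current position) and Pattern.search with a start offset plays the role of :c (scan forward from it). The token set and the tokenize function are invented for illustration.

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+|[-+*/=]")
SPACE = re.compile(r"\s*")

def tokenize(text):
    """Split `text` into tokens, matching only at the current position."""
    pos, tokens = 0, []
    while True:
        pos = SPACE.match(text, pos).end()     # skip whitespace at pos
        if pos == len(text):
            return tokens
        m = TOKEN.match(text, pos)             # anchored at pos, like :p
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(0))
        pos = m.end()

sample = "  ?? 7"
anchored = TOKEN.match(sample, 0)     # like :p: must match right at offset 0
scanning = TOKEN.search(sample, 0)    # like :c: may skip ahead from offset 0

print(tokenize("x = 42 + y"))
print(anchored, scanning.group(0))
```

The anchored form is what keeps the lexer honest: with the scanning form, the unexpected "??" would be skipped silently instead of being reported.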

Regexes are rules

  • The Perl 5 qr/pattern/ regex constructor is gone.

  • The Perl 6 equivalents are:

        rule { pattern }    # always takes {...} as delimiters
        rx / pattern /      # can take (almost any) chars as delimiters

    You may not use whitespace or alphanumerics for delimiters. Space is optional unless needed to distinguish from modifier arguments or function parens. So you may use parens as your rx delimiters, but only if you interpose a colon or whitespace:

        rx:( pattern )      # okay
        rx ( pattern )      # okay
        rx( 1,2,3 )         # tries to call rx function
  • If either form needs modifiers, they go before the opening delimiter:

        $rule = rule :g:w:i { my name is (.*) };
        $rule = rx:g:w:i / my name is (.*) /;

    Space or colon is necessary after the final modifier if you use any bracketing character for the delimiter. (Otherwise it would be taken as an argument to the modifier.)

  • You may not use colons for the delimiter. Space is allowed between modifiers:

        $rule = rx :g :w :i / my name is (.*) /;
  • The name of the constructor was changed from qr because it's no longer an interpolating quote-like operator. rx stands for "rule expression", or occasionally "regex". :-)

  • As the syntax indicates, it is now more closely analogous to a sub {...} constructor. In fact, that analogy will run very deep in Perl 6.

  • Just as a raw {...} is now always a closure (which may still execute immediately in certain contexts and be passed as a reference in others), so too a raw /.../ is now always a rule (which may still match immediately in certain contexts and be passed as a reference in others).

  • Specifically, a /.../ matches immediately in a value context (void, Boolean, string, or numeric), or when it is an explicit argument of a ~~. Otherwise it's a rule constructor. So this:

        $var = /pattern/;

    no longer does the match and sets $var to the result. Instead it assigns a rule reference to $var.

  • The two cases can always be distinguished using m{...} or rx{...}:

        $var = m{pattern};    # Match rule immediately, assign result
        $var = rx{pattern};   # Assign rule expression itself
  • Note that this means that former magically lazy usages like:

        @list = split /pattern/, $str;

    are now just consequences of the normal semantics.

  • It's now also possible to set up a user-defined subroutine that acts like grep:

        sub my_grep($selector, *@list) {
            given $selector {
                when Rule  { ... }
                when Code  { ... }
                when Hash  { ... }
                # etc.
            }
        }

    Using {...} or /.../ in the scalar context of the first argument causes it to produce a Code or Rule reference, which the switch statement then selects upon.
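
The same dispatch-on-selector idea can be sketched in Python, where a compiled pattern is likewise just another value that can be passed around. This is an illustrative analogy: the bodies of the three branches are elided in the example above, so the behaviour chosen here (pattern search, predicate call, key membership) is an assumption, not the original's definition.

```python
import re

def my_grep(selector, *items):
    """Filter `items` by a selector whose type decides how it is applied."""
    if isinstance(selector, re.Pattern):       # a compiled pattern
        return [x for x in items if selector.search(x)]
    if callable(selector):                     # a predicate function
        return [x for x in items if selector(x)]
    if isinstance(selector, dict):             # assumed: keep items that are keys
        return [x for x in items if x in selector]
    raise TypeError(f"unsupported selector: {type(selector).__name__}")

print(my_grep(re.compile(r"^a"), "apple", "banana", "avocado"))
print(my_grep(str.isdigit, "1", "x", "22"))
print(my_grep({"x": 1}, "x", "y"))
```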

Backtracking control

  • Backtracking over a single colon causes the rule engine not to retry the preceding atom:

        m:w/ \( <expr> [ , <expr> ]* : \) /

    (i.e. there's no point trying fewer <expr> matches, if there's no closing parenthesis on the horizon)

  • Backtracking over a double colon causes the surrounding group of alternations to immediately fail:

        m:w/ [ if :: <expr> <block>
             | for :: <list> <block>
             | loop :: <loop_controls>? <block>
             ]
        /

    (i.e. there's no point trying to match a different keyword if one was already found but failed).

  • Backtracking over a triple colon causes the current rule to fail outright (no matter where in the rule it occurs):

        rule ident {
              ( [<alpha>|_] \w* ) ::: { fail if %reserved{$1} }
            | " [<alpha>|_] \w* "
        }
    
        m:w/ get <ident>? /

    (i.e. using an unquoted reserved word as an identifier is not permitted)

  • Backtracking over a <commit> assertion causes the entire match to fail outright, no matter how many subrules down it happens:

        rule subname {
            ([<alpha>|_] \w*) <commit> { fail if %reserved{$1} }
        }
        m:w/ sub <subname>? <block> /

    (i.e. using a reserved word as a subroutine name is instantly fatal to the "surrounding" match as well)

  • A <cut> assertion always matches successfully, and has the side effect of deleting the parts of the string already matched.

  • Attempting to backtrack past a <cut> causes the complete match to fail (like backtracking past a <commit>). This is because there's now no preceding text to backtrack into.

  • This is useful for throwing away successfully processed input when matching from an input stream or an iterator of arbitrary length.

Named Regexes

  • The analogy between sub and rule extends much further.

  • Just as you can have anonymous subs and named subs...

  • ...so too you can have anonymous rules and named rules:

        rule ident { [<alpha>|_] \w* }
    
        # and later...
    
        @ids = grep /<ident>/, @strings;
  • As the above example indicates, it's possible to refer to named rules, such as:

        rule serial_number { <[A-Z]> \d**{8} }
        rule type { alpha | beta | production | deprecated | legacy }

    in other rules as named assertions:

        rule identification { [soft|hard]ware <type> <serial_number> }
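
Composition of named rules can be approximated in Python by building a pattern from named fragments. This is a sketch only: Python has no named rules, so the RULES table and the use of named groups to stand in for the automatic captures are assumptions made for illustration. Because whitespace in the rule is layout only (no :w is in effect), the pieces are matched back to back.

```python
import re

# Each "rule" is a pattern fragment stored under a name.
RULES = {
    "serial_number": r"[A-Z]\d{8}",
    "type": r"(?:alpha|beta|production|deprecated|legacy)",
}

# The two named assertions become named groups built from the fragments.
IDENTIFICATION = re.compile(
    r"(?:soft|hard)ware"
    rf"(?P<type>{RULES['type']})"
    rf"(?P<serial_number>{RULES['serial_number']})"
)

m = IDENTIFICATION.fullmatch("hardwarebetaA12345678")
print(m.group("type"), m.group("serial_number"))
```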

Nothing is illegal

  • The null pattern is now illegal.

  • To match whatever the prior successful rule matched, use:

        /<prior>/
  • To match the zero-width string, use:

        /<null>/

    For example:

        split /<?null>/, $string

    splits between characters.

  • To match a null alternative, use:

        /a|b|c|<?null>/

    This makes it easier to catch errors like this:

        m:w/ [
             | if :: <expr> <block>
             | for :: <list> <block>
             | loop :: <loop_controls>? <block>
             ]
        /
  • However, it's okay for a non-null syntactic construct to have a degenerate case matching the null string:

        $something = "";
        /a|b|c|$something/;

Rule-scoped variables

  • Within a rule, the current state of the match is called $/.

  • The $/ object contains all the current bindings of any submatches, so it includes $0, $1, $2, etc.

  • The $/ object also contains a hash of named variables lexically scoped to the current rule. These may be named by hash subscripting: $/{'foo'} or $/<foo>. For readability, in the latter form the $/ may be replaced with an appropriate sigil: $<foo>.

Hypothetical variables

  • In embedded closures it's possible to bind a variable to a value that only "sticks" if the surrounding pattern successfully matches.

  • As with temp variables, a hypothetical variable is set with the keyword let followed by binding or assignment of the desired value:

        my $num;
        / (\d+) {let $num := $1} (<alpha>+)/

    (The value bound or assigned need not be a part of the string being searched.)

  • Now $num will only be bound to $1 if the digits are actually found. (Within a rule it's generally better to use binding rather than assignment to avoid excessive copying of data values you might later just throw away.)

  • You may also set rule-scoped variables hypothetically:

        / (\d+) {let $<num> := $1} (<alpha>+)/

    In this case the variable lives in $/ rather than the surrounding lexical pad.

  • If the match ever backtracks past the closure (i.e. if there are no alphabetics following), the binding is "undone", and the variable reverts to its previous value (or lack thereof).

  • This is even more interesting in alternations:

        / [ (\d+)      { let $<num>   := $1 }
          | (<alpha>+) { let $<alpha> := $2 }
          | (.)        { let $<other> := $3 }
          ]
        /
  • There is also a shorthand for binding to variables hypothetically:

        / [ $<num>  := (\d+)
          | $<alpha>:= (<alpha>+)
          | $<other>:= (.)
          ]
        /

    Either way you write it, if an alternative is bypassed, a variable that has never been bound to a value evaluates to undefined.

  • There is no shorthand for assignment.

  • The binding shorthand is right associative just like the ordinary operator.

        our $cheater;
        sub foo { say $cheater }
        / $cheater := $<capture> := (\d+) { foo() } /
  • When implicitly bound by captures, the numeric variables ($1, $2, etc.) are automatically bound "hypothetically", as if you'd said

        { let $1 := "string" }

    If you explicitly say

        { $1 := "string" }

    the variable is not bound hypothetically, so $1 persists as long as the current match object persists, even if this closure is backtracked over. However, since $/ is itself a hypothetical variable, the lifetime of any numbered variable is somewhat hypothetical in any case, scoped to some outer calling context, either a calling pattern, or the current non-pattern scope that invoked this pattern. This implies that, outside the entire match, a failed, false $/ can return a defined, non-hypothetically bound $1, up till the time the $/ object is itself destroyed.

  • Numeric variables can be bound, and even re-ordered:

        my ($key, $val) = m:w{ $1:=(\w+) =\> $2:=(.*?)
                             | $2:=(.*?) \<= $1:=(\w+)
                             };

    If you bind any numeric variables, you have to set them all, because automatic binding of numeric variables is disabled elsewhere in the rule. So the example above doesn't automatically set $3 or $4.

  • You may also bind the numeric variables explicitly in closures, in which case the normal numbering is not disabled:

        my ($key, $val) = m:w{ (\w+) =\> (.*?)
                             | (.*?) \<= (\w+) { ($2,$1) := ($3,$4) }
                             };

    But you have to be careful to do things in the right order to avoid clobbering a value you're going to want later.

  • Repeated captures can be bound to arrays:

        / @<values> := [ (.*?) , ]* /
  • Pairs of repeated captures can be bound to hashes:

        / %<options> := [ (<?ident>) = (\N+) ]* /
  • Or just capture the keys (and leave the values undef):

        / %<options> := [ (<?ident>) = \N+ ]* /
  • As a general rule, if you capture anything repeating with a scalar destination, you end up with a list reference as the scalar value (even if in the actual case only one thing matches). This always puts a list into the scalar:

        / $<foo> := [ (.*?) , ]* /

    And this puts a list of lists:

        / $<bar> := [ <ident> = (\N+) ]* /
  • By default, any assertion beginning with an alphabetic identifier (e.g. <rule ...>) captures its results into a variable of that name. To suppress capture, precede the identifier with a question mark:

        / <key> <?ws> =\> <?ws> <value> { %hash{$<key>} = $<value> } /

    If you have multiple rules of the same name, they are by default captured as a list under that name. You may bind them to separate names if you wish.

  • All rules remember everything if :keepall is in effect anywhere in the outer dynamic scope. In this case everything inside the angles is used as part of the key. Suppose the earlier example parsed whitespace:

        / <key> <?ws> <'=>'> <?ws> <value> { %hash{$<key>} = $<value> } /

    The two instances of <?ws> above would store an array of two values accessible as @<?ws>. It would also store the literal match into $<'=\>'>. Just to make sure nothing is forgotten, under :keepall any text or whitespace not otherwise remembered is attached as an extra property on the subsequent node. (The name of that property is "pretext".)
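
The remark that a bypassed alternative leaves its variable undefined has a close counterpart in Python, where a named group that took no part in the match reports None. The sketch below is an analogy to the three-way alternation shown earlier in this section; the pattern CLASSIFY and the helper classify are hypothetical names, and Python's groups are not hypothetical variables in the sense described above.

```python
import re

CLASSIFY = re.compile(r"(?:(?P<num>\d+)|(?P<alpha>[A-Za-z]+)|(?P<other>.))")

def classify(text):
    """Return only the bindings that were actually made."""
    m = CLASSIFY.match(text)
    if m is None:
        return {}
    return {name: value
            for name, value in m.groupdict().items()
            if value is not None}

print(CLASSIFY.match("42").groupdict())   # unbound names show up as None
print(classify("abc"), classify("?"))
```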

Return values from matches

  • A match always returns a "match object", which is also available as (lexical) $/ (except within a closure lexically embedded in a rule, where $/ always refers to the current match, not any submatch done within the closure).

  • The match object evaluates differently in different contexts:

    • In boolean context it evaluates as true or false (i.e. did the match succeed?):

          if /pattern/ {...}
          # or:
          /pattern/; if $/ {...}
    • In numeric context it evaluates to the number of matches:

          $match_count += m:g/pattern/;
    • In string context it evaluates to $0, the entire matched string:

          print %hash{"{$text ~~ /<?ident>/}"};
          # or equivalently: 
          $text ~~ /<?ident>/  &&  print %hash{~$/};

      But generally you should say $0 if you mean $0.

    • When used as an array, $/ pretends to be an array containing $1, $2, etc. Hence

          ($key, $val) = m:w/ (\S+) => (\S+)/;

      can also be written:

          $result = m:w/ (\S+) => (\S+)/;
          ($key, $val) = @$result;

      For this reason $0 is not returned as part of the list (unless there are no other captures).

      To get a single capture into a string, use a subscript:

          $mystring = "{ m:w/ (\S+) => (\S+)/[0] }";

      To get all the captures into a string, use a "zen" slice:

          $mystring = "{ m:w/ (\S+) => (\S+)/[] }";

      Note that, as a scalar variable, $/ doesn't automatically flatten in list context. Use @$/ or $/[] to flatten as an array.

    • When used as a hash, $/ pretends to be a hash of all the named captures. The keys do not include any sigils, so if you capture to variable @<foo> its real name is $/{'foo'} or $/<foo>. However, you may still refer to it as @<foo> anywhere $/ is visible. (But it is erroneous to use the same name for two different capture datatypes.)

      Note that, as a scalar variable, $/ doesn't automatically flatten in list context. Use %$/ or $/{} to flatten as a hash, or bind it to a variable of the appropriate type.

    • The numbered captures may be treated as named, so $<1 2 3> is equivalent to $/[0,1,2]. This allows you to write slices of intermixed named and numbered captures. $0 and $<0> are the same thing, but there is no way to get at $0 via the array subscript notation. (Saying $/[-1] returns the final capture, not $0.)

    • Note that $<0> is not merely a recursive reference to $/, because they are of different types. $/ is a polymorphic match object, while $0 is a string (presuming you're matching a string). The two objects take different sets of methods.

    • In ordinary code, variables $1, $2, etc. are just aliases into $/[0], $/[1], etc. Hence they will all be undefined if the last match failed (unless they were explicitly bound in a closure without using the let keyword).

  • Within a rule, $/ acts like a hypothetical variable.

  • It controls what a rule match returns (like $$ does in yacc)

  • Use $0:= to override the default return behaviour described above:

        rule string1 { (<["'`]>) ([ \\. | <-[\\]> ]*?) $1 }
    
        $match = m/<string1>/;  # default: $match includes 
                                # opening and closing quotes
    
    
        rule string2 { (<["'`]>) $0:=([ \\. | <-[\\]> ]*?) $1 }
    
        $match = m/<string2>/;  # $match now excludes quotes
                                # because $0 explicitly bound 
                                # to second capture only

    This influences the string returned by $0, as well as where the rule thinks it started and stopped matching. (That is, .pos is set to the right end of the forced $0.) However, all the numbered and named fields are available through $/ just as if the entire match had been returned for $0. So the quote character still shows up in $1.

  • When binding $0 in a closure, it is syntactically valid to bind anything. However, it is potentially erroneous to bind $0 to anything that is not part of the string being matched, since it might confuse .pos completely. On the other hand, maybe that's a convenient way to redirect the match to continue in a different string entirely.

    The $/ variable may not be rebound within a rule.
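
The string1/string2 contrast above can be loosely modelled in a conventional regex engine. The following is only an analogy in Python's re module, not Perl 6 semantics: group 0 stands in for the default $0, and returning group 2 stands in for binding $0 to the second capture. The function names and the sample text are assumptions made for illustration.

```python
import re

# Rough analogue of rule string1: an opening quote, a lazily matched body
# that may contain backslash escapes, then the same quote again (\1).
STRING = re.compile(r"""(["'`])((?:\\.|[^\\])*?)\1""")

def match_with_quotes(text):
    """Model of the default behaviour: the result spans the whole match."""
    m = STRING.search(text)
    return m.group(0) if m else None

def match_without_quotes(text):
    """Model of binding $0 to the second capture only."""
    m = STRING.search(text)
    return m.group(2) if m else None

sample = 'say "hello world" now'
with_quotes = match_with_quotes(sample)
without_quotes = match_without_quotes(sample)

# The quote character is still captured even when the result excludes it.
quote_char = STRING.search(sample).group(1)
```

In this model the narrowed result changes only what is returned; the first capture group remains available, much as the quote character still shows up in $1.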

Matching against non-strings

  • Anything that can be tied to a string can be matched against a rule. This feature is particularly useful with input streams:

        my $stream is from($fh);       # tie scalar to filehandle
    
        # and later...
    
        $stream ~~ m/pattern/;         # match from stream

    An array can be matched against a rule. The special <,> rule matches the boundary between elements. If the array elements are strings, they are concatenated virtually into a single logical string. If the array elements are tokens or other such objects, the objects must provide appropriate methods for the kinds of rules to match against. It is an assertion error to match a string-matching assertion against an object that doesn't provide a string view. However, pure token objects can be parsed as long as the match rule restricts itself to assertions like:

        <.isa(Dog)>
        <.does(Bark)>
        <.can('scratch')>

    It is permissible to mix tokens and strings in an array as long as they're in different elements. You may not embed objects in strings, however.

    To match against each element of an array, use a hyper operator:

        @array».match($rule)

Grammars

  • Your private ident rule shouldn't clobber someone else's ident rule. So some mechanism is needed to confine rules to a namespace.

  • If subs are the model for rules, then modules/classes are the obvious model for aggregating them. Such collections of rules are generally known as "grammars".

  • Just as a class can collect named actions together:

        class Identity {
            method name { "Name = $.name" }
            method age  { "Age  = $.age"  }
            method addr { "Addr = $.addr" }
    
            method desc {
                print .name(), "\n",
                      .age(),  "\n",
                      .addr(), "\n";
            }
    
            # etc.
        }

    so too a grammar can collect a set of named rules together:

        grammar Identity {
            rule name :w { Name = (\N+) }
            rule age  :w { Age  = (\d+) }
            rule addr :w { Addr = (\N+) }
            rule desc {
                <name> \n
                <age>  \n
                <addr> \n
            }
    
            # etc.
        }

  • Like classes, grammars can inherit:

        grammar Letter {
            rule text     { <greet> <body> <close> }
    
            rule greet :w { [Hi|Hey|Yo] $to:=(\S+?) , $$}
    
            rule body     { <line>+ }
    
            rule close :w { Later dude, $from:=(.+) }
    
            # etc.
        }
    
        grammar FormalLetter is Letter {
    
            rule greet :w { Dear $to:=(\S+?) , $$}
    
            rule close :w { Yours sincerely, $from:=(.+) }
    
        }

  • Just like the methods of a class, the rule definitions of a grammar are inherited (and polymorphic!). So there's no need to respecify body, line, etc.

  • Perl 6 will come with at least one grammar predefined:

        grammar Perl {    # Perl's own grammar
    
            rule prog { <statement>* }
    
            rule statement { <decl>
                      | <loop>
                      | <label> [<cond>|<sideff>|;]
            }
    
            rule decl { <sub> | <class> | <use> }
    
            # etc. etc. etc.
        }

  • Hence:

        given $source_code {
            $parsetree = m/<Perl.prog>/;
        }
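
The inheritance of rules between Letter and FormalLetter can be loosely modelled with ordinary classes whose attributes hold pattern fragments. The sketch below is an analogy in Python, not an implementation of Perl 6 grammars: the class layout, the text() helper, the simplified patterns and the sample letters are all assumptions made for illustration. The group is called sender rather than from purely to avoid the Python keyword.

```python
import re

class Letter:
    # Each "rule" is a regex fragment stored as a class attribute.
    greet = r"(?:Hi|Hey|Yo) (?P<to>\S+?),\n"
    body = r"(?P<body>(?:.+\n)+?)"
    close = r"Later dude, (?P<sender>.+)"

    @classmethod
    def text(cls):
        # Composed from whichever greet/body/close the calling class
        # provides, so overridden rules are picked up polymorphically.
        return re.compile(cls.greet + cls.body + cls.close)

class FormalLetter(Letter):
    # Only greet and close are respecified; body and text are inherited.
    greet = r"Dear (?P<to>\S+?),\n"
    close = r"Yours sincerely, (?P<sender>.+)"

casual_text = "Hey Larry,\nSee you at the summit.\nLater dude, Damian"
formal_text = "Dear Damian,\nThe summit needs sponsors.\nYours sincerely, Allison"

casual = Letter.text().match(casual_text)
formal = FormalLetter.text().match(formal_text)
```

Calling text() on the subclass reuses the inherited composition but substitutes the overridden greet and close, which is the behaviour the Letter example relies on.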

Syntactic categories

For writing your own backslash and assertion rules or macros, you may use the following syntactic categories:

    rule rxbackslash:<w> { ... }    # define your own \w and \W
    rule rxassertion:<*> { ... }    # define your own <*stuff>
    macro rxmetachar:<,> { ... }    # define a new metacharacter
    macro rxmodinternal:<x> { ... } # define your own /:x() stuff/
    macro rxmodexternal:<x> { ... } # define your own m:x()/stuff/

As with any such syntactic shenanigans, the declaration must be visible in the lexical scope to have any effect. It's possible the internal/external distinction is just a trait, and that some of those things are subs or methods rather than rules or macros. (The numeric rxmods are recognized by fallback macros defined with an empty operator name.)

Pragmas

The rx pragma may be used to control various aspects of regex compilation and usage not otherwise provided for.

Transliteration

  • The tr/// quote-like operator now also has a method form called trans(). Its argument is a list of pairs. You can use anything that produces a pair list:

        $str.trans( %mapping.pairs.sort );

    Use the .= form to do a translation in place:

        $str.=trans( %mapping.pairs.sort );

  • The two sides of any pair can be strings interpreted as tr/// would:

        $str.=trans( 'A-C' => 'a-c', 'XYZ' => 'xyz' );

    As a degenerate case, each side can be a single character:

        $str.=trans( 'A'=>'a', 'B'=>'b', 'C'=>'c' );

  • The two sides of each pair may also be array references:

        $str.=trans( ['A'..'C'] => ['a'..'c'], <X Y Z> => <x y z> );

  • The array version can map one-or-more characters to one-or-more characters:

        $str.=trans( [' ',      '<',    '>',    '&'    ] =>
                     ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);

    If more than one sequence of input characters matches at a given position, the longest one wins. If two sequences are identical, the first in order wins.

  • There are also method forms of m// and s//:

        $str.match(//);
        $str.subst(//, "replacement");
        $str.subst(//, {"replacement"});
        $str.=subst(//, "replacement");
        $str.=subst(//, {"replacement"});
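
The tie-breaking rule for the array form of trans() (longest input sequence wins, then first in order) can be sketched in Python as follows. This is a model of the described behaviour only, not the Perl 6 implementation: the function name, its list-of-tuples argument and the left-to-right scanning strategy are assumptions, and tr///-style ranges such as 'A-C' are not handled.

```python
def trans(text, pairs):
    """Replace input sequences by their outputs, scanning left to right.

    `pairs` is a list of (source, replacement) tuples. At each position the
    longest matching source wins; among identical sources the first wins.
    """
    # sorted() is stable, so sources of equal length (and therefore
    # identical sources) keep their original relative order.
    ordered = sorted(pairs, key=lambda pair: -len(pair[0]))
    out = []
    i = 0
    while i < len(text):
        for source, replacement in ordered:
            if source and text.startswith(source, i):
                out.append(replacement)
                i += len(source)
                break
        else:
            # No source sequence matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

escaped = trans("a < b & c",
                [(" ", "&nbsp;"), ("<", "&lt;"), (">", "&gt;"), ("&", "&amp;")])
```

Because replacements are appended to the output and never rescanned, the & inside an emitted &nbsp; is not escaped a second time in this model.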
