NAME

Regexp::English - Perl module to create regular expressions more verbosely

SYNOPSIS

        use Regexp::English;

        my $re = Regexp::English
                -> start_of_line
                -> literal('Flippers')
                -> literal(':')
                -> optional
                        -> whitespace_char
                -> end
                -> remember
                        -> multiple
                                -> digit;

        while (<INPUT>) {
                if (my $match = $re->match($_)) {
                        print "$match\n";
                }
        }

DESCRIPTION

Regexp::English provides an alternate regular expression syntax, one that is slightly more verbose than the standard mechanisms. In addition, it adds a few convenient features, like incremental expression building and bound captures.

You can access almost every regular expression feature available in Regexp::English through a method, though some are also (or only) available as functions. These methods fall into several categories: characters, quantifiers, groupings, and miscellaneous. The division wouldn't be so rough if the latter had a better name.

All methods return the Regexp::English object, so you can chain method calls as in the example above. Though there is a new() method, you can use any character method, or remember(), to create an object.

To perform a match, use the match() method. Alternately, if you use a Regexp::English object as if it were a compiled regular expression, the module will automatically compile it behind the scenes.
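
For comparison, here is a plain-Perl sketch of a hand-written pattern that should behave roughly like the chain in the SYNOPSIS. The exact pattern the module generates may differ (for instance in how it groups and escapes things), and the sample lines are hypothetical stand-ins for whatever INPUT contains.

```perl
use strict;
use warnings;

# Assumed rough equivalent of the SYNOPSIS chain: start of line, the
# literal "Flippers:", an optional whitespace character, then one or
# more digits, captured.
my $flippers_re = qr/^Flippers:\s?(\d+)/;

# Hypothetical sample input, standing in for lines read from INPUT.
my @lines = ( 'Flippers: 12', 'Flippers:7', 'Bumpers: 3' );

my @found;
for my $line (@lines) {
    # In list context the match returns its captures, or an empty
    # list when it fails, so the assignment doubles as the test.
    if ( my ($count) = $line =~ $flippers_re ) {
        push @found, $count;
        print "$count\n";
    }
}
```

The third line does not start with "Flippers", so only the first two contribute a capture.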

Characters

Character methods correspond to standard regular expression characters and metacharacters, for the most part. As a little bit of syntactic sugar, most of these methods have plurals, negations, and negated plurals. The list below makes this clearer. Though the point of these is to be available as calls on a new Regexp::English object while building up larger regular expressions, you may also use them as class methods to access regular expression atoms which you then use in larger regular expressions. This isn't entirely pretty, but it ought to work just about everywhere.

  • literal( $string )

    Matches the provided literal string. This method passes $string through quotemeta() automatically. If you receive strange results, it's probably because of this.

  • class( @characters )

    Creates and matches a character class of the provided @characters. Note that there is currently no validation of the character class, so you can create an uncompilable regular expression if you're not careful.

  • word_char()

    Matches any word character, respecting the current locale. By default, this matches alphanumerics and the underscore, corresponding to the \w token.

  • word_chars()

    Matches at least one word character.

  • non_word_char()

    Matches any non-word character.

  • non_word_chars()

    Matches at least one non-word character.

  • whitespace_char()

    Matches any whitespace character, corresponding to the \s token.

  • whitespace_chars()

    Matches at least one whitespace character.

  • non_whitespace_char()

    Matches a single non-whitespace character.

  • non_whitespace_chars()

    Matches at least one non-whitespace character.

  • digit()

    Matches any numeric digit, corresponding to the \d token.

  • digits()

    Matches at least one numeric digit.

  • non_digit()

    Matches a character that is not a digit.

  • non_digits()

    Matches at least one non-digit character.

  • tab()

    Matches a tab character (\t).

  • tabs()

    Matches at least one tab character.

  • non_tab()

    Matches any character that is not a tab.

  • newline()

    Matches a newline character (\n). This implies the /s modifier.

  • newlines()

    Matches at least one newline character. This also implies the /s modifier.

  • non_newline()

    Matches any character that is not a newline.

  • carriage_return()

    Matches a carriage return character (\r).

  • carriage_returns()

    Matches at least one carriage return character.

  • non_carriage_return()

    Matches any character that is not a carriage return.

  • form_feed()

    Matches a form feed character (\f).

  • form_feeds()

    Matches at least one form feed character.

  • non_form_feed()

    Matches any character that is not a form feed character.

  • alarm()

    Matches an alarm character (\a).

  • alarms()

    Matches at least one alarm character.

  • non_alarm()

    Matches anything but an alarm character.

  • escape()

    Matches an escape character (\e).

  • escapes()

    Matches at least one escape character.

  • non_escape()

    Matches a single non-escape character.

  • start_of_line()

    Matches the start of a line, just like the ^ anchor.

  • beginning_of_string()

    Matches the beginning of a string, much like the ^ anchor.

  • end_of_line()

    Matches the end of a line, just like the $ anchor.

  • end_of_string()

    Matches the end of a string, much like the $ anchor, treating newlines appropriately depending on the /s or /m modifier.

  • very_end_of_string()

    Matches the very end of a string, just as the \z token. This does not ignore a trailing newline (if it exists).

  • end_of_previous_match()

    Matches the point at which a previous match ended, in a globally-matched (/g) regular expression. This corresponds to the \G token and relates to pos().

  • word_boundary()

    Matches the zero-width boundary between a word character and a non-word character, corresponding to the \b token.

  • non_word_boundary()

    Matches anything that is not a word boundary.
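
A few of the entries above are easier to see in plain Perl. The following is a minimal sketch that uses core regular expression features only, not Regexp::English itself; it shows what quotemeta() escaping does to a literal, how \z treats a trailing newline (with Perl's \Z shown for contrast), and how \G continues from pos(). The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# literal() is documented to pass its argument through quotemeta(), so
# metacharacters such as "." and "*" match only themselves.
my $escaped      = quotemeta('a.b*');    # the string a\.b\*
my $literal_hit  = 'a.b*' =~ /\A$escaped\z/ ? 1 : 0;
my $literal_miss = 'axbb' =~ /\A$escaped\z/ ? 1 : 0;    # would match unescaped

# \z (very_end_of_string) does not ignore a trailing newline;
# Perl's \Z anchor, by contrast, allows one.
my $text  = "end\n";
my $Z_hit = $text =~ /end\Z/ ? 1 : 0;
my $z_hit = $text =~ /end\z/ ? 1 : 0;

# \G (end_of_previous_match) anchors each attempt at pos(), so this
# loop stops at the first non-digit instead of skipping ahead. The /c
# flag keeps pos() in place after the failed attempt.
my $data = '12ab34';
my @runs;
while ( $data =~ /\G(\d)/gc ) {
    push @runs, $1;
}
```

If escaped literals ever surprise you, printing the result of quotemeta() on the same string is a quick way to see what the pattern actually contains.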

Quantifiers

Quantifiers provide a mechanism to specify how many items to expect, in general or specific terms. You may have these exported into the calling package's namespace with the :standard argument to the use() call, but the preferred interface is to use them as method calls. This is slightly more complicated, but cleaner conceptually. The interface may change slightly in the future, if someone comes up with something even better.

By default, quantifiers operate on the next arguments, not the previous ones. (It is much easier to program this way.) For example, to match multiple digits, you might write:

        my $re = Regexp::English->new()
                ->multiple()
                        ->digits();

The indentation should make this more clear.

Quantifiers persist until something attempts a match or something calls the corresponding end() method. As match() calls end() internally, attempting a match closes all active quantifiers. There is currently no way to re-open a quantifier even if you add to a Regexp::English object. This is a non-trivial problem (as the author understands it), and there's no good solution for it in normal regular expressions anyway.

If you have imported the quantifiers, you can pass the quantifiables as arguments:

        use Regexp::English ':standard';

        my $re = Regexp::English->new()
                ->multiple('a');

This closes the open quantifier for you automatically. Though this syntax is slightly more visually appealing, it does involve exporting quite a few methods into your namespace, so it is not the default behavior. Besides that, if you get in this habit, you'll eventually have to use the :all tag. It's better to make a habit of using the method calls, or to push Vahe to write Regexp::Easy. :)

  • zero_or_more()

    Matches as many items as possible. Note that "possible" includes matching zero items. Note also that "item" means "whatever you told it to match". By default, this is greedy.

  • multiple()

    Matches at least one item, but as many as possible. By default, this is greedy.

  • optional()

    Marks an item as optional so that the pattern will match with or without the item.

  • minimal()

    This quantifier modifies either zero_or_more() or multiple(), and disables greediness, asking for as few matches as possible.
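
The difference between these quantifiers is easiest to see with the underlying regular expression tokens. This is a plain-Perl sketch, assuming the usual correspondence (zero_or_more to *, multiple to +, optional to ?, and minimal to a trailing ?); it does not show the patterns Regexp::English actually emits, and the sample strings are invented.

```perl
use strict;
use warnings;

my $html = '<b>one</b><b>two</b>';

# Greedy "+" takes as much as it can and backs off to the last ">".
my ($greedy) = $html =~ /<(.+)>/;

# Minimal "+?" stops at the first ">" that lets the match succeed.
my ($minimal) = $html =~ /<(.+?)>/;

# "*" accepts zero items, while "+" insists on at least one.
my $zero_ok    = 'ac' =~ /\Aab*c\z/ ? 1 : 0;
my $one_needed = 'ac' =~ /\Aab+c\z/ ? 1 : 0;

# "?" makes the preceding item optional: zero or one "u".
my @spellings = grep { /\Acolou?r\z/ } qw( color colour colouur );
```

Here $greedy holds everything up to the final closing bracket, while $minimal holds only the first tag name.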

Groupings

Groupings function much the same as quantifiers, though they have semantic differences. The most important similarity is that you can use them with the function or the method interface. The method interface is nicer, but see the documentation for end() for more information.

Groupings generally correspond to advanced Perl regular expression features like lookaheads and lookbehinds. If you find yourself using them on a regular basis, you're probably ready to graduate to hand-rolled regular expressions (or to contribute code to improve Regexp::English).

  • comment()

    Marks the item as a comment, which has no bearing on the match and really doesn't give you anything here either. Don't let that stop you, though.

  • group()

    Groups items together (often to use a single quantifier on them) without actually capturing them. This isn't very useful either, because the Quantifiers handle this for you.

  • followed_by()

    Marks the item as a zero-width positive look-ahead assertion. This means that the pattern must match the item after the previous bits, but the item is not part of the matched string as far as captures and pos() care.

  • not_followed_by()

    Marks the item as a zero-width negative look-ahead assertion. This means that the pattern must not match the item after the previous bits. Again, the item is not part of the matched string.

  • after()

    Marks the item as a zero-width positive look-behind assertion. This means the pattern must match the item before the following bits. This is super funky, and may have subtle bugs -- look-behinds tend to need fixed width items, and Regexp::English currently doesn't enforce this.

  • not_after()

    Marks the item as a zero-width negative look-behind assertion. This means the pattern must not match the item before the following bits. The fixed-width rule also applies here.
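
The four assertions above correspond to Perl's (?=...), (?!...), (?<=...), and (?<!...) constructs. The sketch below uses those constructs directly in plain Perl rather than through Regexp::English, with made-up sample strings, to show that the asserted item constrains the match without becoming part of the captured text.

```perl
use strict;
use warnings;

my $prices = '100 USD, 200 EUR';

# followed_by: digits that are followed by " USD"; the currency is
# checked but not captured.
my ($usd) = $prices =~ /(\d+)(?= USD)/;

# not_followed_by: a whole run of digits that is not followed by
# " USD". The \b stops the engine from settling for a shorter run
# such as "10".
my ($other) = $prices =~ /(\d+)\b(?! USD)/;

my $note = 'price $42 and 17 items';

# after: digits that come right after a dollar sign. The look-behind
# item has a fixed width of one character.
my ($dollars) = $note =~ /(?<=\$)(\d+)/;

# not_after: a run of digits that does not come after a dollar sign.
my ($plain) = $note =~ /(?<!\$)\b(\d+)/;
```

The word boundaries matter with the negative forms: without them, backtracking can satisfy the assertion by matching only part of a number.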

Miscellaneous

These subroutines don't really fit anywhere else. They're useful, and mostly cool.

  • new()

    Creates a new Regexp::English object. Though some methods do this for you automagically if you need one, this is the best way to start a regular expression.

  • match()

    Compiles the Regexp::English object and attempts to match it against a passed-in string. This will return any captured variables if they exist and if the match succeeds. If there are no captures, this will return a true or false value depending on whether the match succeeds.

  • remember()

    Causes Regexp::English to remember an item which match() will capture and return (or otherwise make available). Normally, match() returns these items in order of their declaration within the regular expression. You can also bind them to variables. Pass in a reference to a scalar as the first argument, and match() will automagically populate the scalar with the matched value on each subsequent match. That means you can write:

            my ($first, $second);
    
            my $re = Regexp::English->new()
                    ->remember(\$first)
                            ->multiple('a')
                            ->remember(\$second)
                                    ->word_char();
    
            for my $match (qw( aab aaac ad ))
            {
                    print "$second\t$first\n" if $re->match($match);
            }

    This will print:

            b       aab
            c       aaac
            d       ad

  • end()

    Ends an open Quantifier or Grouping. If you pass no arguments, it will end only the most recently opened item. If you pass a numeric argument, it will end that many recently opened items. It does not currently check to see if you pass in a number, so only pass in numbers, or be ready to handle odd results.

  • compile()

    Compiles and returns the pattern-in-progress, ending any and all open Quantifier or Groupings. This uses qr//. Note that any operation which stringifies the object will call this method. This appears to include treating a Regexp::English object as a regular expression. Nifty.

  • or()

    Provides alternation capabilities. The preferred interface is very similar to Grouping calls:

            my $re = Regexp::English->new()
                    ->group()
                            ->digit()
                            ->or()
                            ->word_char();

    Wrapping the entire alternation in group() or some other Grouping method is a very good idea, as you might want to use a Quantifier or something more complex:

            my $re = Regexp::English->new()
                    ->remember()
                                    ->literal('root beer')
                            ->or
                                    ->literal('milkshake')
                    ->end();

    If you find this onerous, you can also pass arguments to or(), which will group them together in non-capturing braces. Note that you will have to import the appropriate functions or fully qualify them. Calling these functions as class methods may not work reliably anyway. It may never work reliably. Properly indented, the method interface looks nicer anyway, but you have two options:

            my $functionre = Regexp::English->new()
                    ->or( Regexp::English::digit(), Regexp::English::word_char() );
    
            my $classmethodre = Regexp::English->new()
                    ->or( Regexp::English->digit(), Regexp::English->word_char() );

  • debug()

    Returns the regular expression so far. This can be handy if you know what you're doing.

  • capture()

    Performs the capturing logic. You probably don't need to know about this, but it's fairly cool.
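
The advice under or() about wrapping an alternation in a grouping reflects how alternation binds in ordinary regular expressions. The following plain-Perl sketch, which does not use Regexp::English and relies on invented sample strings, shows an ungrouped alternation matching more than intended, and a capturing group around an alternation similar to the remember() example.

```perl
use strict;
use warnings;

# Ungrouped, "|" splits the whole pattern: this reads as
# "starts with cat" OR "ends with dog".
my $loose = 'hotdog' =~ /^cat|dog$/ ? 1 : 0;

# Grouped, the anchors apply to both alternatives.
my $tight = 'hotdog' =~ /^(?:cat|dog)$/ ? 1 : 0;

# A capturing group around the alternation, assumed to be roughly what
# the remember() ... or ... end() chain above describes.
my $drink_re = qr/(root\ beer|milkshake)/;

my @drinks;
for my $order ( 'one root beer', 'a milkshake please', 'just water' ) {
    push @drinks, $1 if $order =~ $drink_re;
}
```

The ungrouped pattern accepts "hotdog" because the second alternative only requires the string to end in "dog".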

EXPORTS

By default, there are no exports. This is an object-oriented module, and this is how it should be. You can import the Quantifier and Grouping subroutines by providing the :standard argument to the use() line and the Character methods with the :chars tag.

        use Regexp::English qw( :standard :chars );

You can also use the :all tag:

        use Regexp::English ':all';

This interface may change slightly in the future. If you find yourself exporting things, you should look into Vahe Sarkissian's upcoming Regexp::Easy module. This is probably news to him, too.

TODO

  • Add not()

  • More error checking

  • Add a few tests here and there

  • Add POSIX character classes ?

  • Delegate to Regexp::Common ?

  • Allow other language backends (probably just add documentation for this)

  • Improve documentation

AUTHOR

chromatic, chromatic at wgz dot org, with many suggestions from Vahe Sarkissian and Damian Conway.

COPYRIGHT

Copyright (c) 2001-2002, 2005, 2011 by chromatic. Most rights reserved.

This program is free software; you can use, modify, and redistribute it under the same terms as Perl 5.12 itself.

See http://www.perl.com/perl/misc/Artistic.html

SEE ALSO

perlre