07 Mar 2012 13:01:05 UTC
- Distribution: Unicode-Regex-Set
- Module version: 0.04
NAME
Unicode::Regex::Set - Subtraction and Intersection of Character Sets in Unicode Regular Expressions
SYNOPSIS
use Unicode::Regex::Set qw(parse);
$regex = parse('[\p{Latin} & \p{L&} - A-Z]');
DESCRIPTION
Perl 5.8.0 lacks subtraction and intersection of character sets, which are described in Unicode Regular Expressions (UTS #18). This module provides a syntax that mimics character classes with subtraction and intersection, implemented by taking advantage of look-ahead assertions.
The syntax provided by this module differs considerably from standard Perl regex syntax.
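As a rough illustration of the look-ahead technique mentioned above, the SYNOPSIS pattern can be written by hand in plain Perl. This is a sketch only: the pattern below is an assumed equivalent, not necessarily the string that parse() returns, and the helper name in_set is hypothetical.

```perl
use strict;
use warnings;

# Hand-written equivalent of [\p{Latin} & \p{L&} - A-Z], for illustration
# only; the string actually returned by parse() may look different.
#   (?![A-Z])      subtraction:  reject ASCII capitals
#   (?=\p{Latin})  intersection: require the Latin script
#   \p{L&}         consume one cased letter
my $re = qr/(?![A-Z])(?=\p{Latin})\p{L&}/;

# Report whether a single character belongs to the set (hypothetical helper).
sub in_set {
    my ($char) = @_;
    return $char =~ /\A$re\z/ ? 1 : 0;
}

# U+0101 is a Latin letter outside ASCII; U+03B1 is a Greek letter.
for my $char ('a', 'A', '1', "\x{101}", "\x{3B1}") {
    printf "U+%04X %s\n", ord($char), in_set($char) ? 'in set' : 'not in set';
}
```

Both look-aheads are zero-width, so only the final \p{L&} consumes a character; each additional intersection or subtraction just adds another assertion in front of it.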
Any whitespace character (one that matches /\s/) is allowed between any tokens. Square brackets ('[' and ']') are used for grouping. Literal whitespace and literal square brackets must be escaped with a backslash ('\'). You cannot put a literal ']' at the start of a group.
A POSIX-style character class like [:alpha:] is allowed since its '[' is not a literal.
Separators ('&' for intersection, '|' for union, and '-' for subtraction) should be surrounded by one or more whitespace characters. E.g. [A&Z] is a list of 'A', '&', and 'Z'; [A-Z] is a character range from 'A' to 'Z'; [A-Z - Z] is the set obtained by removing [Z] from [A-Z].
The union operator '|' may be omitted. E.g. [A-Z | a-z] is equivalent to [A-Z a-z], and also to [A-Za-z].
The intersection operator '&' has high precedence, so [\p{A} \p{B} & \p{C} \p{D}] is equivalent to [\p{A} | [\p{B} & \p{C}] | \p{D}].
The subtraction operator '-' has low precedence, so [\p{A} \p{B} - \p{C} \p{D}] is equivalent to [[\p{A} | \p{B}] - [\p{C} | \p{D}] ].
[\p{A} - \p{B} - \p{C}] is the set obtained by removing \p{B} and \p{C} from \p{A}. It is equivalent to [\p{A} - [\p{B} \p{C}]] and to [\p{A} - \p{B} \p{C}].
Negation: when '^' comes just after a group-opening '[', i.e. when they are combined as '[^', all the tokens that follow are negated. E.g. [^A-Z a-z] matches anything that is in neither [A-Z] nor [a-z]. You can say this more clearly with grouping, as [^ [A-Z a-z]].
If a '^' that is not next to '[' is prefixed to a sequence of literal characters, character ranges, and/or metacharacters, that '^' negates only that sequence; e.g. [A-Z ^\p{Latin}] matches A-Z or a non-Latin character. But [A-Z [^\p{Latin}]] (or [A-Z \P{Latin}], since this is a simple case) is recommended for clarity.
If you want to remove anything other than PERL from [A-Z], use [A-Z & PERL] or, equivalently, [A-Z - [^PERL]]. Similarly, if you want to intersect [A-Z] with anything that is not JUNK, use [A-Z - JUNK] or, equivalently, [A-Z & [^JUNK]].
For further examples, please see the tests.
FUNCTION
$perl_regex = parse($unicode_character_class)
Parses a character class pattern written according to Unicode Regular Expressions and converts it into a Perl regular expression, which is returned as a string.
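Because parse() returns a string, one way to use the result is to interpolate it into a larger pattern. The following is a minimal sketch that assumes the module is installed; wrapping the result in a non-capturing group is a precaution taken here, not something the documentation requires, and the sample words are arbitrary.

```perl
use strict;
use warnings;
use Unicode::Regex::Set qw(parse);

# Build a character class for "A-Z without J, U, N, K".
my $class = parse('[A-Z - JUNK]');

# Interpolate the returned string; (?:...) keeps the quantifier applied
# to the whole class whatever form the string takes.
my $re = qr/\A(?:$class)+\z/;

for my $word (qw(PERL JUNK Perl)) {
    printf "%-4s %s\n", $word, ($word =~ $re ? 'matches' : 'does not match');
}
```

Compiling the pattern once with qr// avoids re-parsing the set each time it is used.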
AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2003, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO