NAME
Text::Glob::DWIW - Yet another Text::Glob{::Expand,}
SYNOPSIS
use Text::Glob::DWIW ':all';
say for textglob_expand 'glob{b,al replac}ing',
'Text{[-_],::}Glob{[-_],::}{DWIW,DoWhatIWant}';
my @r=textglob_grep 'a*c', qw(...);
DESCRIPTION
Text::Glob::DWIW implements glob(3) style expansion and also matching against text. If you want to look at usage examples first, jump to the textglob_expand explanation at the start of the FUNCTIONS section.
WHY
Several modules addressing this already exist on CPAN, e.g. for expanding Text::Glob::Expand and String::Glob::Permute, and also a handful for matching. Moreover, perl itself comes with two variants of globbing - < > aka glob, and bsd_glob from File::Glob, a core module - which can be (mis)used for text expansion as well.
Given that existing plurality, this waste of CPAN's namespace demands some explanation.
In short, every module considered lacked at least one of the features I wanted:
separated from file system; or the non-interacting can be ensured
character classes
recursive pattern, like nested braces
expansion
interpretation as path and the corresponding can be turned off.
simple interface, no arrogation to excessive descriptiveness
order is determined by pattern, and looks natural.
syntax is not too far from what is found in most shells, and syntax extensions are integrated harmonically.
WHEN STAY AWAY
This module also has its issues, such as missing functionality or poor performance. To make your decision easier, a big MISSING section exists, with hints about what you can do instead. See also SEE ALSO, where other modules are mentioned which might fit your needs better.
IMPORT TAGS
No functions are exported by default. They have fixed naming schemes from which you can select one: {textglob,tglob,tg}_*. So each function can be imported under three different names.
:textglob_ import the subroutines so they begin with textglob_... .
:tglob_ ditto, except tglob_...
:tg_ again, this time as tg_...
:textglob like :textglob_ but also import the textglob() function.
:tglob ditto, but textglob() is renamed to tglob().
:tg you can guess it
:all load all under all available names.
:use use TGDWIW { options } available
Typical usage example: use Text::Glob::DWIW qw'textglob :tg_'.
FUNCTIONS
All the functions can be adapted by an option hash. In the following list only the long form is mentioned, and the short form is used in examples.
textglob_expand
PATTERN ...-
expands the glob following the syntax described under PATTERN SYNTAX. The interpretation can be adapted with options, which are given inside a hashref as first or last argument.
tg_expand "[z-a]"         # z y x w v u t s r q ... h g f e d c b a
tg_expand "[?-']"         # ? > = < ; : 9 8 ... 0 / . - , + * ) ( '
tg_expand '[bcfglptwz]oo' # boo coo foo ...
tg_expand 'a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},}'
  # "abracadabra","abracadabr","abracadab","abracada",
  # "abracad","abraca","abrac","abra","abr","ab","a"
tg_expand '{abra,{*}cad{*}}' # 'abra', 'abracadabra'
And also subtractive patterns are available.
tg_expand '{a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},},!abrac}'
tg_expand '{a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},},!*a}'
  # "abracadabr", "abracadab", "abracad", "abrac", "abr", "ab"
For more of that, look at PATTERN SYNTAX. Note that a small addition to the pattern can produce a big (exponential) increase in the size of the resulting set. Some may think it would be better named textglob_explode. Be warned: this function is not meant for generating nasty big lists.
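To get a feel for how quickly the result set grows, the element count of a pure cross product can be predicted without expanding anything. The following is a minimal sketch in plain Perl, independent of this module; the helper name predicted_size is hypothetical, and it handles only a flat sequence of alternation sets given as array refs.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Glob::DWIW: multiplies the sizes
# of the sets that would be combined, without building any element.
sub predicted_size {
    my @sets  = @_;                # each set is an array ref of alternatives
    my $count = 1;
    $count *= scalar @$_ for @sets;
    return $count;
}

# '{a,b}{0,1}' corresponds to two sets of two alternatives each.
my $small = predicted_size([qw(a b)], [qw(0 1)]);

# Five classes of four characters, like '[0123][0123][0123][0123][0123]'.
my $big = predicted_size(([0 .. 3]) x 5);

print "$small $big\n";             # prints "4 1024"
```

Every further four-element class multiplies the count by four again, which is why a short pattern can already be too much.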
In numeric context (forcible with int) it returns the count of results, by expanding all of them in memory beforehand.
A more kludgy but also (most of the time) more effective counterpart exists: textglob_expand_lazy.
textglob_expand(PATTERN ...)->format(FORMAT ...)
-
allows Text::Glob::Expand style decoration and capturing, with %<num>, %<num>.<num> etc. For details of the format string, see there.
print tg_expand('{foo,bar}')
  ->format("'%0' is used far too often in examples\n");
print tg_expand('Robert {[A-C].,} Wilson')
  ->format("'%1' is the middle initial of '%0'\n");
say tg_expand('{f[o]o,b[a]r}')->format("'%1.1' = middle char");
say tg_expand('{{Cinderella}{fore},{Alice}{hind}}')
  ->format("%1.1 is be%1.2 the mirror.");
say tg_expand('{file{0-100}}.tar.gz')->format("mv %0 %1.tgz");
my %r=tg_expand('{f[o]o,b[a]r}')->format("%1.1",{paired=>1});
  # ( foo => 'o', bar => 'a' )
- Other methods
textglob_expand(PATTERN ...)->METHOD(...)
-
->elems() and @{ } list the result in the same form as when called in list context, ignoring the tree or chunk option.
->chunks(), ->tree() is equivalent to using the option with the same name.
->size() and int( ) return the count of expanded elements.
It can be used as a very basic iterator with ->next(), $$o or < >. Besides interchangeability with textglob_expand_lazy, the value of iterators here is low because the whole result is created in one go.
Only textglob_expand can return an object. No further OO interfaces for other functions are currently implemented.
textglob_match
PATTERN STRING ...-
returns whether the string matches. In list context several strings can be tested at once.
tg_match 'a*','aaa'            # true
tg_match 'a*',qw'aaa abc a b'  # 1 1 1 0
textglob_grep
PATTERN STRING ...-
returns the strings which match the pattern.
tg_grep 'a*a', qw'aa abba abracadabra' # finds all
textglob_glob
PATTERN STRING ...-
works like textglob_grep but also adds expansions from wildcardless sub-patterns.
textglob
PATTERN or textglob or textglob PATTERN STRING ...-
The one-argument form acts like textglob_expand. The parameterless variant does too, but uses $_ instead. In the other cases it mimics textglob_glob. Many people like such all-in-one functionality, but it's not everyone's cup of tea. If you prefer less magic, use the more explicit variants.
textglob_re
PATTERN ...-
transforms the glob into a regexp.
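To give an idea of what such a transformation involves, here is a minimal sketch in plain Perl that turns a very small glob subset into an anchored regexp. The helper simple_glob_to_re is hypothetical and says nothing about how textglob_re works internally; it knows only *, ? and backslash escapes, and none of the options.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation: handles only
# '*', '?', backslash-escaped characters and literal characters.
sub simple_glob_to_re {
    my ($glob) = @_;
    my $re = '';
    while ($glob =~ /\G(\\.|.)/gs) {
        my $tok = $1;
        if    ($tok eq '*')       { $re .= '.*' }          # any run of characters
        elsif ($tok eq '?')       { $re .= '.' }           # exactly one character
        elsif ($tok =~ /^\\(.)/s) { $re .= quotemeta $1 }  # escaped: take verbatim
        else                      { $re .= quotemeta $tok }
    }
    return qr/\A$re\z/s;          # anchored at both sides
}

my $re      = simple_glob_to_re('a*a');
my @matches = grep { $_ =~ $re } qw(aa abba abracadabra abc);
print "@matches\n";               # prints "aa abba abracadabra"
```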
textglob_foreign
PATTERN ... CLASS-
transforms the glob into other classes. So for example Text::Glob::Expand has caching and can manage slightly more data. I believe that the connection to TGE is stable but has some restrictions as explained later on.
my $obj=tg_foreign '[[:card:]]#2' => 'Text::Glob::Expand';
say join ',',$obj->explode;
If this is not sufficient, you can try to use modules which apply index arithmetic to calculate the values and thus offer random access without the need of calculating interim values.
my $obj=tg_foreign '{1-100}#10' => 'Set::CartesianProduct::Lazy';
say $obj->count;                     # 1e20
say join ' ',$obj->get(1234567890);  # 1 1 1 1 1 13 35 57 79 91

my $obj=tg_foreign '{1-100}#10' => 'List::Gen';
say $obj->size;                      # 1e20
say join ' ',$obj->get(1234567890);  # 1 1 1 1 1 13 35 57 79 91
Don't expect too much here. Whether you combine the strengths or the weaknesses of the two modules depends heavily on the pattern. The non-globbing backends are more a proof of concept to show what could be possible.
Transformations are defined for Set::CartesianProduct::Lazy, List::Gen, HOP::Stream, Set::CrossProduct, Iterator::Array::Jagged and Text::Glob::Expand.
Options can be given as first argument, or immediately before the CLASS.
Here the following restrictions apply:
No support for explicit anchors
ranges, wildcards and subtractive patterns are pre-resolved by this module. Depending on their position and kind most of the work is done by this module. Thus no advantage can be gained through the use of a more powerful backend.
currently for non-globbing backends the recursion is resolved, even in cases where the backend would understand it.
misinterpretation is more likely (for non-globbing backends).
A look at the lazy generation point under what is missed might be interesting.
textglob_foreign
PATTERN ... ITER_TYPE- aka
textglob_expand_lazy
-
textglob_foreign can also generate an iterator. The same restrictions apply as above.
To stress the point, this function is only partly lazy: at most the outermost expansion layer is handled lazily.
The following ITER_TYPEs are supported:
'CODE'                         returns a closure as iterator
'REF'                          a reference to an auto-advancing scalar(/array)
'++'                           simple(minded) iterator: while (++$i) {...$$i...}
                               also in stock: while (defined(my $i=<..>)) {...}
CALL => \&sub                  iterate over and call sub each time
'SIZE'                         calculated size only
'Iterator::Simple'             instance of that class
'Iterator::Simple::Lookahead'  -"-
'Iterator'                     -"-
textglob_expand_lazy assumes '++' in the one-argument form. Besides that, no difference to textglob_foreign exists. The offered builtin iterator (++) should be able to mimic different styles of iterator interfaces and therefore should be combinable with a wide range of libraries offering 'list' processing.
Also note that the object returned from textglob_expand exports basic iteration support: $$o (like REF-mode) and < >.
textglob_options
HASH-
For more details see Setting of Options.
PATTERN SYNTAX
- alternation
{aa,bb,cc}
- aka brace expansion
-
tg_expand 'Read The F'.
  '{ine,unded,a{{bul,m}ous,ntastic},{ascinat,\*}ing} Manual'
tg_expand '{m{eo,iao,e}w,purr...}m{eo,iao,e}w' # concert at home
generates all possible combinations. The pattern {a,b}{0,1} results in the list a0, a1, b0 and b1. This means the number of resulting elements is the product of the element counts of all sets. Be aware of that and take care.
A { or , may be quoted with a backslash to prevent it from being considered part of a brace expression. An alternation set may contain others.
- alternation ranges
{aa-zz}
- aka sequence expression
-
tg_expand '{aunt-away}'      # like perl's range op: aunt aunu ..
tg_expand '{y0-2b2}'         # more: y0 y1 .. z0 z1 .. 1a0 1a1 ..
tg_expand '{100-1,zero}'     # countdown
tg_expand '{\-5-5}'          # -5 -4 -3 -2 -1 0 1 2 3 4 5
tg_expand '{0.00-20.00}'     # 0.00 0.01 .. 19.98 19.99 20.00
tg_expand '{(1)-(1_001)}'    # (1) (2) .. (999) (1_000) (1_001)
tg_expand '{*******-*}'      # ******* ****** ***** **** *** ** *
tg_expand ':{,\----------})' # pinocchio, animated
Ranges with negative numbers are only defined for integers (only sign and digits). Punctuation characters must be used equivalently, otherwise the result is undefined.
Some shells use .. instead of -, however AFAIK all use - for character ranges. In this module a decision for unification was made.
Important: Be careful not to create a range unintentionally.
- alternation ranges with step size
{aa-zz-5}
-
tg_expand '{auto-bane-1000}' # auto awga axsm azey
tg_expand '{001-100-9}'      # 001 010 019 028 ... 082 091 100
tg_expand '{-10-10-2}'       # -10 -8 -6 -4 -2 0 2 4 6 8 10
tg_expand '{-10--20-2}'      # -10 -12 -14 -16 -18 -20
The step size must be a decimal integer value greater than zero. This means tg_expand '{a-z-0}' is interpreted as ranging from a to z-0. The step size for descending ranges is only well defined if the start and end point are part of the result set. For other possibilities to construct ranges see List::Maker and the powerful and comprehensive List::Gen.
- alternation subtraction
{!bb}
&{!mm-qq}
-
remove matching elements from the expansion set inside the same scope of braces. In other words, this operation is restricted to the nearest surrounding braces.
tg_expand '{0-20,!*[13579]}' # 0 2 4 6 8 ... 16 18 20
tg_expand '{[a-d][a-d][a-d][a-d],!*{a*a,b*b,c*c,d*d}*}' # permute
Side note: The above example is for syntax demonstration only. For calculating permutations you had better use Algorithm::Permute or similar.
- character classes
[asdf]
-
tg_expand '[bcdhlmprt]uff' # ... ruff tuff
One of the characters matches, or else has its place in the generated set.
Consider: [aaa] and [a] are not the same in the expanding case; the first delivers a, a, a, analogous to {a,a,a} or a{,,}. Empty character classes are also allowed. Therefore [][] has no special meaning (two nothings), which entails that you must quote a contained closing bracket: [\][].
Note: Nothing is a little bit imprecise; {} and [] represent the set with the one element of the zero-length string ''. If you need cross products with empty sets then look somewhere else, for example to Set::Scalar->cartesian_product, Set::CrossProduct or List::Gen->cross.
- character ranges
[a-z]
-
The one character wide counterpart of alternation ranges.
tg_expand '[1-357-9]' # 1, 2, 3, 5, 7, 8, 9
tg_expand "[\0-\40]"  # "\0", "\1", "\2", ... ' '
Whereas tg_expand "[\t- ]" results only in "\t", "\n", "\N{VT}", "\f", "\r", " ". When the start and end point belong to the class of printable, whitespace or alarm bell characters, then the generated output is also restricted to this class.
- predefined char classes
[ [:upper:] ]
-
tg_expand '[aeiou[:space:]]'
tg_expand '[[:punct:]][[:punct:]]' # potential twigils
The following predefined classes are supported and in case they do not constitute a very narrow set, they are restricted to the ASCII range.
[:digit:]     cumbersome way to write [0-9]
[:xdigit:]    [0-9a-f]
[:punct:]     punctuation chars ala POSIX (locale ignored)
[:space:]     whitespace
[:blank:]     "\t" & " "
[:lower:]     [a-z]
[:upper:]     [A-Z]
[:alpha:]     [A-Za-z]
[:lowernum:]  [a-z0-9]
[:uppernum:]  [A-Z0-9]
[:cardsym:]   spade heart diam club, both colors black first
[:card:]      playing cards
[:die:]       all sides of a die
[:chess:]     chessmen, white makes the first move
[:mahjong:]   tiles of mah-jong
[:trigram:]   base for the i ching
[:zodiac:]    signs of zodiac
[:note:]      musical notes
[:smiley:]    the unicode consortium knows about 59 different
              emotions expressible by a circle with points
              and lines in it.
[:planet:]    symbols of planets of our solar system, with the
              sun (a star) so the name is not really correct.
[:polygon:]   triangle,quadrangle,pentagon & hexagon
[:legal:]     the sign for (C),(R),(P),SM,TM.
[:roman:]     roman numerals (but not the ASCII substitute)
With a - at the end the generating order is reversed.
[:digit-:]    [9-0]
- subset from predefined char classes
[ [:lower4-6,20:] ]
-
[:lower12-14:]   l, m & n
[:cardsym1-4:]   black suit
[:cardsym6,2:]   red and black heart
[:card1:]        ace of spades
[:card1,1,1,1:]  swindler
[:lowernum-27:]  [9-0] again
The numbering begins with one, so [:luck1-:] is identical to [:luck:], and [:luck-1:] is identical to [:luck-:]. This numbering scheme has the pitfall that [:digit1:] is 0. Yet starting with one is more natural in most cases.
tg_expand '[[:card1-11,13-25,27-39,41-53,55-56:]]'
tg_expand '{[[:card1-56:]],![[:card12,26,40,54:]]}'
  # both: all cards without jokers and knights
- pattern quantifier
{ }#9
,{ }#0-9
- and further
[ ]#9
,[ ]#0-9
- aka list exponentiation
-
tg_expand '[01]#8'                 # 0..255 in binary
tg_expand '[abc]#0-1'              # optional, same as {,[abc]}
tg_expand 'AB{inside comment}#0BA' # ABBA
tg_expand ':[-]#0-8)'              # pinocchio again
tg_expand '# {-=}#38-'             # decoration line
tg_expand '[a]#10'                 # by the doctor
tg_expand '1[_,]#0-1\200'          # =1{,[_,]}200: 1200 1_200 1,200
A {pattern}#n is the same as repeating the pattern n times ({pattern}{pattern}...). Being an expander feature it needs a finite upper bound. If you need more power, the Regexp::Genex module is worth a try.
For matching, using the builtin Regexp is preferable. Most widely known are the (non-expanding) ksh style variants ?(), *(), +(), @() and !(). The use of # may look familiar to zsh users, but the meaning differs from the (yet another) matching-only extension in zsh.
- element repeat
{ }##9
or[ ]##9
-
tg_expand '[01]##8'                 # only: 00000000 11111111
tg_expand ':[-]##8)'                # pinocchio, unanimated
tg_expand '# {-=}##38-'             # as #
tg_expand '[a]##10'                 # as #
tg_expand '{([_]#0-3)}##2'          # S-XXL
tg_expand '{[a-d]#4,!{*[a-d]}##2*}' # permutation again
Not the pattern, but the element is repeated. Where [ab]#2 produces aa, ab, ba and bb, the same as [ab][ab], [ab]##2 produces only aa and bb; here only the resulting element is duplicated. Mnemonic: the repetition is done later, hence two #.
- wildcards
?
,*
,**
&***
-
*    Match zero or more characters, except those listed in
     unchar=>'...'. Also tests against condition of unhead=>'...'
?    Match a single character, honors unchar- & unhead-option.
***  Match any string of characters by ignoring unchar & unhead.
**   All letters (not restricted by unhead) are allowed inside,
     but bordered by 'unchar' against other characters.
     This whole-parts-only resembles multiple directory semantic.
     Fallback to *-behavior if unchar is not set.
The wildcards behave slightly differently when matching, subtracting or expanding. In expansions a wildcard stands only for a single best-fitting value instead of all.
For a better understanding of the difference between ** and ***, here is a description of how to replace one by the other:
***  {*,*/**,**/*,*/**/*}  # unchar=>'/' assumed, and unhead unset
**   {/,/***/}             # ignoring cases at the start or end
Using tg_grep {unchar=>'/'},'a**d', ... would match a/d, a/b/d, a/b/c/d and so on, but not ad, ab/d, ab/cd. The variant a/**/d additionally doesn't match a/d. While a***d matches all the examples above. The **-behavior (when {unchar=>'/',unhead=>'.'} options are set) is comparable to that in many shells (after shopt -s globstar is applied).
Some examples, assuming we have a list of paths - a domain familiar to the majority - and unchar is set accordingly:
/** or /***  match absolute paths
?***         match relative paths (whereas ?** = {?,?/**})
**/ or ***/  any paths ending in / (aka marked directories)
**file       that file anywhere
***ext       file with that ext, wherever
**file.*     that file with whatever extension anywhere
dir**        dir and all its subdirectories with all files inside
dir/***ext   all files with that ext under that dir or its subdirs
**subdir/*   all files inside such named subdirs
**subdir**   subdir and everything beneath
See further in the option section for adaptable behavior, e.g. through options like unchar and unhead.
- the escape character
\
-
The backslash \ forces the following (meta)character to lose its special meaning, so that it is used verbatim.
- word splitting (alternatives for csh'ish space separator)
-
Instead of the space separator from the original csh glob facility, you can use:
textglob '{foo,bar}'
textglob [qw'foo bar']
- normal text
-
The rest, this includes space and parentheses, and per default also slash, tilde, equal sign and (leading) dot constitutes normal text. But some of the option switches allow a more shellish handling.
- anchors
-
Per default the pattern is implicitly anchored at both sides. Besides using *....* for suppression, an anchored-option exists.
tg_grep 'jam', qw'pyjamas jamboree',{anchored=>0}
If you are feeling lucky you can try another very experimental feature of explicit anchors. These can be turned on with anchors.
tg_expand 'for{$,,ever and }ever',{anchors=>1} # for, forever, forever and ever
tg_expand 'flop{^,$}flip',{anchors=>1}         # flip, flop
It is important to note that when used for matching, the start anchor ^ has the restriction that only variable-length patterns which can go down to zero length are allowed to precede it.
tg_options anchors => 1;
tg_grep 'flop{^,$}flip',qw'flip flop'           # only a flop
tg_grep '*{/a/,^}bla', qw'where/ever/a/bla bla' # works
However such a limitation does not apply to the end anchor $. (The behavior of the end anchor $ is more consistent with its use for expansion.)
tg_grep 'for{$,,ever and }ever', # fine, match all
  'for','forever','forever and ever'
Don't jumble the two options! They have very different effects.
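What the anchored option means for matching can be pictured with ordinary regexps. This is a minimal sketch in plain Perl under the assumption that the literal pattern jam translates to the regexps shown; it does not call this module, and the word jam was added to the list for illustration.

```perl
use strict;
use warnings;

my @words = qw(jam pyjamas jamboree);

# Implicitly anchored at both sides: the whole string must fit.
my @anchored   = grep { /\Ajam\z/ } @words;

# Roughly what {anchored=>0} stands for: a part of the string is enough.
my @unanchored = grep { /jam/ } @words;

# Single-sided anchoring, here at the start only.
my @start_only = grep { /\Ajam/ } @words;

print "@anchored | @unanchored | @start_only\n";
# prints "jam | jam pyjamas jamboree | jam jamboree"
```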
OPTIONS
Options influence the behavior and extend the adaptability and thereby the range of application/usage opportunities.
Setting of Options
Options can be supplied directly to the function call or already when loading the module. So you don't have to repeat them if you use the same options several times in a row.
use Text::Glob::DWIW ':all', { unchar => '/' };
tg_options { case => 0 }; # tg_options case => 0; also works
say for tg_grep { anchored => 0 }, 'falling stars', ...;
- Appended to the
use
statement -
Here the options must be specified as a hash reference at the end. This method only allows constant (compile-time known) values. The options act in all function calls which are inside the same lexical scope as the use statement. Declaring another use in a narrower scope is possible. These options are only set once at compile time, and therefore are not reset if the program flow arrives at them another time. The combined behavior with a textglob_options call (from inside the same scope) is loosely comparable with state variables.
As shorthand notation - instead of always writing out the full package name - the tag :use can be added to the first import. After doing that, use TGDWIW { } is available alternatively.
- Through the
textglob_options
function -
Here the validity is the scope of the next outer use statement. A restriction to constants doesn't exist. If needed, a use clause (with or without options) and a following textglob_options can be combined.
- Directly supplied to the function call
-
The options must be handed over as the first or as last parameter in the form of a hash reference
{ }
. Options are only considered for that function and override options set otherwise.
Warning: The presetting capabilities work only with an explicit use, without scope-related indirections.
General Options
quant
(default: '#,##')-
The quantifiers ...#n-m and ...##n can be turned off; then a # behind { } or [ ] acts as a normal character.
range
(default: '{},[]')-
The {0-100} and [a-z] can be turned off. Then the hyphen-minus (-) is handled like a normal character.
charclass
(default: 'def1,sort0')-
Some character class features are also switchable. E.g. the predefined character classes [[:punct:]] can be turned off with {charclass=>'def0'}. The result is then as if the feature didn't exist. For example [[:punct:]] is interpreted as a char class [\[:punct:] with a following ], which generates [], :], p], u], ... t], :].
Some shells generate brace sequences in natural order, but sort the contribution from char classes in ascending order. With {charclass=>'sort+'} this can be simulated, and sort- is for descending order.
minus
(default: 1)-
The subtracting with { ,! } can also be turned off.
anchors
(default: '')-
Basic support for explicit anchors exists. This feature is known to be buggy, and is therefore turned off by default. Only turn it on if you cannot live without it.
But maybe you have searched for the anchored-option anyway, which can be found in the following section about options for matching.
tilde
(default: undef)-
Through this option the handling of tilde expansion is available:
say tg_expand '~{he,she,it,sking}/path',{tilde=>'/home/'};
More powerful possibilities are offered by using coderefs:
sub tilde_expand ($$$)
{ my ($what,$arg,$delim)=@_;
  # the hyphen is placed first in the class so it cannot be read as a range
  my $nyi= $what eq '~' && $arg!~/^[-+\d]/ && $delim=~qr'^/?$';
  return unless $nyi; # don't change
  File::HomeDir->${$arg eq '' ? \'my_home' : \'users_home'}($arg)
}
say tg_expand $p='~{he,she,it,sking}/path',{tilde=>\&tilde_expand};
Typical meanings (mentioned here only so you know what is your part ;-):
~user, ~{user}  File::HomeDir->users_home($user)
~               File::HomeDir->my_home
~-              $ENV{OLDPWD}  # or whatever is available in perl
~+n             (`dirs`)[$n]
~-n             (`dirs`)[-$n-1]
=file           File::Which::which($file)
The subref/closure variant is not combinable with the tree or chunk option. It is also not available in combination with the object interface. Tilde expansion matches only at the beginning of patterns. But in contrast to shell behavior, a path separator sign (e.g. : under Unix) is not honored. Split it yourself beforehand.
break
(default: 0 # =off)-
Big lists can all too easily be generated by simple patterns.
tg_expand '[0123][0123][0123][0123][0123]',{break=>1000}; # die
This option allows setting an upper bound for the size of a generated list. It dies if this limit is reached. If you turn this feature on, use it inside an eval block to catch the exception.
Some assumptions are made:
Only sets which are going to be constructed are handled. The reasoning is that for matching, more complex patterns are processable, and so they are accepted.
Sizes of interim sets are checked.
Checks are only done when a quantifier is used, when a cross product happens, and for ranges.
Ranges are often only roughly guessed, ...
No analysis of cost is done, so {aaaa-zzzz} is considered to have the same costs as [a-z][a-z][a-z][a-z], and even {[a-z][a-z][a-z][a-z],![a-z][a-z][a-z][a-z]}.
Only the growth in the number of elements counts; the growth in the size of a single element is not considered. Here the important exception is ##n. Without this the value of the whole option would be questionable.
The size of a single element from the input always counts as one.
So the last point means, that you must also restrict the length of the input field! Otherwise:
tg_expand 'a'x10_000_000,{break=>1} # no die, the death himself
The value you should set for break depends on the power you have. The following values are from a weak machine and should be considered as a starting point. A value between 1000 and 3000 seems reasonable, if you forbid the subtractive pattern by setting {minus=>0}. With this costly feature enabled, a value of 100 seems to fit better.
Also set the stepsize-option to a reasonable value, e.g. -100.
And please don't rely on this feature. This is most likely not ready for security sensitive production environments! Maybe combining it with modules like Time::Out helps.
stepsize
(default: 0 # =on, without restriction)-
Ranges with step size can be turned off or limited.
undef: If set to undef the step size feature is completely turned off. Then step sizes are not recognized as such and the appendage is interpreted as a part of the range's end point.
0: A value of 0 means no limit.
>0: If a number greater than zero is set, then this is the maximal allowed step size. If this size is exceeded, an exception is thrown. See eval in perlfunc for handling.
<0: If a negative number is given, this is a kind of soft limit that influences the internal element count prediction. This only has an effect if break is also set.
tg_expand '{1-100-100}',{stepsize=>-10,break=>10} # '1'
tg_expand '{1-100-100}',{stepsize=>-10,break=>9 } # die
Note: Currently an extended magic increment is used for non-integer ranges, which can get CPU intensive if big steps are used. So one of the reasons for the restriction is that 'magic' arithmetic operations are not yet programmed, and so delegation to repeated increments is used.
Options which are specific for Wildcards
star
(default: '?,*,**,***')-
The wildcards ***, **, * and ? can also be turned off. With {star=>0} these symbols stop being special and are taken verbatim. With {star=>1} they can later be brought back. Selecting individual wildcards is also possible.
twin
(default: '**,***')-
For degrading the twin star **- and the triplet star ***-wildcard. With {twin=>0} these act like normal stars. It is sometimes called globstar.
As surplus, {twin=>'**+'} switches ** to *** behavior and {twin=>'***-'} switches *** to ** behavior. To switch them off completely, see the star-option.
unchar
(default: '')-
All inputs are equal. But here you can define 'non grata' chars, which are not matched by the wildcards * and ?. This setting is ignored by the ***-wildcard, and has special meaning for **. If unset - the default - then ***, ** and * act identically.
textglob 'fo*ba*','foo/bar','foobaz',{unchar=>'/'} # foobaz only
If the argument to the option looks like a character class, it is interpreted as such.
If you don't want to match multiline texts, use unchar=>"\n".
unhead
(default: '')-
If the string starts (or continues after an unchar) with one of the characters mentioned in the unhead-option, it is hidden from the result unless it is explicit in the pattern.
Under Unix files beginning with a leading dot are called hidden. They are second class citizens (or are they éminence grise?) which are only visible if explicitly requested.
textglob {unhead=>'.'},'*', qw'. .. .bashrc fine your.txt' # find: fine, your.txt
So {unhead=>'.',unchar=>'/'} serves dotglob behavior.
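The effect of unchar and unhead on a star can be approximated with hand-written regexps. The following is a minimal sketch in plain Perl; the two regexps are assumptions made for illustration and are not what this module generates.

```perl
use strict;
use warnings;

# '*' with {unhead=>'.'}: a leading dot would have to be explicit in the
# pattern, so a negative lookahead rejects it at the start of the string.
my $star_unhead = qr{\A(?!\.).*\z}s;
my @visible = grep { $_ =~ $star_unhead } qw(. .. .bashrc fine your.txt);

# 'fo*ba*' with {unchar=>'/'}: the star may not cross a slash.
my $star_unchar = qr{\Afo[^/]*ba[^/]*\z};
my @same_level = grep { $_ =~ $star_unchar } qw(foo/bar foobaz);

print "@visible | @same_level\n";   # prints "fine your.txt | foobaz"
```

Note that your.txt stays visible because its dot is not at the head of the string.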
Options for Expanding
tree
(default: 0)-
Instead of a list of text, returns a list of listrefs where brace subgroups are themselves refs (natural list refs or scalar refs as marker). This can be seen as an alternative to the ->format feature. Compare it to the capturing feature for matching.
tg_expand 'a{b,c}d',{tree=>1} # ['a',\'b','d'], ['a',\'c','d']
chunk
(default: 0)-
Instead of a list of text, returns a LoL structure, where the listrefs hold the ordered single chunks. This and the previous feature are only available for textglob_expand.
tg_expand 'a{b,c}d',{chunk=>1} # [qw'a b d'], [qw'a c d']
Options which influence Matching
case
(default: 1)-
Normally matches are case sensitive {case=>1}. But you can choose to ignore case, with {case=>0}. Besides that, an extended case mode {case=>2} exists, where uppercase characters match only uppercase, but lowercase match both. This mode is best known from search engines. Then an uppercase variant {case=>3} exists where uppercase letters match both.
my @v=qw'ABC abc Abc aBC aBc Abd';
tg_grep {case=>0}, 'Abc', @v # all except Abd
tg_grep {case=>1}, 'Abc', @v # only Abc
tg_grep {case=>2}, 'Abc', @v # ABC Abc
tg_grep {case=>3}, 'Abc', @v # abc Abc
tg_grep {case=>-1},'Abc', @v # aBC
tg_grep {case=>-2},'Abc', @v # ABC aBC
tg_grep {case=>-3},'Abc', @v # abc aBC aBc
A mode for people with a defective shift key {case=>4}, where the first character of every word, if lowercase, matches both.
tg_grep {case=>4}, 'abc', @v # abc Abc
tg_grep {case=>-4},'ABC', @v # abc Abc
Besides that, a CamelCase mode {case=>5} also exists:
tg_grep {case=>5}, 'CamelCase',qw'CamelCase camel_case camelcase'
tg_grep {case=>-5},'CamelCase',qw'cAMELcASE c_a_m_e_lc_a_s_e cc'
  # find first and second
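The search-engine style mode {case=>2} can be pictured as a per-character decision. Here is a minimal sketch in plain Perl, assuming a purely literal ASCII pattern; smartcase_re is a hypothetical helper and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the idea behind {case=>2}; it covers
# literal patterns only and is not taken from the module's source.
sub smartcase_re {
    my ($literal) = @_;
    my $re = '';
    for my $c (split //, $literal) {
        $re .= $c =~ /[a-z]/
            ? "[$c" . uc($c) . "]"   # lowercase matches both cases
            : quotemeta $c;          # everything else matches only itself
    }
    return qr/\A$re\z/;
}

my @v    = qw(ABC abc Abc aBC aBc Abd);
my $re   = smartcase_re('Abc');
my @hits = grep { $_ =~ $re } @v;
print "@hits\n";                     # prints "ABC Abc"
```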
anchored
(default: 'a,z')-
Normally pattern search is done by testing if the whole string fits. By turning off anchoring, a part of the string is enough for matching. This is useful if you like to combine parts, because enclosure with * doesn't help with that. Single-sided anchoring is available by setting the option to ^ or $, or by setting it to a or z.
my @horoscopes=
 ...
 .. $astro=~/${ \tg_re '[[:zodiac:]]*',
      {anchored=>0,greedy=>1,unchar=>'[[:zodiac:]]'} }/g;
 .. $astro=~/${\tg_re'[[:zodiac:]]',{anchored=>0}} [\pP\w\s]*/xg;
 .. split /(?=${\tg_re '[[:zodiac:]]',{anchored=>0}})/,$astro;
Of course for that simple case, you can write:
my @horoscopes=grep !/^.$|\Q$astro/,
  $astro=~tg_re '{[[:zodiac:]]*}#12',{capture=>1};
The Interpolation module is recommended as assembly adhesive. If you only want to pimp up your REs, have a look at Regexp::Common.
invert
(default: 0)-
Inverts the matching. Only for use with textglob_match and textglob_grep. It is also fine for textglob_glob and textglob, so long as you use them only because of their shorter names. But in the cases where matching is mixed with expansion, it is unlikely to do what you want.
capture
(default: 0)-
has only meaning for textglob_re. Through that, {} and [] act as capture groups.
my $re=tg_re 'A {v* ,}{*} story',{capture=>1};
my @r='A very short story'=~/$re/; # 'very ', 'short'
It interacts slightly with rewrite. You can use grep defined,(...=~/$re/) to equalize the differences between these modes.
A common interface between expanding and matching would be nice, but OTOH that way it was easy to implement. It's here because it was easier to code than to explain why it is left out. This option is likely to change or disappear in future.
greedy
(default: 0)-
Default behaviour for *, ** and *** is non-greedy (0); you can switch to greedy (1) and possessive (2).
'eggshells'=~tg_re '{egg*s}*',{greedy=>0,capture=>1}        # eggs
'eggshells'=~tg_re '{egg*s}*',{greedy=>1,capture=>1}        # as is
tg_match 'sim*.bim', 'simsala.bim',{greedy=>2,unchar=>'.'}  # 1
tg_match 'sim***.bim','simsala.bim',{greedy=>2,unchar=>'.'} # 0
Best you forget that this option exists. (Consider using Regexp.)
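For comparison, the three flavours look like this when written directly as Perl regexps. This is a minimal sketch; how each glob maps onto these regexps is an assumption made for illustration, not output of textglob_re.

```perl
use strict;
use warnings;

# Lazy and greedy quantifiers capture different parts of the same string.
my ($lazy)   = 'eggshells' =~ /\A(egg.*?s).*?\z/s;   # like greedy=>0
my ($greedy) = 'eggshells' =~ /\A(egg.*s).*\z/s;     # like greedy=>1

# Possessive quantifiers never give characters back once matched.
# '[^.]*+' stops before the dot, so the rest can still match.
my $bounded   = 'simsala.bim' =~ /\Asim[^.]*+\.bim\z/ ? 1 : 0;
# '.*+' swallows the whole rest, so '\.bim' finds nothing left.
my $unbounded = 'simsala.bim' =~ /\Asim.*+\.bim\z/s   ? 1 : 0;

print "$lazy $greedy $bounded $unbounded\n";   # prints "eggs eggshells 1 0"
```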
Esoteric Options
last
(default: 1)-
The unescaping/dequoting of this module mostly follows filter semantics, so different kinds of data processing can be stacked together. Normally the escaping of the escape (in our case a backslashed backslash \\, so that it is taken verbatim) should only be removed by the last stage. That way usage requires no knowledge of the filter stack depth, and composite tools can be seen as a black box. If this module is not the last stage, you can set this option to off with last=>0; then \\ is not dequoted, and the protected and the protecting backslash are handed down as they are. This applies only to expansion.
rewrite
(default: 0 for expand, 1 for matching)-
Instead of expanding, the pattern is only rewritten to a normalised, simpler form. This is the default interim format for matching.
tg_expand 'foo{[ab][01]}#2{[ab][01]}##2ba[rz]',{rewrite=>1} # foo{{{a,b}{0,1}}{{a,b}{0,1}}}{a0a0,a1a1,b0b0,b1b1}ba{r,z}
If necessary - for
##
element repeat or !
... subtraction - the pattern is partly expanded. Also the last
-option is ignored, and always off. The wildcards are transferred as is, so under expansion the star
-option is meaningless.
It is useful for debugging, to see the pattern differently or to detect whether
rewrite=>0
changes what is matched, which it shouldn't.
Another use case is feeding the rewritten pattern to another module which understands only basic patterns, when you prefer to write the fancy ones.
backslash
(default: '')-
In combination with
unchar
, backslash
can be used to allow a preceding backslash (in the text domain) to disable that special meaning. Besides that, the backslashed sequence counts as a single char for the ?
-wildcard. Remember that this option only affects wildcards, so the backslash must be written out in explicit parts of the pattern.
Unstable, and candidate for removal.
my @v=("ab", "c\nd", "e\\\nf");
tg_grep '*',@v,{ unchar=>"\n",backslash=>"\n" } # "ab", "e\\\nf"
tg_grep '???', @v, { backslash=>1 } # "c\nd","e\\\nf"
tg_grep "?\\\\\n?", @v, {backslash=>...} # "e\\\nf"
The dequoting in pattern space is not changed in any way. Don't allow this option to confuse you.
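The filter-stack idea behind the last option (described further up) can be illustrated with a small sketch in plain Perl. The function dequote_stage is hypothetical and is not how this module implements dequoting; it only shows why a backslashed backslash should survive every stage except the final one.

```perl
use strict;
use warnings;

# Hypothetical dequoting stage: strip one level of backslash escapes.
# A backslashed backslash is kept as-is unless this is the last stage.
sub dequote_stage {
    my ($text, $is_last) = @_;
    $text =~ s{\\(.)}{ $1 eq '\\' ? ($is_last ? '\\' : '\\\\') : $1 }gse;
    return $text;
}

my $input = 'a\{b\\\\c';                 # the seven characters: a \ { b \ \ c

my $middle = dequote_stage($input, 0);   # a{b\\c  (double backslash kept)
my $final  = dequote_stage($middle, 1);  # a{b\c   (reduced only at the end)

# If an inner stage wrongly acts as the last one, the literal backslash is lost:
my $broken = dequote_stage(dequote_stage($input, 1), 1);   # a{bc

print "$middle\n$final\n$broken\n";
```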
ERRORS
The following error messages are defined and are thrown under the respective conditions.
Useless call of
... in void context.
-
The function is called without having the possibility to return a result.
Unknown option
....
-
The given option isn't understood.
Error in option setting: Scope of use declaration not found.
-
You have loaded this module by something other than a normal
use
statement. In such a case a textglob_options
call can trigger this error. Add an explicit use
beforehand. Otherwise you are restricted to passing the options directly.
Too much (>
...).
-
If
break
is set, and that limit is reached.
Step size too wide (>
...).
-
If
stepsize
is greater than zero, and that limit is reached.
Can't load
...<!>Unknown module
...requested.
-
textglob_foreign
doesn't know the requested module or has trouble loading it.
PITFALLS
Because options can be preset in a lexically scoped, pragma-like way, the following constructs are incompatible with that feature and are not supported in this regard:
{ use Text::Glob::DWIW ...; } ... # outside the scope
use Text::Glob::DWIW (); # preset feature also turned off
require Text::Glob::DWIW; # not even turned on
eval "use Text::Glob::DWIW ...;" ...
If options are set in such situations, they are silently ignored. Furthermore textglob_options
called in such context and without an existing upper scope declaration will throw an exception.
Note: In the case that some programmatic control over module loading is needed, you can use use if $test, ...
and use maybe ...
.
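Why the scope matters can be seen in a minimal sketch of how lexically scoped settings commonly work in Perl, using the hints hash %^H as described in perlpragma. The package My::Opt and its functions are hypothetical; that Text::Glob::DWIW stores its presets the same way is an assumption, not something stated above.

```perl
use strict;
use warnings;

package My::Opt;    # hypothetical pragma-like module

# import() runs at compile time and marks the scope being compiled.
sub import { $^H{'My::Opt/on'} = 1 }

# Look up the hints that were in effect where in_effect() was called.
sub in_effect {
    my $hints = (caller(0))[10];
    return $hints && $hints->{'My::Opt/on'} ? 1 : 0;
}

package main;

sub inside_scope {
    BEGIN { My::Opt->import }        # stands in for `use My::Opt;` here
    return My::Opt::in_effect();     # setting visible: 1
}

sub outside_scope {
    return My::Opt::in_effect();     # the scope above has ended: 0
}

print inside_scope(), outside_scope(), "\n";
```

A setting made inside a block, or inside a string eval, ends with that block, which is why the constructs listed above cannot carry preset options to the code that follows them.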
CAVEATS
It is assumed that only small patterns are typically used. No optimisation for speed or against memory exhaustion is considered.
Nearly no error handling and recovery is built in. If you feed garbage, you get garbage back - most of the time. This do-the-next-best-thing strategy also means that no forward compatibility exists. So most likely your code must be adapted for new releases.
Instead of following a clear design, this module was developed in a rather dirty and hackish way. So regexps and in-band signaling are heavily used. Mutual recursion is used so excessively that the resulting code convolution is best called higher order spaghetti.
BUGS
Pretty sure (see design caveats ;-). Anyway, if you catch one, mail how to reproduce it, what you got and what you expected. And maybe also which behaviour you rely on staying unchanged, as an action-result pair.
MISSING
This module may have some advantages over TGE and SGP. I wrote it to get a glob expander which possesses those particular features. The only reason why I hacked the matching features in was my dislike of such a longish name as Text::Glob::Expand::DWIW. So the non-expander functionality is a bit rudimentary.
Especially negative character classes would be useful.
- Negative character classes
[!ab]
-
Not yet implemented. Sorry for that.
- Special Treatment of
..
& .
-
tg_grep '.*',{unhead=>'.'}
matches .
and ..
. Most shells allow suppressing this behavior. You can add an extra layer for filtering these out:
tg_grep {invert=>1,unchar=>'/'},'**{.,..}**', tg_grep ....
- Understanding of Path Syntax
-
- repeating slash
-
/usr//tmp/*
(clean up the input instead). Look at the ->cleanup
-method which is offered by Path::Class
. This can also help with the following points.
- current directory
.
-
Replace
/./
with /
and remove ./
at start and /.
at end beforehand.
- parent directory
..
-
Remove
/*/..
repeatedly and also */..
from the start.
- volumes
-
Depending on your needs replace
'D:foobar'
with Cwd::getdcwd('D:').'\\foobar'
or 'D:**\\foobar'
- csh'ish empty
{}
-
Not special; write it explicitly as
\{\}
.
- Independent capturing support
( )
-
exists neither for matching nor for expanding. But an every-
{}
-and-every-[]
-is-a-marker/selector is available. If the capture
-option is not enough or too cumbersome, use Regexp. This is what they are for. For expanding, only the following restricted possibilities exist: searching through the result sets of the tree
-option is one, and the Text::Glob::Expand-like tg_expand(
...)->format(
...)
method is the other. But to emphasize: you have to know your pattern, because every {}
and []
is marked or selected.
- lazy generation
-
If you have to deal with patterns that produce big result sets (and you don't like to experiment with less stable parts of this module, like the half-baked, for-demonstration-purposes-only functions
textglob_foreign
and textglob_expand_lazy
), then sorry, this module is definitely not for you.
I thought about it, especially about doing it with index arithmetic, which allows random access without the need to hold any of the result in memory. See
textglob_foreign
for a restricted example with the help of Set::CartesianProduct::Lazy
.
Recursive patterns shouldn't be a problem. But for magic ranges basic arithmetic operations are needed. Also subtractive patterns and wildcards which match the actual expansion set are at least difficult, maybe even impossible to solve directly. Of course an extra layer with a memory-friendly hole and insertion store is possible. I hope you understand that this sounds like too much work. So my decision fell on the side of let-it-be instead of do-it-right.
Nevertheless it would open cool opportunities:
xx_expand('{1-*}[abc]{1-*}{1-*-2}')->[100*Inf**2+100] # result of this hypothetical routine would be: 33b1200 ;-)
Anyway, I have never used globs for generating more than 1000 elements. (Hmmm, maybe even only 100. But I also never tried to back up all my files in my inbox along with all those mails with big attachments in the same place. So in this 'modern' world my computer usage appears to be untypical. OK, I'm wrong: the youngsters today store their data on dropbox or skydrive, make a youtube video about it and then use a URL shortener for posting on facebook. This way they can be sure that a north-american suction agency makes a backup. But as with backups generally: how do you get it back when you need it? (OK, maybe an additional backup onto the gmail account helps.))
- Sorting option
-
Sorting can be done afterwards, and is an independent functionality - at least as long as it does not depend on the pattern. A minimalistic version of partial sorting is added for compatibility reasons:
{charclass=>'sort+'}
. Details can be found in the options section.
- Syntax switching
-
There are a few popular extensions, like the globbing of zsh or the VMS DCL syntax, e.g. triple dot
...
instead of **
. But this module has its own hard-wired syntax. (yet another. yuck.) Changing between syntaxes is offered by Regexp::Wildcards
as its main feature.
- Substitution
-
Something like
tg_grep('{**}/{*}.tar.gz',...)->format('%1/new/%2.tgz',{paired=>1})
would be nice, but is not implemented. You can use textglob_re
with capture=>1
, and then perl's s///
. If you have to handle files then maybe File::GlobMapper
fulfills your needs.
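The capture-then-s/// workaround for substitution might look like the following sketch in plain Perl. The regex is a hand-written stand-in for the pattern '{**}/{*}.tar.gz'; what textglob_re with capture=>1 actually generates for it is not shown here, and the helper name retarget is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map 'DIR/NAME.tar.gz' to 'DIR/new/NAME.tgz'.
# Returns nothing (an empty list in list context) if the path does not match.
sub retarget {
    my ($path) = @_;
    (my $new = $path) =~ s{^(.*)/([^/]*)\.tar\.gz$}{$1/new/$2.tgz}
        or return;
    return $new;
}

my @files   = ('src/dist/foo-1.0.tar.gz', 'README', 'a/b.tar.gz');
my @renamed = map { retarget($_) } @files;   # non-matching entries drop out

print "$_\n" for @renamed;
# src/dist/new/foo-1.0.tgz
# a/new/b.tgz
```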
SEE ALSO
- file based and in the CORE
- renowned, but matcher only
-
Regexp::Wildcards, Text::Glob, Regexp::Shellish, Regexp::SQL::LIKE
- expander
-
Text::Glob::Expand, String::Glob::Permute, String::Range::Expand, Regexp::Genex (Regexp based), Data::Generate (alienated)
- glob-based filename substituter
-
File::GlobMapper which is part of IO::Compress, File::Wildcard
- lightweight named capturing matcher
- list modules with fuze
-
List::Gen (swiss army knife), Set::CartesianProduct::Lazy (lightning), Set::CrossProduct (iterating), Iterator::Array::Jagged, Math::Cartesian::Product
- list comprehension
- and now for something completely different
-
File::HomeDir, File::Which, Path::Class, Cwd, Time::Out, Algorithm::Permute, Set::Scalar (a real member of Set::), Interpolation, Regexp::Common, HOP::Stream, Iterator::Simple (a worthy representative for all mentioned iterator packages), if core module, maybe (nearly a philosophy: just try it)
CONTRIBUTION
Some test files t/02-*.t are borrowed/adapted from other modules on CPAN mentioned above, which also provide glob
functionality.
COPYRIGHT
(c) 2013 Josef. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.