Text::Glob::DWIW - Yet another Text::Glob{::Expand,}
    use Text::Glob::DWIW ':all';
    say for textglob_expand 'glob{b,al replac}ing',
                            'Text{[-_],::}Glob{[-_],::}{DWIW,DoWhatIWant}';
    my @r=textglob_grep 'a*c', qw(...);
Text::Glob::DWIW implements glob(3) style expansion and also matching against text. If you want to look at usage examples first, jump to the textglob_expand explanation at the start of the FUNCTIONS section.
Some modules addressing this task already exist on CPAN, e.g. Text::Glob::Expand and String::Glob::Permute for expanding, and also a handful for matching. Moreover, perl itself comes with two variants of globbing, < > aka glob and bsd_glob from File::Glob (a core module), which can also be (mis)used for text expansion.
Given this existing plurality, taking up yet another piece of CPAN's namespace demands some explanation.
In short, every module I considered lacked at least one of the features I wanted:
separation from the file system, or at least a way to ensure that no interaction takes place
character classes
recursive patterns, like nested braces
expansion
interpretation as a path, and the behavior that comes with it, can be turned off
a simple interface, with no pretension to excessive descriptiveness
the order of the results is determined by the pattern and looks natural
the syntax is not too far from what is found in most shells, and syntax extensions are integrated harmoniously
This module has its own issues too, such as missing functionality or performance. To make your decision easier, a big MISSING section exists, with hints on what you can do instead. See also SEE ALSO, which mentions other modules that might fit your needs better.
No functions are exported by default. Each function is available under three fixed prefixes, from which you can select one: {textglob,tglob,tg}_*. So each function can be imported under three different names.
    :textglob_   import the subroutines so they begin with textglob_... .
    :tglob_      ditto, except tglob_...
    :tg_         again, this time as tg_...
    :textglob    like :textglob_ but also import the textglob() function.
    :tglob       ditto, but textglob() is renamed to tglob().
    :tg          you can guess it
    :all         load all under all available names.
    :use         use TGDWIW { options } available
Typical usage example: use Text::Glob::DWIW qw'textglob :tg_'.
All the functions can be adapted by an option hash. In the following list only the long form of each name is mentioned, while the short form is used in the examples.
textglob_expand expands the glob following the syntax described under PATTERN SYNTAX. The interpretation can be adapted with options, which are given inside a hashref as first or last argument.
    tg_expand "[z-a]"            # z y x w v u t s r q ... h g f e d c b a
    tg_expand "[?-']"            # ? > = < ; : 9 8 ... 0 / . - , + * ) ( '
    tg_expand '[bcfglptwz]oo'    # boo coo foo ...
    tg_expand 'a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},}'
      # "abracadabra","abracadabr","abracadab","abracada",
      # "abracad","abraca","abrac","abra","abr","ab","a"
    tg_expand '{abra,{*}cad{*}}' # 'abra', 'abracadabra'
Subtractive patterns are available as well.
    tg_expand '{a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},},!abrac}'
    tg_expand '{a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},},!*a}'
      # "abracadabr", "abracadab", "abracad", "abrac", "abr", "ab"
For more of that, look at PATTERN SYNTAX. It should be mentioned that a small addition to the pattern can produce a big (exponential) increase in the size of the resulting set. Some may think it would be better named textglob_explode.
You have been warned: this function is not meant for generating nasty big lists.
In numeric context (forcible with int) it returns the count of results - by expanding all of these in memory beforehand.
A more kludgy but (most of the time) more effective counterpart exists: textglob_expand_lazy.
textglob_expand( )->format( )
allows Text::Glob::Expand style decoration and capturing, with %<num>, %<num>.<num> etc. For details of the format string, see there.
    print tg_expand('{foo,bar}')
      ->format("'%0' is used far too often in examples\n");
    print tg_expand('Robert {[A-C].,} Wilson')
      ->format("'%1' is the middle initial of '%0'\n");
    say tg_expand('{f[o]o,b[a]r}')->format("'%1.1' = middle char");
    say tg_expand('{{Cinderella}{fore},{Alice}{hind}}')
      ->format("%1.1 is be%1.2 the mirror.");
    say tg_expand('{file{0-100}}.tar.gz')->format("mv %0 %1.tgz");
    my %r=tg_expand('{f[o]o,b[a]r}')->format("%1.1",{paired=>1});
      # ( foo => 'o', bar => 'a' )
)->
(
->elems() and @{ } list the result in the same form as a call in list context, ignoring the tree or chunk option. ->chunks() and ->tree() are equivalent to using the option with the same name. ->size() and int( ) return the count of expanded elements.
It can be used as a very basic iterator with ->next(), $$o or < >. Apart from interchangeability with textglob_expand_lazy, the value of iterating this way is low, because the whole result is created in one go.
Only textglob_expand can return an object. No further OO interfaces for other functions are currently implemented.
textglob_match
returns whether the string matches. In list context several strings can be tested at once.
    tg_match 'a*','aaa'             # true
    tg_match 'a*',qw'aaa abc a b'   # 1 1 1 0
textglob_grep
returns the strings which match the pattern.
tg_grep 'a*a', qw'aa abba abracadabra' # finds all
textglob_glob
works like textglob_grep but also adds expansions from wildcardless sub-patterns.
textglob
The one-argument form acts like textglob_expand. The parameterless variant does the same, but uses $_ instead. In all other cases it mimics textglob_glob. Many people like such all-in-one functionality, but it's not everyone's cup of tea. If you prefer less magic, use the more explicit variants.
textglob_re
transforms the glob into a regexp.
textglob_foreign
transforms the glob into an object of another class. For example, Text::Glob::Expand has caching and can manage slightly more data. I believe that the connection to TGE is stable, but it has some restrictions, as explained later on.
    my $obj=tg_foreign '[[:card:]]#2' => 'Text::Glob::Expand';
    say join ',',$obj->explode;
If this is not sufficient, you can try to use modules which apply index arithmetic to calculate the values and thus offer random access without the need of calculating interim values.
    my $obj=tg_foreign '{1-100}#10' => 'Set::CartesianProduct::Lazy';
    say $obj->count;                     # 1e20
    say join ' ',$obj->get(1234567890);  # 1 1 1 1 1 13 35 57 79 91

    my $obj=tg_foreign '{1-100}#10' => 'List::Gen';
    say $obj->size;                      # 1e20
    say join ' ',$obj->get(1234567890);  # 1 1 1 1 1 13 35 57 79 91
Don't expect too much here. Whether you combine the strengths or the weaknesses of the two modules depends heavily on the pattern. The non-globbing backends are more a proof of concept to show what could be possible.
Transformations are defined for Set::CartesianProduct::Lazy, List::Gen, HOP::Stream, Set::CrossProduct, Iterator::Array::Jagged and Text::Glob::Expand.
Options can be given as first argument, or immediately before the CLASS.
Here the following restrictions apply:
No support for explicit anchors
ranges, wildcards and subtractive patterns are pre-resolved by this module. Depending on their position and kind most of the work is done by this module. Thus no advantage can be gained through the use of a more powerful backend.
currently, for non-globbing backends the recursion is resolved even in cases where the backend understands it.
misinterpretation is more likely (for non-globbing backends).
The lazy generation item in the MISSING section might also be of interest.
textglob_foreign can also generate an iterator. The same restrictions apply as above.
To stress the point, this function is only partly lazy: at most the outermost expansion layer is handled lazily.
The following ITER_TYPEs are supported:
    'CODE'                         returns a closure as iterator
    'REF'                          a reference to an auto-advancing scalar(/array)
    '++'                           simple(minded) iterator: while (++$i) {...$$i...}
                                   also in stock: while (defined(my $i=<..>)) {...}
    CALL => \&sub                  iterate over and call sub each time
    'SIZE'                         calculated size only
    'Iterator::Simple'             instance of that class
    'Iterator::Simple::Lookahead'  -"-
    'Iterator'                     -"-
textglob_expand_lazy assumes '++' in its one-argument form. Apart from that, it does not differ from textglob_foreign. The builtin iterator (++) should be able to mimic different styles of iterator interfaces and should therefore be combinable with a wide range of libraries offering 'list' processing.
Also note that the object returned from textglob_expand exports basic iteration support $$o (like REF-mode) and < >.
textglob_options
For more details see Setting of Options.
{aa,bb,cc}
    tg_expand 'Read The F'.
      '{ine,unded,a{{bul,m}ous,ntastic},{ascinat,\*}ing} Manual'
    tg_expand '{m{eo,iao,e}w,purr...}m{eo,iao,e}w' # concert at home
generates all possible combinations. The pattern {a,b}{0,1} results in the list a0, a1, b0 and b1. This means the number of resulting elements is the product of the element count of all sets. Be aware of that and take care.
A { or , may be quoted with a backslash to prevent it from being considered part of a brace expression. An alternation set may contain others.
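The combination rule described above can be sketched in plain Perl. This is only an illustration of the cross product, not the module's implementation; the helper name cross_product is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: cross product of several alternation sets.
# The leftmost set varies slowest, so {a,b}{0,1} gives a0 a1 b0 b1.
sub cross_product {
    my @sets    = @_;      # list of array refs, one per brace group
    my @results = ('');    # start with the single empty string
    for my $set (@sets) {
        @results = map { my $head = $_; map { $head . $_ } @$set } @results;
    }
    return @results;
}

my @r = cross_product([qw(a b)], [qw(0 1)]);
print "@r\n";              # a0 a1 b0 b1
```

The element count is the product of the set sizes, which is why a small addition to a pattern can multiply the size of the result.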
{aa-zz}
    tg_expand '{aunt-away}'      # like perl's range op: aunt aunu ..
    tg_expand '{y0-2b2}'         # more: y0 y1 .. z0 z1 .. 1a0 1a1 ..
    tg_expand '{100-1,zero}'     # countdown
    tg_expand '{\-5-5}'          # -5 -4 -3 -2 -1 0 1 2 3 4 5
    tg_expand '{0.00-20.00}'     # 0.00 0.01 .. 19.98 19.99 20.00
    tg_expand '{(1)-(1_001)}'    # (1) (2) .. (999) (1_000) (1_001)
    tg_expand '{*******-*}'      # ******* ****** ***** **** *** ** *
    tg_expand ':{,\----------})' # pinocchio, animated
Ranges with negative numbers are only defined for integers (only sign and digits). Punctuation characters must be used identically at both end points; otherwise the result is undefined.
Some shells use .. instead of -, however AFAIK all use - for character ranges. This module unifies the two and uses - in both cases.
Important: Be careful not to create a range unintentionally.
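For comparison, this is how Perl's own range operator behaves on the first example. It is a sketch of the core-language behavior that the comment "like perl's range op" alludes to, and says nothing about how the module builds its ranges internally.

```perl
use strict;
use warnings;

# Perl's range operator applies the "magic" string increment
# when its operands do not look like numbers.
my @words = ('aunt' .. 'away');
print "$words[0] $words[1] ... $words[-1]\n";   # aunt aunu ... away

# The core operator only counts upwards; a countdown has to be
# built by hand, for example with reverse.
my @countdown = reverse 1 .. 5;
print "@countdown\n";                            # 5 4 3 2 1
```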
{aa-zz-5}
    tg_expand '{auto-bane-1000}' # auto awga axsm azey
    tg_expand '{001-100-9}'      # 001 010 019 028 ... 082 091 100
    tg_expand '{-10-10-2}'       # -10 -8 -6 -4 -2 0 2 4 6 8 10
    tg_expand '{-10--20-2}'      # -10 -12 -14 -16 -18 -20
The step size must be a decimal integer value greater than zero. This means tg_expand '{a-z-0}' is interpreted as ranging from a to z-0. The step size for descending ranges is only well defined if the start- and end-point are part of the result set. For other possibilities to construct ranges see List::Maker and the powerful and comprehensive List::Gen.
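For plain integers, the stepped ranges above correspond to a simple counting loop. The following is a minimal sketch under that assumption; stepped_range and its width parameter are hypothetical names for this example and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: integer range with a positive step size.
# $width is the minimum field width used for zero padding.
sub stepped_range {
    my ($from, $to, $step, $width) = @_;
    die "step size must be a positive integer\n"
        unless $step =~ /\A[1-9][0-9]*\z/;
    my @out;
    if ($from <= $to) {
        for (my $i = $from; $i <= $to; $i += $step) {
            push @out, sprintf('%0*d', $width, $i);
        }
    }
    else {
        for (my $i = $from; $i >= $to; $i -= $step) {
            push @out, sprintf('%0*d', $width, $i);
        }
    }
    return @out;
}

my @up   = stepped_range(1, 100, 9, 3);     # like {001-100-9}
my @down = stepped_range(-10, -20, 2, 1);   # like {-10--20-2}
print "@up\n";
print "@down\n";
```

In both calls the end point happens to be hit exactly, which is the case the text calls well defined for descending ranges.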
{!bb}
{!mm-qq}
removes matching elements from the expansion set inside the same scope of braces. In other words, this operation is restricted to the nearest surrounding braces.
    tg_expand '{0-20,!*[13579]}' # 0 2 4 6 8 ... 16 18 20
    tg_expand '{[a-d][a-d][a-d][a-d],!*{a*a,b*b,c*c,d*d}*}' # permute
Side note: The above example is for syntax demonstration only. To calculate permutations you had better use Algorithm::Permute or similar.
[asdf]
tg_expand '[bcdhlmprt]uff' # ... ruff tuff
When matching, one of the characters must match; when expanding, each character takes its place in the generated set.
Consider: [aaa] and [a] are not the same in the expanding case; the first delivers a,a,a, analogous to {a,a,a} or a{,,}. Empty character classes are also allowed. Therefore [][] has no special meaning (two nothings), which entails that you must quote a closing bracket inside a class: [\][].
Note: "nothing" is a little imprecise; {} and [] represent the set whose only element is the zero-length string ''. If you need cross products with empty sets, look somewhere else, for example at Set::Scalar->cartesian_product, Set::CrossProduct or List::Gen->cross.
[a-z]
The one character wide counterpart of alternation ranges.
    tg_expand '[1-357-9]' # 1, 2, 3, 5, 7, 8, 9
    tg_expand "[\0-\40]"  # "\0", "\1", "\2", ... ' '
In contrast, tg_expand "[\t- ]" results only in "\t", "\n", "\N{VT}", "\f", "\r", " ". When the start and end point belong to the class of printable, whitespace or alarm bell characters, the generated output is restricted to that class as well.
[ [:upper:] ]
    tg_expand '[aeiou[:space:]]'
    tg_expand '[[:punct:]][[:punct:]]' # potential twigils
The following predefined classes are supported. Unless they constitute a very narrow set anyway, they are restricted to the ASCII range.
    [:digit:]     cumbersome way to write [0-9]
    [:xdigit:]    [0-9a-f]
    [:punct:]     punctuation chars ala POSIX (locale ignored)
    [:space:]     whitespace
    [:blank:]     "\t" & " "
    [:lower:]     [a-z]
    [:upper:]     [A-Z]
    [:alpha:]     [A-Za-z]
    [:lowernum:]  [a-z0-9]
    [:uppernum:]  [A-Z0-9]
    [:cardsym:]   spade heart diam club, both colors black first
    [:card:]      playing cards
    [:die:]       all sides of a die
    [:chess:]     chessmen, white makes the first move
    [:mahjong:]   tiles of mah-jong
    [:trigram:]   base for the i ching
    [:zodiac:]    signs of zodiac
    [:note:]      musical notes
    [:smiley:]    the unicode consortium knows about 59 different
                  emotions expressible by a circle with points
                  and lines in it.
    [:planet:]    symbols of planets of our solar system, with the
                  sun (a star) so the name is not really correct.
    [:polygon:]   triangle,quadrangle,pentagon & hexagon
    [:legal:]     the sign for (C),(R),(P),SM,TM.
    [:roman:]     roman numerals (but not the ASCII substitute)
With a - at the end the generating order is reversed.
[:digit-:] [9-0]
[ [:lower4-6,20:] ]
    [:lower12-14:]    l, m & n
    [:cardsym1-4:]    black suit
    [:cardsym6,2:]    red and black heart
    [:card1:]         ace of spades
    [:card1,1,1,1:]   swindler
    [:lowernum-27:]   [9-0] again
The numbering begins with one, so [:luck1-:] is identical to [:luck:], and [:luck-1:] is identical to [:luck-:]. This numbering scheme has the pitfall that [:digit1:] is 0. Yet starting with one is more natural in most cases.
    tg_expand '[[:card1-11,13-25,27-39,41-53,55-56:]]'
    tg_expand '{[[:card1-56:]],![[:card12,26,40,54:]]}'
      # both: all cards without jokers and knights
{ }#9
{ }#0-9
[ ]#9
[ ]#0-9
    tg_expand '[01]#8'                 # 0..255 in binary
    tg_expand '[abc]#0-1'              # optional, same as {,[abc]}
    tg_expand 'AB{inside comment}#0BA' # ABBA
    tg_expand ':[-]#0-8)'              # pinocchio again
    tg_expand '# {-=}#38-'             # decoration line
    tg_expand '[a]#10'                 # by the doctor
    tg_expand '1[_,]#0-1\200'          # =1{,[_,]}200: 1200 1_200 1,200
A {pattern}#n is the same as repeating the pattern n times ({pattern}{pattern}...). Being an expander feature, it needs a finite upper bound. If you need more power, the Regexp::Genex module is worth a try.
For matching, using Perl's builtin regular expressions is preferable. Most widely known are the (non-expanding) ksh style variants ?(), *(), +(), @() and !(). The use of # may look familiar to zsh users, but its meaning differs from zsh's (yet another) matching-only extension.
{ }##9
[ ]##9
    tg_expand '[01]##8'                 # only: 00000000 11111111
    tg_expand ':[-]##8)'                # pinocchio, unanimated
    tg_expand '# {-=}##38-'             # as #
    tg_expand '[a]##10'                 # as #
    tg_expand '{([_]#0-3)}##2'          # S-XXL
    tg_expand '{[a-d]#4,!{*[a-d]}##2*}' # permutation again
Here the resulting element is repeated, not the pattern. Whereas [ab]#2 produces aa, ab, ba and bb (the same as [ab][ab]), [ab]##2 produces only aa and bb, because only the resulting element is duplicated. Mnemonic: the repetition is done later, hence two #.
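The difference between the two quantifiers can be reproduced with core Perl alone. This is a sketch of the two repetition semantics for the set a, b; it does not call the module.

```perl
use strict;
use warnings;

my @set = qw(a b);

# Like [ab]#2: the pattern is repeated, giving the full cross product.
my @pattern_repeat = map { my $first = $_; map { $first . $_ } @set } @set;

# Like [ab]##2: each resulting element is repeated as a whole.
my @element_repeat = map { $_ x 2 } @set;

print "@pattern_repeat\n";   # aa ab ba bb
print "@element_repeat\n";   # aa bb
```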
?
*
**
***
    *    Match zero or more characters, except those listed in
         unchar=>'...'. Also tests against the condition of unhead=>'...'.
    ?    Match a single character; honors the unchar- & unhead-option.
    ***  Match any string of characters, ignoring unchar & unhead.
    **   All letters (not restricted by unhead) are allowed inside, but
         bordered by 'unchar' against other characters. This
         whole-parts-only behavior resembles multiple directory
         semantics. Falls back to *-behavior if unchar is not set.
The wildcards behave slightly differently when matching, subtracting or expanding. In expansions a wildcard stands only for a single best-fitting value instead of all possible values.
For a better understanding of the difference between ** and ***, here is a description of how to replace one by the other:
    ***  {*,*/**,**/*,*/**/*}  # unchar=>'/' assumed, and unhead unset
    **   {/,/***/}             # ignoring cases at the start or end
Using tg_grep {unchar=>'/'},'a**d',... would match a/d, a/b/d, a/b/c/d and so on, but not ad, ab/d, ab/cd. The variant a/**/d additionally doesn't match a/d, while a***d matches all the examples above. The **-behavior (when {unchar=>'/',unhead=>'.'} options are set) is comparable to that in many shells (after shopt -s globstar is applied). Some examples, assuming we have a list of paths (a domain familiar to the majority) and unchar is set accordingly:
    /** or /***   match absolute paths
    ?***          match relative paths (whereas ?** = {?,?/**})
    **/ or ***/   any paths ending in / (aka marked directories)
    **file        that file anywhere
    ***ext        file with that ext, wherever
    **file.*      that file with whatever extension anywhere
    dir**         dir and all its subdirectories with all files inside
    dir/***ext    all files with that ext under that dir or its subdirs
    **subdir/*    all files inside such named subdirs
    **subdir**    subdir and everything beneath
See further in the option section for adaptable behavior, e.g. through options like unchar and unhead.
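To make the a**d example concrete, here is a hand-written regular expression that follows the replacement rule ** = {/,/***/} with '/' as the separator. It is an approximation written for this illustration, and not necessarily what textglob_re would produce.

```perl
use strict;
use warnings;

# a**d with unchar=>'/': either a single separator between a and d,
# or whole path parts enclosed by separators.
my $re = qr{\A a / (?: .* / )? d \z}xs;

my @paths = qw(a/d a/b/d a/b/c/d ad ab/d ab/cd);
my @hits  = grep { $_ =~ $re } @paths;
print "@hits\n";   # a/d a/b/d a/b/c/d
```

The strings ad, ab/d and ab/cd are rejected because the a and the d must each be a whole path part.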
\
The backslash \ forces the following (meta)character to lose its special meaning, so that it is used verbatim.
Instead of the space separator of the original csh glob facility, you can use:
    textglob '{foo,bar}'
    textglob [qw'foo bar']
Everything else, including space and parentheses and by default also slash, tilde, equal sign and (leading) dot, constitutes normal text. But some of the option switches allow a more shellish handling.
By default the pattern is implicitly anchored at both ends. Besides using *....* to suppress this, an anchored-option exists.
tg_grep 'jam', qw'pyjamas jamboree',{anchored=>0}
If you are feeling lucky you can try another very experimental feature of explicit anchors. These can be turned on with anchors.
    tg_expand 'for{$,,ever and }ever',{anchors=>1}
      # for, forever, forever and ever
    tg_expand 'flop{^,$}flip',{anchors=>1} # flip, flop
Note that, when used for matching, the start anchor ^ may only be preceded by variable-length patterns that can shrink to zero length.
    tg_options anchors => 1;
    tg_grep 'flop{^,$}flip',qw'flip flop'           # only a flop
    tg_grep '*{/a/,^}bla', qw'where/ever/a/bla bla' # works
However, no such limitation applies to the end anchor $. (The behavior of the end anchor $ is more consistent with its use in expansion.)
    tg_grep 'for{$,,ever and }ever', # fine, match all
      'for','forever','forever and ever'
Don't confuse the two options, anchored and anchors! They have very different effects.
Options influence the behavior and thereby extend the range of possible applications.
Options can be supplied directly to the function call or already when loading the module. So you don't have to repeat them if you use the same options several times in a row.
    use Text::Glob::DWIW ':all', { unchar => '/' };
    tg_options { case => 0 };  # tg_options case => 0; also works
    say for tg_grep { anchored => 0 }, 'falling stars', ...;
use
Here, options must be specified as a hash reference at the end. This method only allows constant (compile-time known) values. The options apply to all function calls inside the same lexical scope as the use statement. Another use can be declared in a narrower scope. These options are set only once, at compile time, and are therefore not reset when the program flow reaches them again. The combined behavior with a textglob_options call (from inside the same scope) is loosely comparable to state variables.
As shorthand notation - instead of always writing out the full package name - the tag :use can be added to the first import. After doing that, use TGDWIW { } is available alternatively.
Here the options are valid within the scope of the nearest enclosing use statement. There is no restriction to constants. If needed, a use clause (with or without options) and a following textglob_options can be combined.
The options must be handed over as the first or the last parameter in the form of a hash reference { }. Options are only considered for that function call and override options set otherwise.
Warning: The presetting capabilities work only with an explicit use statement, without scope-related indirections.
quant
#,##
The quantifier ...#n-m and ...##n can be turned off, then a # behind { } or [ ] acts as a normal character.
range
{},[]
The ranges {0-100} and [a-z] can be turned off. Then the hyphen-minus (-) is handled like a normal character.
charclass
def1,sort0
Some character class features are also switchable. E.g. the predefined character classes such as [[:punct:]] can be turned off with {charclass=>'def0'}. The result is then as if the feature did not exist. For example, [[:punct:]] is interpreted as the char class [\[:punct:] followed by a ], which generates [], :], p], u], ... t], :].
Some shells generate brace sequences in natural order, but sort the contribution from char classes in ascending order. With {charclass=>'sort+'} this can be simulated, and sort- is for descending order.
minus
1
Subtraction with { ,! } can also be turned off.
Basic support for explicit anchors (the anchors option) exists. This feature is known to be buggy and is therefore turned off by default. Only turn it on if you cannot live without it.
But maybe you were looking for the anchored-option anyway, which can be found in the following section about options for matching.
tilde
undef
This option enables tilde expansion:
say tg_expand '~{he,she,it,sking}/path',{tilde=>'/home/'};
More powerful possibilities are offered by using coderefs:
    sub tilde_expand ($$$)
    { my ($what,$arg,$delim)=@_;
      my $nyi= $what eq '~' && $arg!~/^[+-\d]/ && $delim=~qr'^/?$';
      return unless $nyi;  # don't change
      File::HomeDir->${$arg eq '' ? \'my_home' : \'users_home'}($arg)
    }
    say tg_expand $p='~{he,she,it,sking}/path',{tilde=>\&tilde_expand};
Typical meanings (mentioned here only so you know what is your part ;-):
    ~user, ~{user}  File::HomeDir->users_home($user)
    ~               File::HomeDir->my_home
    ~-              $ENV{OLDPWD}  # or whatever is available in perl
    ~+n             (`dirs`)[$n]
    ~-n             (`dirs`)[-$n-1]
    =file           File::Which::which($file)
The subref/closure variant cannot be combined with the tree or chunk option. It is also not available in combination with the object interface. Tilde expansion applies only at the beginning of patterns. Unlike in the shell, a path list separator (e.g. : under Unix) is not honored. Split the list yourself beforehand.
break
Simple patterns can all too easily generate big lists.
tg_expand '[0123][0123][0123][0123][0123]',{break=>1000}; # die
This option sets an upper bound for the size of a generated list. The call dies if this limit is reached, so wrap it in an eval block to catch the exception if you turn this feature on.
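The idea of an upper bound that is checked while the result grows, and caught with eval, can be sketched as follows. The function guarded_size, its error text and the exact comparison are assumptions made for this illustration; the module's own bookkeeping is described only roughly in the list of assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: predict the size of a cross product and die
# as soon as the running product exceeds the given limit.
sub guarded_size {
    my ($limit, @set_sizes) = @_;
    my $size = 1;
    for my $n (@set_sizes) {
        $size *= $n;
        die "limit of $limit elements exceeded\n" if $size > $limit;
    }
    return $size;
}

# Five classes of four characters each give 4**5 = 1024 elements,
# which is above a limit of 1000.
my $size = eval { guarded_size(1000, (4) x 5) };
if (defined $size) { print "size: $size\n" }
else               { print "refused: $@" }
```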
Some assumptions are made:
Only sets which are going to be constructed are handled. The reasoning is that for matching, more complex patterns are processable, and so they are accepted.
The sizes of interim sets are checked.
Checks are only done when a quantifier is used, when a cross product happens, and for ranges.
Ranges are often only roughly guessed, ...
No analysis of cost is done, so {aaaa-zzzz} is considered to have the same costs as [a-z][a-z][a-z][a-z], and even {[a-z][a-z][a-z][a-z],![a-z][a-z][a-z][a-z]}.
Only the growth in the number of elements counts; the growth in size of a single element is not considered. The important exception here is ##n. Without it the value of the whole option would be questionable.
The size of a single element from the input always counts as one.
So the last point means, that you must also restrict the length of the input field! Otherwise:
tg_expand 'a'x10_000_000,{break=>1} # no die, the death himself
The value you should set for break depends on the computing power you have. The following values are from a weak machine and should be considered a starting point. A value between 1000 and 3000 seems reasonable if you forbid the subtractive pattern by setting {minus=>0}. With this costly feature enabled, a value of 100 seems to fit better.
Also set the stepsize-option to a reasonable value, e.g. -100.
And please don't rely on this feature. This is most likely not ready for security sensitive production environments! Maybe combining it with modules like Time::Out helps.
Ranges with a step size can be turned off or limited with the stepsize option.
undef: If set to undef the step size feature is completely turned off. Then step sizes are not recognized as such and the appendage is interpreted as a part of the range's end point.
0: A value of 0 means no limit.
>0: If a number greater than zero is set, then this is the maximal allowed step size. If this size is exceeded, an exception is thrown. See eval in perlfunc for handling.
<0: If a negative number is given, this is a kind of soft limit that influences the internal element count prediction. This only has an effect if break is also set.
    tg_expand '{1-100-100}',{stepsize=>-10,break=>10} # '1'
    tg_expand '{1-100-100}',{stepsize=>-10,break=>9 } # die
Note: Currently, for non-integer ranges an extended magic increment is used, which can get CPU intensive if big steps are used. So one of the reasons for the restriction is that 'magic' arithmetic operations are not yet programmed, and so the work is delegated to repeated increments.
star
?,*,**,***
The wildcards ***, **, * and ? can also be turned off. With {star=>0} these symbols stop being special and are taken verbatim. With {star=>1} they can be brought back later. Enabling them selectively is also possible.
twin
**,***
Degrades the twin star ** and the triplet star *** wildcards. With {twin=>0} these act like normal stars. This feature is sometimes called globstar.
In addition, {twin=>'**+'} switches ** to *** behavior and {twin=>'***-'} switches *** to ** behavior. To switch them off completely, see the star-option.
All inputs are equal. But with the unchar option you can define 'non grata' chars, which are not matched by the wildcards * and ?. This setting is ignored by the ***-wildcard, and has a special meaning for **. If it is unset (the default), then ***, ** and * act identically.
textglob 'fo*ba*','foo/bar','foobaz',{unchar=>'/'} # foobaz only
If the argument to the option looks like a character class, the interpretation is likewise.
If you don't want to match multiline texts, use unchar=>"\n".
If the string starts (or continues after an unchar) with one of the characters mentioned in the unhead-option, it is hidden from the result unless it is explicit in the pattern.
Under Unix files beginning with a leading dot are called hidden. They are second class citizens (or are they éminence grise?) which are only visible if explicitly requested.
textglob {unhead=>'.'},'*', qw'. .. .bashrc fine your.txt' # find: fine, your.txt
So {unhead=>'.',unchar=>'/'} serve dotglob behavior.
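The effect of the example above, the pattern '*' with unhead=>'.', can be imitated with a plain grep. This sketch covers only that single case and is not how the module evaluates the option.

```perl
use strict;
use warnings;

my @names = qw(. .. .bashrc fine your.txt);

# A leading dot hides the name from a wildcard at the start.
my @visible = grep { !/\A\./ } @names;
print "@visible\n";   # fine your.txt
```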
With the tree option, instead of a list of text a list of listrefs is returned, where brace subgroups are themselves refs (natural list refs or scalar refs as marker). This can be seen as an alternative to the ->format feature. Compare it to the capturing feature for matching.
tg_expand 'a{b,c}d',{tree=>1} # ['a',\'b','d'], ['a',\'c','d']
With the chunk option, instead of a list of text a LoL structure is returned, where the listrefs hold the ordered single chunks. This and the previous feature are only available for textglob_expand.
tg_expand 'a{b,c}d',{chunk=>1} # [qw'a b d'], [qw'a c d']
case
Normally matches are case sensitive, {case=>1}. But you can choose to ignore case with {case=>0}. Besides that, an extended case mode {case=>2} exists, where uppercase characters match only uppercase, but lowercase characters match both. This mode is best known from search engines. There is also an uppercase variant, {case=>3}, where uppercase letters match both.
    my @v=qw'ABC abc Abc aBC aBc Abd';
    tg_grep {case=>0}, 'Abc', @v  # all except Abd
    tg_grep {case=>1}, 'Abc', @v  # only Abc
    tg_grep {case=>2}, 'Abc', @v  # ABC Abc
    tg_grep {case=>3}, 'Abc', @v  # abc Abc
    tg_grep {case=>-1},'Abc', @v  # aBC
    tg_grep {case=>-2},'Abc', @v  # ABC aBC
    tg_grep {case=>-3},'Abc', @v  # abc aBC aBc
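The {case=>2} rule (lowercase matches both cases, uppercase only itself) can be expressed as an ordinary regular expression. The helper smartcase_re is a hypothetical name for this sketch; it handles literal ASCII strings only and is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal string into an anchored regex in
# which lowercase letters match either case and everything else is verbatim.
sub smartcase_re {
    my ($literal) = @_;
    my $body = join '',
        map { /[a-z]/ ? "(?i:$_)" : quotemeta($_) } split //, $literal;
    return qr/\A$body\z/;
}

my @v    = qw(ABC abc Abc aBC aBc Abd);
my $re   = smartcase_re('Abc');
my @hits = grep { $_ =~ $re } @v;
print "@hits\n";   # ABC Abc
```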
There is also a mode for people with a defective shift key, {case=>4}, where the first character of every word, if lowercase, matches both cases.
    tg_grep {case=>4}, 'abc', @v  # abc Abc
    tg_grep {case=>-4},'ABC', @v  # abc Abc
Besides that, a CamelCase mode {case=>5} also exists:
    tg_grep {case=>5}, 'CamelCase',qw'CamelCase camel_case camelcase'
    tg_grep {case=>-5},'CamelCase',qw'cAMELcASE c_a_m_e_lc_a_s_e cc'
      # find first and second
a,z
Normally a pattern search tests whether the whole string fits. By turning off anchoring with the anchored option, a part of the string is enough for matching. This is useful if you want to combine parts, because enclosing the pattern in * doesn't help with that. Single-sided anchoring is available by setting the option to ^ or $, or to a or z.
    my @horoscopes= ...
    .. $astro=~/${ \tg_re '[[:zodiac:]]*',
                  {anchored=>0,greedy=>1,unchar=>'[[:zodiac:]]'} }/g;
    .. $astro=~/${\tg_re'[[:zodiac:]]',{anchored=>0}} [\pP\w\s]*/xg;
    .. split /(?=${\tg_re '[[:zodiac:]]',{anchored=>0}})/,$astro;
Of course for that simple case, you can write:
    my @horoscopes=grep !/^.$|\Q$astro/,
       $astro=~tg_re '{[[:zodiac:]]*}#12',{capture=>1};
The Interpolation module is recommended as assembly adhesive. If you only want to pimp up your REs, have a look at Regexp::Common.
invert
Inverts the matching. Only for use with textglob_match and textglob_grep. It is also fine for textglob_glob and textglob, as long as you use those merely for their shorter names. But in cases where matching is mixed with expansion, it is unlikely to do what you want.
capture
is only meaningful for textglob_re. With it, {} and [] act as capture groups.
    my $re=tg_re 'A {v* ,}{*} story',{capture=>1};
    my @r='A very short story'=~/$re/; # 'very ', 'short'
It interacts slightly with rewrite. You can use grep defined,(...=~/$re/) to equalize the differences between these modes.
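To show what capturing means for the example pattern, here is a hand-written regular expression with one group per brace pair. It is an assumed equivalent for illustration; the regexp actually generated by textglob_re may look different.

```perl
use strict;
use warnings;

# 'A {v* ,}{*} story' written by hand: the first group is either
# "v... " or empty, the second group takes the rest before " story".
my $re = qr/\AA (v.*? |)(.*?) story\z/;

my @r = 'A very short story' =~ $re;
print join('|', @r), "\n";   # very |short
```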
A common interface between expanding and matching would be nice, but OTOH that way it was easy to implement. It is here because it was easier to code it than to explain why it is left out. This option is likely to change or disappear in the future.
greedy
Default behaviour for *, ** and *** is non-greedy (0), you can switch to greedy (1) and possessive (2).
    'eggshells'=~tg_re '{egg*s}*',{greedy=>0,capture=>1} # eggs
    'eggshells'=~tg_re '{egg*s}*',{greedy=>1,capture=>1} # as is
    tg_match 'sim*.bim',  'simsala.bim',{greedy=>2,unchar=>'.'} # 1
    tg_match 'sim***.bim','simsala.bim',{greedy=>2,unchar=>'.'} # 0
Best you forget that this option exists. (Consider using Regexp.)
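The three modes correspond to Perl's lazy, greedy and possessive quantifiers. The regular expressions below are written by hand to mirror the examples above, assuming that * maps to .* (or to [^.]* when unchar=>'.' applies); they are not the output of textglob_re.

```perl
use strict;
use warnings;

# Non-greedy versus greedy capture on the same string.
my ($lazy)   = 'eggshells' =~ /\A(egg.*?s).*?\z/;
my ($greedy) = 'eggshells' =~ /\A(egg.*s).*\z/;
print "$lazy $greedy\n";   # eggs eggshells

# Possessive: the quantifier never gives back what it has taken.
my $restricted   = 'simsala.bim' =~ /\Asim[^.]*+\.bim\z/ ? 1 : 0;
my $unrestricted = 'simsala.bim' =~ /\Asim.*+\.bim\z/    ? 1 : 0;
print "$restricted $unrestricted\n";   # 1 0
```

The second possessive match fails because .*+ swallows the rest of the string, including ".bim", and cannot backtrack.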
last
The unescaping/dequoting of this module mostly follows filter semantics. So different kinds of data processing can be stacked together. Normally the escaping of the escape, so that that is verbatim, in our case a backslashed backslash \\, should be only removed from the last stage. So usage requires no knowledge of the filter stack depth. So composited tools can be seen as a blackbox. If this module is not the last stage, you can set this option to 'off' last=>0, then \\ would not be dequoted, an the protected and the protecting backslash would be handled down as they are. This applies only to expansion.
rewrite
Instead of expanding, the pattern is only rewritten to a normalised, simpler form. This is the default interim format for matching.
tg_expand 'foo{[ab][01]}#2{[ab][01]}##2ba[rz]',{rewrite=>1} # foo{{{a,b}{0,1}}{{a,b}{0,1}}}{a0a0,a1a1,b0b0,b1b1}ba{r,z}
If necessary - for ## element repeat or !... subtraction - the pattern is partly expanded. Also the last-option is ignored, and always off. The wildcards are transferred as is, so under expansion the star-option is meaningless.
It is useful for debugging, to see the pattern differently, or to detect whether rewrite=>0 changes what is matched (it shouldn't).
Another use case is feeding the rewritten pattern to another module which understands only basic patterns, while you prefer the fancy ones.
backslash
In combination with unchar, backslash can be used to allow a preceding backslash (in the text domain) to disable that special meaning. Besides that, the backslashed sequence counts as a single char for the ?-wildcard. Remember that this option only affects wildcards, so the backslash must be written out in explicit parts of the pattern.
Unstable, and candidate for removal.
my @v=("ab", "c\nd", "e\\\nf");
tg_grep '*',@v,{ unchar=>"\n",backslash=>"\n" } # "ab", "e\\\nf"
tg_grep '???', @v, { backslash=>1 } # "c\nd","e\\\nf"
tg_grep "?\\\\\n?", @v, {backslash=>...} # "e\\\nf"
The dequoting in pattern space is not changed in any way. Don't allow this option to confuse you.
The following error messages are defined and are thrown in the respective condition.
Useless call of ... in void context.
The function is called without having the possibility to return a result.
Unknown option
.
The given option isn't understood.
Error in option setting: Scope of use declaration not found.
You have loaded this module by something other than a normal use statement. In such a case a textglob_options call can trigger this error. Add an explicit use before. Otherwise you are restricted to feeding the options directly.
Too much (> ...).
If break is set, and that limit is reached.
Step size too wide (>
If stepsize is greater than zero, and that limit is reached.
Can't load
Unknown module ... requested.
textglob_foreign doesn't know the requested module or has trouble loading it.
Because options can be preset pragma-style with lexical scope, the following constructs are not supported in this regard:
{ use Text::Glob::DWIW ...; } ... # outside the scope
use Text::Glob::DWIW (); # preset feature also turned off
require Text::Glob::DWIW; # not even turned on
eval "use Text::Glob::DWIW ...;" ...
If options are set in such situations, they are silently ignored. Furthermore textglob_options called in such context and without an existing upper scope declaration will throw an exception.
Note: In the case that some programmatic control over module loading is needed, you can use use if $test, ... and use maybe ....
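The following sketch shows both points with core modules only: use if for conditional loading at compile time, and a core pragma as an analogy for lexically scoped settings. List::Util and integer are stand-ins chosen for this example; they are not related to Text::Glob::DWIW.

```perl
use strict;
use warnings;

# 'use if' evaluates its condition at compile time and only then loads and
# imports the module. List::Util is just a core stand-in for the example.
use if $] >= 5.008, 'List::Util' => qw(sum);

my $total = sum(1, 2, 3);              # sum() was imported by the line above

# A core pragma shows what lexical scoping means for preset options:
# the setting is visible inside the block and gone outside of it.
my ($inside, $outside);
{
    use integer;
    $inside = 7 / 2;                   # integer division inside this scope
}
$outside = 7 / 2;                      # the pragma ended with the block

print "$total $inside $outside\n";     # 6 3 3.5
```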
It is assumed that only small patterns are typically used. No optimisation for speed or against memory exhaustion is considered.
Nearly no error handling and recovery is built in. If you feed garbage, you get garbage back - most of the time. This do-the-next-best-thing strategy also means that no forward compatibility exists. So most likely your code must be adapted for new releases.
Instead of following a clear design, this module was developed in a more dirty and hackish way. So regexps and in-band signaling are heavily used. Mutual recursion is used so excessively that the resulting code convolution is best called higher order spaghetti.
Pretty sure (see design caveats ;-). Anyway, if you catch one, mail how to reproduce it, what you got and what you expected. And maybe also which behaviour you rely on not changing, as an action-result pair.
This module may have some advantages over TGE and SGP. I wrote it to get a glob expander which possesses those particular features. The only reason why I hacked the matching features in was my dislike of a longish name like Text::Glob::Expand::DWIW. So the non-expander functionality is a bit rudimentary.
Especially negative character classes would be useful.
[!ab]
Not yet implemented. Sorry for that.
tg_grep '.*',{unhead=>'.'} also matches . and .. Most shells allow suppressing this behavior. You can add an extra layer for filtering these out:
tg_grep {invert=>1,unchar=>'/'},'**{.,..}**', tg_grep ....
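The same filtering can be done without this module by a plain grep over the result list. The regex below is a hand-written alternative to the inverted pattern above, not a translation produced by the module.

```perl
use strict;
use warnings;

my @hits = ('.', '..', '.bashrc', 'dir/..', 'dir/.hidden', 'a/./b');

# Drop every entry that has a path component which is exactly '.' or '..'.
my @kept = grep { !m{(?:^|/)\.\.?(?:/|\z)} } @hits;

print "$_\n" for @kept;                # .bashrc and dir/.hidden remain
```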
/usr//tmp/* (cleanup input instead). Look at the ->cleanup-method which is offered by Path::Class. This can also help with the following points.
Replace /./ with / and remove ./ at start and /. at end beforehand.
Remove /*/.. repeatedly and also */.. from the start.
Depending on your needs replace 'D:foobar' with Cwd::getdcwd('D:').'\\foobar' or 'D:**\\foobar'
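The slash and single-dot clean-ups described above can be done with a few substitutions before the pattern is handed over. cleanup_path is a hypothetical helper written for this example. It works purely on the text, leaves .. components alone (resolving them textually is only safe without symlinks), and does not handle drive letters.

```perl
use strict;
use warnings;

# Hypothetical helper, purely textual: it never looks at the file system.
sub cleanup_path {
    my ($path) = @_;
    $path =~ s{//+}{/}g;                    # /usr//tmp  ->  /usr/tmp
    1 while $path =~ s{/\./}{/};            # a/./b      ->  a/b
    $path =~ s{^(?:\./)+}{};                # ./a        ->  a
    $path =~ s{/\.\z}{} if $path ne '/.';   # a/.        ->  a
    return $path;
}

print cleanup_path('/usr//tmp/*'), "\n";    # /usr/tmp/*
print cleanup_path('./a/./b/.'),   "\n";    # a/b
```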
No special, write explicitly \{\}.
( )
exists neither for matching nor expanding. But an every-{}-and-every-[]-is-a-marker/selector is available. If the capture-option is not enough or too cumbersome, use Regexp. This is what they are for. For expanding, only the following restricted possibilities exist: searching through the result sets of the tree-option is one, and the Text::Glob::Expand-like tg_expand(...)->format(...) method is the other. But to emphasize: You have to know your pattern because every {} and [] is marked or selected.
If you have to deal with patterns that produce big result sets (and you don't like to experiment with less stable parts of this module, like the half-baked, for-demonstration-purposes-only functions textglob_foreign and textglob_expand_lazy), then sorry, this module is definitely not for you.
I thought about it, especially about doing it with index arithmetic, which allows random access without the need of holding anything of the result in memory. See textglob_foreign for a restricted example with the help of Set::CartesianProduct::Lazy.
Recursive patterns shouldn't be a problem. But for magic ranges basic arithmetic operations are needed. Also subtractive patterns and wildcards which match the actual expansion set are at least difficult, maybe even impossible to solve directly. Of course an extra layer with a memory-friendly hole and insertion store is possible. I hope you understand that this sounds too much like too much work. So my decision fell on the side of let-it-be instead of do-it-right.
Nevertheless it would open cool opportunities:
xx_expand('{1-*}[abc]{1-*}{1-*-2}')->[100*Inf**2+100] # result of this hypothetical routine would be: 33b1200 ;-)
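For finite alternatives the index arithmetic is just a mixed-radix decomposition of the position. The sketch below is not part of this module; nth_combination is a hypothetical name, and it only covers plain lists of alternatives, none of the magic ranges, subtractions or wildcards.

```perl
use strict;
use warnings;

# Return the n-th element (0-based) of the cartesian product of the given
# sets without building the product. The last set varies fastest, which is
# the order a brace expansion like {a,b}{0,1}{x,y,z} would produce.
sub nth_combination {
    my ($n, @sets) = @_;
    my @picked;
    for my $set (reverse @sets) {
        my $size = @$set;
        unshift @picked, $set->[ $n % $size ];   # digit for this position
        $n = int($n / $size);                    # carry on with the rest
    }
    return join '', @picked;
}

my @sets = ([qw(a b)], [0, 1], [qw(x y z)]);     # 2 * 2 * 3 = 12 results

print nth_combination($_, @sets), "\n" for 0, 5, 7;   # a0x a1z b0y
```

Each lookup costs one division per set, independent of the size of the result, so nothing of the result has to be kept in memory.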
Anyway I have never used globs for generating more than 1000 elements. (hmmm, maybe even only 100. But I also never tried to backup all my files in my inbox along with all those mails with big attachments in the same place. So in this 'modern' world my computer usage appears to be untypical. Ok I'm wrong, the youngsters today store the data on dropbox or skydrive, make a youtube video about it and then use an url shortener for posting on facebook. This way they can be sure that a north-american suction agency makes a backup. But with backups generally: how do you get it back when you need it? (Ok, maybe an additional backup onto the gmail account helps.))
Sorting can be done afterwards, and is an independent functionality - at least as long as it is not depending on the pattern. A minimalistic version of partial sorting is added for compatibility reasons {charclass=>'sort+'}. Details can be found in the options section.
There are a few popular syntax extensions, like the globbing of zsh or the VMS DCL syntax, e.g. triple dot ... instead of **. But this module has its own hard-wired syntax. (yet another. yuck.) Changing between syntaxes is offered by Regexp::Wildcards as its main feature.
Something like tg_grep('{**}/{*}.tar.gz',...)->format('%1/new/%2.tgz',{paired=>1}) would be nice, but is not implemented. You can use textglob_re with capture=>1, and then perl's s///. If you have to handle files then maybe File::GlobMapper fulfills your needs.
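The s/// route could look like the sketch below. The regex is written by hand and only stands in for what textglob_re might return for '{**}/{*}.tar.gz' with capture=>1; the regex the module really produces is not shown here and may differ.

```perl
use strict;
use warnings;

# Hand-written regex standing in for '{**}/{*}.tar.gz' with capture=>1
# (an assumption; the regex produced by textglob_re may look different).
my $re = qr{^(.*)/([^/]*)\.tar\.gz\z};

my @old = ('src/pkg/foo.tar.gz', 'bar.tar.gz', 'doc/readme.txt');

# Rewrite a copy of each name; names that do not match pass through unchanged.
my @new = map { (my $copy = $_) =~ s{$re}{$1/new/$2.tgz}; $copy } @old;

print "$_\n" for @new;
```

To get grep-like behaviour, filter the list with the same regex before the substitution.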
glob builtin, File::Glob
Regexp::Wildcards, Text::Glob, Regexp::Shellish, Regexp::SQL::LIKE
Text::Glob::Expand, String::Glob::Permute, String::Range::Expand, Regexp::Genex (Regexp based), Data::Generate (alienated)
File::GlobMapper which is part of IO::Compress, File::Wildcard
Routes::Tiny
List::Gen (swiss army knife), Set::CartesianProduct::Lazy (lightning), Set::CrossProduct (iterating), Iterator::Array::Jagged, Math::Cartesian::Product
List::Maker, List::Gen
File::HomeDir, File::Which, Path::Class, Cwd, Time::Out, Algorithm::Permute, Set::Scalar (a real member of Set::), Interpolation, Regexp::Common, HOP::Stream, Iterator::Simple (a worthy representative for all mentioned iterator packages), if core module, maybe (nearly a philosophy: just try it)
Some test files (t/02-*.t) are borrowed or adapted from other CPAN modules mentioned above which also provide glob functionality.
(c) 2013 Josef. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.