
NAME

Text::Glob::DWIW - Yet another Text::Glob{::Expand,}

SYNOPSIS

  use Text::Glob::DWIW ':all';
  say for textglob_expand 'glob{b,al replac}ing',
          'Text{[-_],::}Glob{[-_],::}{DWIW,DoWhatIWant}';
  my @r=textglob_grep 'a*c', qw(...);

DESCRIPTION

Text::Glob::DWIW implements glob(3) style expansion and also matching against text. If you want to look at usage examples first, jump to the textglob_expand explanation at the start of the FUNCTIONS section.

WHY

Several modules addressing this already exist on CPAN, e.g. Text::Glob::Expand and String::Glob::Permute for expanding, and a handful for matching. Moreover, perl itself comes with two variants of globbing - <  > aka glob and bsd_glob from File::Glob, a core module - which can be (mis)used for text expansion as well.

Given that plurality, this waste of CPAN's namespace demands some explanation.

In short, every module considered lacked at least one of the features I wanted:

  • separation from the file system, or a way to ensure that nothing interacts with it

  • character classes

  • recursive patterns, like nested braces

  • expansion

  • interpretation as a path, and the special handling that comes with it, can be turned off

  • simple interface, without excessive verbosity

  • order is determined by the pattern, and looks natural

  • syntax is not too far from what is found in most shells, and syntax extensions are integrated harmoniously

WHEN TO STAY AWAY

This module also has its issues, such as missing functionality and performance. To make your decision easier, there is a big MISSING section with hints about what you can do instead. See also SEE ALSO, which mentions other modules that might fit your needs better.

IMPORT TAGS

No functions are exported by default. They have fixed naming schemes from which you can select one: {textglob,tglob,tg}_*. So each function can be imported by three different names.

  :textglob_  import the subroutines so they begin with  textglob_... .
  :tglob_     ditto, except        tglob_...
  :tg_        again, this time as  tg_...
  :textglob   like :textglob_ but also import the  textglob() function.
  :tglob      ditto, but  textglob()  is renamed to tglob().
  :tg         you can guess it
  :all        load all under all available names.
  :use        makes  use TGDWIW { options }  available

Typical usage example: use Text::Glob::DWIW qw'textglob :tg_'.

FUNCTIONS

All the functions can be adapted by an option hash. In the following list only the long form is mentioned, and the short form is used in examples.

textglob_expand PATTERN ...

expands the glob following the syntax described under PATTERN SYNTAX. The interpretation can be adapted with options, which are given inside a hashref as first or last argument.

  tg_expand "[z-a]"  # z y x w v u t s r q ... h g f e d c b a
  tg_expand "[?-']"  # ? > = < ; : 9 8 ... 0 / . - , + * ) ( '
  tg_expand '[bcfglptwz]oo'    # boo coo foo ...
  tg_expand 'a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},}'
  # "abracadabra","abracadabr","abracadab","abracada",
  # "abracad","abraca","abrac","abra","abr","ab","a"

  tg_expand '{abra,{*}cad{*}}' # 'abra', 'abracadabra'

Subtractive patterns are also available.

  tg_expand '{a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},},!abrac}'
  tg_expand '{a{b{r{a{c{a{d{a{b{r{a,},},},},},},},},},},!*a}'
  # "abracadabr", "abracadab", "abracad", "abrac", "abr", "ab"

For more of that look at PATTERN SYNTAX. It should be mentioned that a small addition to the pattern can produce a big (exponential) increase of the resulting set. Some may think it should be better named textglob_explode.

You are warned, this function is not for generating nasty big lists.

In numeric context (forcible with int) it returns the count of results - by expanding all of these in memory beforehand.

A more kludged but (most of the time) more efficient counterpart exists: textglob_expand_lazy.

textglob_expand(PATTERN ...)->format(FORMAT ...)

allows Text::Glob::Expand style decoration and capturing, with %<num>, %<num>.<num> etc. For details of the format string, see the Text::Glob::Expand documentation.

  print tg_expand('{foo,bar}')
        ->format("'%0' is used far too often in examples\n");
  print tg_expand('Robert {[A-C].,} Wilson')
        ->format("'%1' is the middle initial of '%0'\n");
  say tg_expand('{f[o]o,b[a]r}')->format("'%1.1' = middle char");
  say tg_expand('{{Cinderella}{fore},{Alice}{hind}}')
        ->format("%1.1 is be%1.2 the mirror.");
  say tg_expand('{file{0-100}}.tar.gz')->format("mv %0 %1.tgz");

  my %r=tg_expand('{f[o]o,b[a]r}')->format("%1.1",{paired=>1});
  # ( foo => 'o', bar => 'a' )

Other methods textglob_expand(PATTERN ...)->METHOD(...)

->elems() and @{  } list the result in the same form as a call in list context, ignoring the tree or chunk option. ->chunks() and ->tree() are equivalent to using the option with the same name. ->size() and int(  ) return the count of expanded elements.

It can be used as a very basic iterator with ->next(), $$o or <  >. Apart from interchangeability with textglob_expand_lazy, the value of these iterators is low because the whole result is created in one go.

Only textglob_expand can return an object. No further OO interfaces for other functions are currently implemented.

textglob_match PATTERN STRING ...

returns whether the string matches. In list context several strings can be tested at once.

  tg_match 'a*','aaa'           # true
  tg_match 'a*',qw'aaa abc a b' # 1 1 1 0

textglob_grep PATTERN STRING ...

returns the strings which match the pattern.

  tg_grep 'a*a', qw'aa abba abracadabra' # finds all

textglob_glob PATTERN STRING ...

works like textglob_grep but also adds expansions from wildcardless sub-patterns.

textglob PATTERN  or  textglob  or  textglob PATTERN STRING ...

The one-argument form acts like textglob_expand. The parameterless variant also, but uses $_ instead. In the other cases it mimics textglob_glob. Many people like such all-in-one functionality, but it's not everyone's cup of tea. If you prefer less magic use the more explicit variants.

textglob_re PATTERN ...

transforms the glob into a regexp.

textglob_foreign PATTERN ... CLASS

transforms the glob into other classes. So for example Text::Glob::Expand has caching and can manage slightly more data. I believe that the connection to TGE is stable but has some restrictions as explained later on.

  my $obj=tg_foreign '[[:card:]]#2' => 'Text::Glob::Expand';
  say join ',',$obj->explode;

If this is not sufficient, you can try to use modules which apply index arithmetic to calculate the values and thus offer random access without the need of calculating interim values.

  my $obj=tg_foreign '{1-100}#10' => 'Set::CartesianProduct::Lazy';
  say $obj->count;                     # 1e20
  say join ' ',$obj->get(1234567890);  # 1 1 1 1 1 13 35 57 79 91

  my $obj=tg_foreign '{1-100}#10' => 'List::Gen';
  say $obj->size;                      # 1e20
  say join ' ',$obj->get(1234567890);  # 1 1 1 1 1 13 35 57 79 91

Don't expect too much here. Whether you combine the strengths or the weaknesses of the two modules depends heavily on the pattern. The non-globbing backends are more a proof of concept to show what could be possible.
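The index arithmetic mentioned above can be sketched in plain Perl. This is only an illustration of the idea and not how this module or any of the listed backends is implemented; the name nth_tuple is hypothetical. It treats the zero-based index as a mixed-radix number whose digits select one element from each set.

```perl
use strict;
use warnings;

# Return the $index-th tuple (zero-based) of the cross product of @sets
# (array refs) without generating any of the tuples before it.
sub nth_tuple {
    my ($index, @sets) = @_;
    my @tuple;
    for my $set (reverse @sets) {            # the last set varies fastest
        unshift @tuple, $set->[ $index % @$set ];
        $index = int($index / @$set);
    }
    return @tuple;
}

# Ten copies of 1..100, comparable to the pattern {1-100}#10
my @sets = map { [1 .. 100] } 1 .. 10;
print join(' ', nth_tuple(1234567890, @sets)), "\n";
# 1 1 1 1 1 13 35 57 79 91
```

Because only one division per set is needed, the cost depends on the number of sets and not on the position requested.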

Transformations are defined for  Set::CartesianProduct::Lazy,  List::Gen,  HOP::Stream,  Set::CrossProduct,  Iterator::Array::Jagged  and  Text::Glob::Expand.

Options can be given as first argument, or immediately before the CLASS.

Here the following restrictions apply:

  • No support for explicit anchors

  • ranges, wildcards and subtractive patterns are pre-resolved by this module. Depending on their position and kind most of the work is done by this module. Thus no advantage can be gained through the use of a more powerful backend.

  • currently for non-globbing backends the recursion is resolved, even in cases where the backend understands it.

  • misinterpretation is more likely (for non-globbing backends).

    The lazy generation item in the MISSING section might also be of interest.

textglob_foreign PATTERN ... ITER_TYPE
  aka textglob_expand_lazy

textglob_foreign can also generate an iterator. The same restrictions apply as above.

To stress the point, this function is only partly lazy: at most the outermost expansion layer is handled lazily.

The following ITER_TYPEs are supported:

  'CODE'             returns a closure as iterator
  'REF'              a reference to an auto-advancing scalar(/array)
  '++'               simple(minded) iterator:  while (++$i) {...$$i...}
                     also in stock: while (defined(my $i=<..>)) {...}
  CALL => \&sub      iterate over and call sub each time
  'SIZE'             calculated size only
  'Iterator::Simple' instance of that class
  'Iterator::Simple::Lookahead'         -"-
  'Iterator'                            -"-

In the one-argument form textglob_expand_lazy assumes '++'. Apart from that it does not differ from textglob_foreign. The builtin iterator (++) should be able to mimic different styles of iterator interfaces and should therefore be combinable with a wide range of libraries offering 'list' processing.

Also note that the object returned from textglob_expand exports basic iteration support $$o (like REF-mode) and <  >.

textglob_options HASH

For more details see Setting of Options.

PATTERN SYNTAX

alternation {aa,bb,cc}
aka brace expansion
  tg_expand 'Read The F'.
            '{ine,unded,a{{bul,m}ous,ntastic},{ascinat,\*}ing} Manual'
  tg_expand '{m{eo,iao,e}w,purr...}m{eo,iao,e}w'    # concert at home

generates all possible combinations. The pattern {a,b}{0,1} results in the list a0, a1, b0 and b1. This means the number of resulting elements is the product of the element count of all sets. Be aware of that and take care.
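The combination rule can be sketched in plain Perl. This is a minimal illustration, not the module's implementation; cross_product and product_size are hypothetical helper names. It also shows how the result count can be known before anything is expanded.

```perl
use strict;
use warnings;

# Cross product of a list of sets, each given as an array ref.
# The first set varies slowest, which gives the order a0 a1 b0 b1.
sub cross_product {
    my @sets    = @_;
    my @results = ('');
    for my $set (@sets) {
        @results = map { my $prefix = $_; map { $prefix . $_ } @$set } @results;
    }
    return @results;
}

# The number of results is the product of the set sizes.
sub product_size {
    my $size = 1;
    $size *= scalar @$_ for @_;
    return $size;
}

my @sets = ([qw(a b)], [qw(0 1)]);
print join(' ', cross_product(@sets)), "\n";   # a0 a1 b0 b1
print product_size(@sets), "\n";               # 4
```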

A { or , may be quoted with a backslash to prevent it from being considered part of a brace expression. An alternation set may contain others.

alternation ranges {aa-zz}
aka sequence expression
  tg_expand '{aunt-away}'      # like perl's range op: aunt aunu ..
  tg_expand '{y0-2b2}'         # more: y0 y1 .. z0 z1 .. 1a0 1a1 ..
  tg_expand '{100-1,zero}'     # countdown
  tg_expand '{\-5-5}'          # -5 -4 -3 -2 -1 0 1 2 3 4 5
  tg_expand '{0.00-20.00}'     # 0.00 0.01 .. 19.98 19.99 20.00
  tg_expand '{(1)-(1_001)}'    # (1) (2) .. (999) (1_000) (1_001)
  tg_expand '{*******-*}'      # ******* ****** ***** **** *** ** *
  tg_expand ':{,\----------})' # pinocchio, animated

Ranges with negative numbers are only defined for integers (only sign and digits). Punctuation characters must be used in the same way in both end points, otherwise the result is undefined.

Some shells use .. instead of -, however AFAIK all use - for character ranges. In this module a decision for unification was made.

Important: Be careful not to create a range unintentionally.

alternation ranges with step size {aa-zz-5}
  tg_expand '{auto-bane-1000}'  # auto awga axsm azey
  tg_expand '{001-100-9}'       # 001 010 019 028 ... 082 091 100
  tg_expand '{-10-10-2}'        # -10 -8 -6 -4 -2 0 2 4 6 8 10
  tg_expand '{-10--20-2}'       # -10 -12 -14 -16 -18 -20

The step size must be a decimal integer value greater than zero. This means tg_expand '{a-z-0}' is interpreted as ranging from a to z-0. The step size for descending ranges is only well defined if the start- and end-point are part of the result set. For other possibilities to construct ranges see List::Maker and the powerful and comprehensive List::Gen.
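For plain integers the stepped range can be sketched as follows. This is an illustration only: stepped_range is a hypothetical name, and the sketch handles neither zero padding nor the string ranges the module supports. Starting the count at the start point for descending ranges is an assumption of this sketch.

```perl
use strict;
use warnings;

# Integer range from $from to $to in steps of $step.
# $step must be a positive integer; the direction follows the end points.
sub stepped_range {
    my ($from, $to, $step) = @_;
    die "step must be a positive integer\n" unless $step =~ /^[1-9]\d*$/;
    my @out;
    if ($from <= $to) {
        for (my $n = $from; $n <= $to; $n += $step) { push @out, $n }
    }
    else {
        for (my $n = $from; $n >= $to; $n -= $step) { push @out, $n }
    }
    return @out;
}

print join(' ', stepped_range(-10,  10, 2)), "\n";  # -10 -8 -6 -4 -2 0 2 4 6 8 10
print join(' ', stepped_range(-10, -20, 2)), "\n";  # -10 -12 -14 -16 -18 -20
```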

alternation subtraction {!bb} & {!mm-qq}

remove matching elements from the expansion set inside the same scope of braces. In other words, this operation is restricted to the nearest surrounding braces.

  tg_expand '{0-20,!*[13579]}'         # 0 2 4 6 8 ... 16 18 20
  tg_expand '{[a-d][a-d][a-d][a-d],!*{a*a,b*b,c*c,d*d}*}' # permute

Side note: The above example is for syntax demonstration only. To calculate permutations you had better use Algorithm::Permute or similar.

character classes [asdf]
  tg_expand '[bcdhlmprt]uff'           # ... ruff tuff

When matching, one of the characters has to match; when expanding, each character takes its place in the generated set.

Consider: [aaa] and [a] are not the same when expanding. The first delivers  a,a,a  analogous to {a,a,a} or a{,,}. Empty character classes are also allowed. Therefore [][] has no special meaning (it is two nothings), which entails that you must quote a closing bracket contained in a class: [\][].

Note: Nothing is a little bit imprecise. {} and [] represent the set whose one element is the zero-length string ''. If you need cross products with empty sets then look somewhere else, for example at  Set::Scalar->cartesian_product,  Set::CrossProduct  or  List::Gen->cross.

character ranges [a-z]

The one character wide counterpart of alternation ranges.

  tg_expand '[1-357-9]'  # 1, 2, 3,  5,  7, 8, 9
  tg_expand "[\0-\40]"   # "\0", "\1", "\2", ... ' '

By contrast, tg_expand "[\t- ]" results only in "\t", "\n", "\N{VT}", "\f", "\r", " ". When the start and end points both belong to the class of printable, whitespace or alarm bell characters, the generated output is also restricted to that class.

predefined char classes [  [:upper:]  ]
  tg_expand '[aeiou[:space:]]'
  tg_expand '[[:punct:]][[:punct:]]'   # potential twigils

The following predefined classes are supported and in case they do not constitute a very narrow set, they are restricted to the ASCII range.

  [:digit:]       cumbersome way to write [0-9]
  [:xdigit:]      [0-9a-f]
  [:punct:]       punctuation chars ala POSIX (locale ignored)
  [:space:]       whitespace
  [:blank:]       "\t" & " "
  [:lower:]       [a-z]
  [:upper:]       [A-Z]
  [:alpha:]       [A-Za-z]
  [:lowernum:]    [a-z0-9]
  [:uppernum:]    [A-Z0-9]
  [:cardsym:]     spade heart diam club, both colors black first
  [:card:]        playing cards
  [:die:]         all sides of a die
  [:chess:]       chessmen, white makes the first move
  [:mahjong:]     tiles of mah-jong
  [:trigram:]     base for the  i ching
  [:zodiac:]      signs of zodiac
  [:note:]        musical notes
  [:smiley:]      the unicode consortium knows about
                  59 different emotions expressible by
                  a circle with points and lines in it.
  [:planet:]      symbols of planets of our solar system, with the
                  sun (a star) so the name is not really correct.
  [:polygon:]     triangle,quadrangle,pentagon & hexagon
  [:legal:]       the sign for (C),(R),(P),SM,TM.
  [:roman:]       roman numerals (but not the ASCII substitute)

With a - at the end the generating order is reversed.

  [:digit-:]      [9-0]

subset from predefined char classes [  [:lower4-6,20:]  ]
  [:lower12-14:]  l, m & n
  [:cardsym1-4:]  black suit
  [:cardsym6,2:]  red and black heart
  [:card1:]       ace of spades
  [:card1,1,1,1:] swindler
  [:lowernum-27:] [9-0] again

The numbering begins with one, so [:luck1-:] is identical to [:luck:], and [:luck-1:] is identical to [:luck-:]. This numbering scheme has the pitfall that [:digit1:] is 0. Yet starting with one is more natural in most cases.
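The one-based numbering can be sketched in plain Perl. This is an illustration only, with the hypothetical helper class_subset; it handles single positions and ascending ranges, and deliberately leaves out the open-ended and reversed forms such as 1- and -1.

```perl
use strict;
use warnings;

# Pick elements from a class (array ref) by a one-based index list
# such as "12-14" or "6,2". Positions may repeat.
sub class_subset {
    my ($class, $spec) = @_;
    my @picked;
    for my $part (split /,/, $spec) {
        my ($from, $to) = $part =~ /^(\d+)(?:-(\d+))?$/
            or die "unsupported index spec '$part'\n";
        $to = $from unless defined $to;
        die "index out of range in '$part'\n"
            if $from < 1 || $to < $from || $to > @$class;
        push @picked, map { $class->[ $_ - 1 ] } $from .. $to;
    }
    return @picked;
}

print join(' ', class_subset(['a' .. 'z'], '12-14')), "\n";   # l m n
print join(' ', class_subset([0 .. 9], '1')), "\n";           # 0
```

The second call shows the pitfall named above: position 1 of the digits is 0.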

  tg_expand '[[:card1-11,13-25,27-39,41-53,55-56:]]'
  tg_expand '{[[:card1-56:]],![[:card12,26,40,54:]]}'
  # both: all cards without jokers and knights

pattern quantifier {  }#9, {  }#0-9
    and further [  ]#9, [  ]#0-9
aka list exponentiation
  tg_expand '[01]#8'                 # 0..255 in binary
  tg_expand '[abc]#0-1'              # optional, same as {,[abc]}
  tg_expand 'AB{inside comment}#0BA' # ABBA
  tg_expand ':[-]#0-8)'              # pinocchio again
  tg_expand '# {-=}#38-'             # decoration line
  tg_expand '[a]#10'                 # by the doctor
  tg_expand '1[_,]#0-1\200'          # =1{,[_,]}200: 1200 1_200 1,200

A {pattern}#n is the same as repeating the pattern n times ({pattern}{pattern}...). Being an expander feature it needs a finite upper bound. If you need more power, the Regexp::Genex module is worth a try.

For matching, using the builtin regexps is preferable. Most widely known are the (non-expanding) ksh style variants ?(), *(), +(), @() and !(). The use of # may look familiar to zsh users, but the meaning differs from zsh's (yet another) matching-only extension.

element repeat {  }##9 or [  ]##9
  tg_expand '[01]##8'                  # only: 00000000 11111111
  tg_expand ':[-]##8)'                 # pinocchio, unanimated
  tg_expand '# {-=}##38-'              # as #
  tg_expand '[a]##10'                  # as #
  tg_expand '{([_]#0-3)}##2'           # S-XXL
  tg_expand '{[a-d]#4,!{*[a-d]}##2*}'  # permutation again

Here the element is repeated, not the pattern. Where [ab]#2 produces  aa, ab, ba  and  bb,  the same as [ab][ab], [ab]##2 produces only  aa and  bb:  only the resulting element is duplicated. Mnemonic: the repetition is done later, hence two  #.
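The difference between the two operators can be sketched in plain Perl. This is an illustration of the semantics described here, not the module's code; pattern_repeat and element_repeat are hypothetical names.

```perl
use strict;
use warnings;

# {set}#n : repeat the pattern, i.e. the cross product of n copies of the set.
sub pattern_repeat {
    my ($set, $n) = @_;
    my @results = ('');
    for (1 .. $n) {
        @results = map { my $prefix = $_; map { $prefix . $_ } @$set } @results;
    }
    return @results;
}

# {set}##n : repeat each resulting element n times.
sub element_repeat {
    my ($set, $n) = @_;
    return map { $_ x $n } @$set;
}

print join(' ', pattern_repeat([qw(a b)], 2)), "\n";   # aa ab ba bb
print join(' ', element_repeat([qw(a b)], 2)), "\n";   # aa bb
```

The first grows exponentially with n, the second keeps the element count and only lengthens each element.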

wildcards ?, *, ** & ***
  *    Match zero or more of characters, except those listed in
       unchar=>'...'. Also tests against condition of unhead=>'...'
  ?    Match a single character, honors unchar- & unhead-option.
  ***  Match any string of characters by ignoring unchar & unhead.
  **   All letters (not restricted by unhead) are allowed inside,
       but bordered by 'unchar' against other characters.
       This whole-parts-only resembles multiple directory semantic.
       Fallback to *-behavior if unchar is not set.

The wildcards behave slightly differently when matching, subtracting or expanding. In expansions a wildcard stands only for a single best-fitting value instead of all.

For a better understanding of the difference between  ** and ***, here is a description of how to replace one by the other:

  ***   {*,*/**,**/*,*/**/*} # unchar=>'/' assumed, and unhead unset
  **    {/,/***/}            # ignoring cases at the start or end

Using tg_grep {unchar=>'/'},'a**d',... would match  a/d,  a/b/d,  a/b/c/d  and so on, but not  ad,  ab/d,  ab/cd.  The variant  a/**/d additionally doesn't match  a/d, while a***d matches all the examples above. The **-behavior (when the {unchar=>'/',unhead=>'.'} options are set) is comparable to that in many shells (after  shopt -s globstar is applied). Some examples, assuming we have a list of paths (a domain familiar to most) and unchar is set accordingly:

  /** or /***  match absolute paths
  ?***         match relative paths (whereas ?** = {?,?/**})
  **/ or ***/  any paths ending in / (aka marked directories)
  **file       that file anywhere
  ***ext       file with that ext, wherever
  **file.*     that file with whatever extension anywhere
  dir**        dir and all its subdirectories with all files inside
  dir/***ext   all files with that ext under that dir or its subdirs
  **subdir/*   all files inside such named subdirs
  **subdir**   subdir and everything beneath

See further in the option section for adaptable behavior, e.g. through options like unchar and unhead.

the escape character \

The backslash \ forces the following (meta)character to lose its special meaning, so that it is used verbatim.

word splitting (alternatives for csh'ish space separator)

Instead of the space separator of the original csh glob facility, you can use:

  textglob '{foo,bar}'
  textglob [qw'foo bar']

normal text

The rest constitutes normal text. This includes space and parentheses and, by default, also slash, tilde, equal sign and (leading) dot. Some of the option switches allow a more shellish handling.

anchors

By default the pattern is implicitly anchored at both sides. Besides suppressing this with *....*, an anchored option exists.

  tg_grep 'jam', qw'pyjamas jamboree',{anchored=>0}

If you are feeling lucky you can try another very experimental feature of explicit anchors. These can be turned on with anchors.

  tg_expand 'for{$,,ever and }ever',{anchors=>1}
  # for, forever, forever and ever
  tg_expand 'flop{^,$}flip',{anchors=>1}  # flip, flop

It is important to note that, when used for matching, the start anchor ^ has the restriction that only variable-length patterns which can shrink to zero length are allowed to precede it.

  tg_options anchors => 1;
  tg_grep 'flop{^,$}flip',qw'flip flop'     # only a flop
  tg_grep '*{/a/,^}bla', qw'where/ever/a/bla bla' # works

However, no such limitation applies to the end anchor $. (The behavior of the end anchor $ is more consistent with its use for expansion.)

  tg_grep 'for{$,,ever and }ever',      # fine, match all
          'for','forever','forever and ever'

Don't mix up the two options anchored and anchors! They have very different effects.

OPTIONS

Options influence the behavior and thereby extend the range of possible uses.

Setting of Options

Options can be supplied directly to the function call or already when loading the module, so you don't have to repeat them if you use the same options several times in a row.

  use Text::Glob::DWIW ':all', { unchar => '/' };
  tg_options { case => 0 };    # tg_options case => 0; also works
  say for tg_grep { anchored => 0 }, 'falling stars', ...;

Appended to the  use  statement

Here options must be specified as a hash reference at the end. This method only allows constant (compile-time known) values. The options take effect in all function calls inside the same lexical scope as the  use statement. Another  use  can be declared in a narrower scope. These options are set only once, at compile time, and are therefore not reset when the program flow reaches them again. The combined behavior with a  textglob_options call (from inside the same scope) is loosely comparable with   state variables.

As shorthand notation - instead of always writing out the full package name - the tag  :use can be added to the first import. After doing that,  use TGDWIW {  }  is available alternatively.

Through the  textglob_options  function

Here the options are valid within the scope of the nearest enclosing  use statement. There is no restriction to constants. If needed, a  use clause (with or without options) and a following  textglob_options  can be combined.

Directly supplied to the function call

The options must be handed over as the first or the last parameter in the form of a hash reference {  }. Options are only considered for that function call and override options set otherwise.

Warning: The presetting capability works only with an explicit  use  without scope-related indirections.

General Options

quant (default: '#,##')

The quantifier ...#n-m and ...##n can be turned off, then a  #  behind  {  } or [  ]  acts as a normal character.

range (default: '{},[]')

The {0-100} and [a-z] can be turned off. Then the hyphen-minus (-) is handled like a normal character.

charclass (default: 'def1,sort0')

Some character class features are also switchable. E.g. the predefined character classes [[:punct:]] can be turned off with {charclass=>'def0'}. The result is then as if the feature did not exist. For example [[:punct:]] is interpreted as a char class with [\[:punct:] and a following ] which generates  [], :], p], u], ... t], :].

Some shells generate brace sequences in natural order, but sort the contribution from char classes in ascending order. With {charclass=>'sort+'} this can be simulated, and sort- is for descending order.

minus (default: 1)

Subtraction with {   ,!  } can also be turned off.

anchors (default: '')

Basic support for explicit anchors exists. This feature is known to be buggy, and is therefore turned off by default. Turn it on only if you cannot live without it.

But maybe you have searched for the anchored-option anyway, which can be found in the following section about options for matching.

tilde (default: undef)

Through this option the handling of tilde expansion is available:

  say tg_expand '~{he,she,it,sking}/path',{tilde=>'/home/'};

More powerful possibilities are offered by using coderefs:

  use File::HomeDir;

  sub tilde_expand ($$$)
  { my ($what,$arg,$delim)=@_;
    # handle only ~ and ~user, followed by / or nothing (not ~+n, ~-n, ~-)
    my $nyi= $what eq '~' && $arg!~/^[-+\d]/ && $delim=~qr'^/?$';
    return unless $nyi; # don't change
    File::HomeDir->${$arg eq '' ? \'my_home' : \'users_home'}($arg)
  }
  say tg_expand $p='~{he,she,it,sking}/path',{tilde=>\&tilde_expand};

Typical meanings (mentioned here only so you know what is your part ;-):

  ~user, ~{user} File::HomeDir->users_home($user)
  ~              File::HomeDir->my_home
  ~-             $ENV{OLDPWD} # or whatever is available in perl
  ~+n            (`dirs`)[$n]
  ~-n            (`dirs`)[-$n-1]
  =file          File::Which::which($file)

The subref/closure variant cannot be combined with the tree or chunk option. It is also not available in combination with the object interface. Tilde expansion applies only at the beginning of patterns. Unlike in the shell, a path separator sign (e.g. : under Unix) is not honored. Split it yourself beforehand.

break (default: 0 # =off)

Simple patterns can all too easily generate big lists.

  tg_expand '[0123][0123][0123][0123][0123]',{break=>1000}; # die

This option sets an upper bound for the size of a generated list. The call dies if this limit is reached. If you turn this feature on, wrap the call in an eval block to catch the exception.
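The idea of a size guard caught with eval can be sketched in plain Perl. This is only an illustration under the assumption that interim sizes are checked before each cross product; guarded_cross_product is a hypothetical name and the module's own accounting is more involved, as the list of assumptions shows.

```perl
use strict;
use warnings;

# Cross product that dies as soon as an interim result would exceed $limit.
sub guarded_cross_product {
    my ($limit, @sets) = @_;
    my @results = ('');
    for my $set (@sets) {
        die "expansion limit of $limit exceeded\n"
            if @results * @$set > $limit;
        @results = map { my $prefix = $_; map { $prefix . $_ } @$set } @results;
    }
    return @results;
}

# Five sets of four elements each, comparable to [0123] five times: 4**5 = 1024
my @sets = map { [0 .. 3] } 1 .. 5;
my @list = eval { guarded_cross_product(1000, @sets) };
if (my $err = $@) { print "caught: $err" }
else              { print scalar(@list), " elements\n" }
# caught: expansion limit of 1000 exceeded
```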

Some assumptions are made:

  • Only sets which are going to be constructed are handled. The reasoning is that for matching, more complex patterns are processable, and so they are accepted.

  • The sizes of interim sets are checked.

  • Checks are only done when a quantifier is used, when a cross product happens, and for ranges.

  • Ranges are often only roughly guessed, ...

  • No analysis of cost is done, so {aaaa-zzzz} is considered to have the same costs as [a-z][a-z][a-z][a-z], and even {[a-z][a-z][a-z][a-z],![a-z][a-z][a-z][a-z]}.

  • Only the growth in the number of elements counts; the growth in size of a single element is not considered. The important exception here is ##n. Without it the value of the whole option would be questionable.

  • The size of a single element from the input always counts as one.

The last point means that you must also restrict the length of the input field! Otherwise:

  tg_expand 'a'x10_000_000,{break=>1} # no die, the death himself

The value you should set for break depends on the power you have. The following values are from a weak machine and should be considered a starting point. A value between 1000 and 3000 seems reasonable if you forbid the subtractive pattern by setting {minus=>0}. With this costly feature enabled, a value of 100 seems to fit better.

Set also the stepsize-option to a reasonable value e.g. -100.

And please don't rely on this feature. This is most likely not ready for security sensitive production environments! Maybe combining it with modules like Time::Out helps.

stepsize (default: 0 # =on, without restriction)

Ranges with step size can be turned off or limited.

  • undef: If set to undef the step size feature is completely turned off. Then step sizes are not recognized as such and the appendage is interpreted as a part of the range's end point.

  • 0: A value of 0 means no limit.

  • >0: If a number greater than zero is set, then this is the maximal allowed step size. If this size is exceeded, an exception is thrown. See eval in perlfunc for handling.

  • <0: If a negative number is given, this is a kind of soft limit that influences the internal element count prediction. This has an effect only if break is also set.

      tg_expand '{1-100-100}',{stepsize=>-10,break=>10} # '1'
      tg_expand '{1-100-100}',{stepsize=>-10,break=>9 } # die

Note: Currently an extended magic increment is used for non-integer ranges, which can get CPU intensive if big steps are used. So one of the reasons for the restriction is that 'magic' arithmetic operations are not yet programmed, and so the work is delegated to repeated increments.

Options which are specific for Wildcards

star (default: '?,*,**,***')

The wildcards ***, **, * and ? can also be turned off. With {star=>0} these symbols stop being special and are taken verbatim. With {star=>1} they can later be brought back. Individual wildcards can also be selected.

twin (default: '**,***')

Degrades the twin star ** and the triplet star *** wildcard. With {twin=>0} these act like normal stars. This feature is sometimes called globstar.

In addition, {twin=>'**+'} switches ** to *** behavior and {twin=>'***-'} switches *** to ** behavior. To switch the wildcards off completely, see the star option.

unchar (default: '')

All inputs are equal. But here you can define 'non grata' chars, which are not matched by the wildcards * and ?. This setting is ignored by the ***-wildcard, and has special meaning for the **. If unset - the default - then ***, ** and * act identically.

  textglob 'fo*ba*','foo/bar','foobaz',{unchar=>'/'}    # foobaz only

If the argument to the option looks like a character class, it is interpreted as one.

If you don't want wildcards to match across multiple lines, use  unchar=>"\n".
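How unchar restricts * and ? while *** ignores it can be sketched as a translation into a regexp. This is a simplified illustration in plain Perl, not what textglob_re produces: simple_glob_re is a hypothetical name, and the sketch ignores **, unhead, escaping, braces and character classes.

```perl
use strict;
use warnings;

# Translate *, ? and *** into an anchored regexp. * and ? never match a
# character listed in $unchar; *** matches anything.
sub simple_glob_re {
    my ($glob, $unchar) = @_;
    $unchar = '' unless defined $unchar;
    my $one = length($unchar) ? '[^' . quotemeta($unchar) . ']' : '.';
    my $re  = '';
    for my $token ($glob =~ /\*\*\*|\*|\?|[^*?]+/g) {
        if    ($token eq '***') { $re .= '.*' }
        elsif ($token eq '*')   { $re .= $one . '*' }
        elsif ($token eq '?')   { $re .= $one }
        else                    { $re .= quotemeta $token }
    }
    return qr/\A$re\z/s;
}

my $re = simple_glob_re('fo*ba*', '/');
print join(' ', grep { $_ =~ $re } 'foo/bar', 'foobaz'), "\n";   # foobaz
```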

unhead (default: '')

If the string starts (or continues after an unchar) with one of the characters mentioned in the unhead option, it is hidden from the result unless that character is explicit in the pattern.

Under Unix files beginning with a leading dot are called hidden. They are second class citizens (or are they éminence grise?) which are only visible if explicitly requested.

  textglob {unhead=>'.'},'*', qw'. .. .bashrc fine your.txt'
  # find: fine, your.txt

So {unhead=>'.',unchar=>'/'} serve dotglob behavior.

Options for Expanding

tree (default: 0)

Instead of a list of text, returns a list of listrefs where brace subgroups are themselves refs (natural list refs or scalar refs as marker). This can be seen as an alternative to the ->format feature. Compare it to the capturing feature for matching.

  tg_expand 'a{b,c}d',{tree=>1}  # ['a',\'b','d'], ['a',\'c','d']

chunk (default: 0)

Instead of a list of text, returns a LoL structure, where the listrefs hold the ordered single chunks. This and the previous feature are only available for textglob_expand.

  tg_expand 'a{b,c}d',{chunk=>1} # [qw'a b d'], [qw'a c d']

Options which influence Matching

case (default: 1)

Normally matches are case sensitive {case=>1}, but you can choose to ignore case with {case=>0}. Besides that, an extended case mode {case=>2} exists, where uppercase characters match only uppercase, but lowercase characters match both. This mode is best known from search engines. There is also an uppercase variant {case=>3}, where uppercase letters match both.

  my @v=qw'ABC abc Abc aBC aBc Abd';
  tg_grep {case=>0}, 'Abc', @v # all except Abd
  tg_grep {case=>1}, 'Abc', @v # only Abc
  tg_grep {case=>2}, 'Abc', @v # ABC Abc
  tg_grep {case=>3}, 'Abc', @v # abc Abc
  tg_grep {case=>-1},'Abc', @v # aBC
  tg_grep {case=>-2},'Abc', @v # ABC aBC
  tg_grep {case=>-3},'Abc', @v # abc aBC aBc
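The {case=>2} rule can be sketched for literal text in plain Perl. This is an illustration of the rule as described, restricted to ASCII letters and without any glob syntax; smart_case_re is a hypothetical name and not part of this module.

```perl
use strict;
use warnings;

# An uppercase pattern letter matches only itself,
# a lowercase pattern letter matches both cases.
sub smart_case_re {
    my ($text) = @_;
    my $re = '';
    for my $ch (split //, $text) {
        $re .= $ch =~ /[a-z]/ ? "[$ch" . uc($ch) . "]" : quotemeta($ch);
    }
    return qr/\A$re\z/;
}

my @v  = qw(ABC abc Abc aBC aBc Abd);
my $re = smart_case_re('Abc');
print join(' ', grep { $_ =~ $re } @v), "\n";   # ABC Abc
```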

There is also a mode for people with a defective shift key, {case=>4}, where the first character of every word, if lowercase, matches both cases.

  tg_grep {case=>4}, 'abc', @v # abc Abc
  tg_grep {case=>-4},'ABC', @v # abc Abc

In addition, a CamelCase mode {case=>5} exists:

  tg_grep {case=>5}, 'CamelCase',qw'CamelCase camel_case camelcase'
  tg_grep {case=>-5},'CamelCase',qw'cAMELcASE c_a_m_e_lc_a_s_e cc'
  # find first and second

anchored (default: 'a,z')

Normally pattern search is done by testing whether the whole string fits. With anchoring turned off, a match on part of the string is enough. This is useful if you want to combine parts, because enclosing the pattern in * doesn't help with that. Single-sided anchoring is available by setting the option to  ^ or $,  or to  a or z.

  my @horoscopes= ...
  .. $astro=~/${ \tg_re '[[:zodiac:]]*',
                  {anchored=>0,greedy=>1,unchar=>'[[:zodiac:]]'} }/g;
  .. $astro=~/${\tg_re'[[:zodiac:]]',{anchored=>0}} [\pP\w\s]*/xg;
  .. split /(?=${\tg_re '[[:zodiac:]]',{anchored=>0}})/,$astro;

Of course for that simple case, you can write:

  my @horoscopes=grep !/^.$|\Q$astro/,
                 $astro=~tg_re '{[[:zodiac:]]*}#12',{capture=>1};

The Interpolation module is recommended as assembly adhesive. If you only want to pimp up your REs, have a look at Regexp::Common.

invert (default: 0)

Inverts the matching. Only for use with textglob_match and textglob_grep. It is also fine for textglob_glob and textglob, as long as you use those only because of their shorter names. But in the cases where matching is mixed with expansion, it is unlikely to do what you want.

capture (default: 0)

Only has meaning for textglob_re. With it, {} and [] act as capture groups.

  my $re=tg_re 'A {v* ,}{*} story',{capture=>1};
  my @r='A very short story'=~/$re/;  # 'very ', 'short'

It interacts slightly with rewrite. You can use grep defined,(...=~/$re/) to equalize the differences between these modes.
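
The grep defined idiom itself is plain Perl. A minimal sketch with a hand-written regex (an assumption standing in for whatever textglob_re generates) shows why it evens things out: a group that took no part in the match yields undef in list context.

```perl
use strict;
use warnings;

# Hand-written regex with an optional capture group; when the first
# alternative is skipped, its group yields undef in list context.
my $re = qr/\AA (?:(v\w* )|)(\w*) story\z/;

my @with    = 'A very short story' =~ $re;   # ('very ', 'short')
my @without = 'A short story'      =~ $re;   # (undef, 'short')

# Dropping the undefined slots leaves only the groups that took part.
my @clean = grep defined, @without;          # ('short')
```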

A common interface between expanding and matching would be nice, but OTOH this way it was easy to implement. It's here because it was easier to code than to explain why it is left out. This option is likely to change or disappear in the future.

greedy (default: 0)

Default behaviour for *, ** and *** is non-greedy (0), you can switch to greedy (1) and possessive (2).

  'eggshells'=~tg_re '{egg*s}*',{greedy=>0,capture=>1}    # eggs
  'eggshells'=~tg_re '{egg*s}*',{greedy=>1,capture=>1}    # as is

  tg_match 'sim*.bim',  'simsala.bim',{greedy=>2,unchar=>'.'} # 1
  tg_match 'sim***.bim','simsala.bim',{greedy=>2,unchar=>'.'} # 0

Best you forget that this option exists. (Consider using Regexp.)
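
For comparison, the three behaviours look as follows in plain Perl regexes. The translation is an assumption made for this sketch: * is written as a quantifier that stops at the unchar '.', and *** as one that does not.

```perl
use strict;
use warnings;

my $text = 'eggshells';
my ($lazy)   = $text =~ /(egg.*?s).*?/;   # 'eggs'
my ($greedy) = $text =~ /(egg.*s).*/;     # 'eggshells'

# Possessive quantifiers never give back what they have consumed.
# [^.]*+ stops before the dot, so the rest of the pattern still fits.
my $bounded   = 'simsala.bim' =~ /\Asim[^.]*+\.bim\z/ ? 1 : 0;   # 1
# .*+ swallows the dot and everything after it, and the match fails.
my $unbounded = 'simsala.bim' =~ /\Asim.*+\.bim\z/    ? 1 : 0;   # 0
```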

Esoteric Options

last (default: 1)

The unescaping/dequoting of this module mostly follows filter semantics, so different kinds of data processing can be stacked together. Normally the escaping of the escape itself (in our case a backslashed backslash \\, which keeps the backslash verbatim) should only be removed by the last stage. That way usage requires no knowledge of the filter stack depth, and composed tools can be treated as a black box. If this module is not the last stage, you can switch this option off with last=>0; then \\ is not dequoted, and the protected and the protecting backslash are handed down as they are. This applies only to expansion.

rewrite (default: 0 for expand, 1 for matching)

Instead of expanding, the pattern is only rewritten to a normalised, simpler form. This is the default interim format for matching.

  tg_expand 'foo{[ab][01]}#2{[ab][01]}##2ba[rz]',{rewrite=>1}
  # foo{{{a,b}{0,1}}{{a,b}{0,1}}}{a0a0,a1a1,b0b0,b1b1}ba{r,z}

If necessary - for ## element repeat or !... subtraction - the pattern is partly expanded. Also the last-option is ignored, and always off. The wildcards are transferred as is, so under expansion the star-option is meaningless.

It is useful for debugging, to see the pattern differently, or to detect whether rewrite=>0 changes what is matched, which it shouldn't.

Another use case is feeding the rewritten pattern to another module which understands basic patterns, but you prefer the fancy ones.
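
To give a feeling for one of the rewrite steps, here is a sketch in core Perl that turns plain character lists into brace alternations. The names braces and classes_to_braces are hypothetical, and the sketch deliberately ignores ranges, named classes and quoting, which the real rewrite has to deal with.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration: rewrite a plain character list
# such as [rz] into the brace alternation {r,z}.
sub braces {
    my ($chars) = @_;
    return '{' . join(',', split //, $chars) . '}';
}

sub classes_to_braces {
    my ($pattern) = @_;
    $pattern =~ s/\[([^\]\[]+)\]/braces($1)/ge;
    return $pattern;
}

my $simple = classes_to_braces('ba[rz]');     # ba{r,z}
my $double = classes_to_braces('[ab][01]');   # {a,b}{0,1}
```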

backslash (default: '')

In combination with unchar, backslash can be used to allow a preceding backslash (in the text domain) to disable that special meaning. Besides that, the backslashed sequence counts as a single char for the ?-wildcard. Remember that this option only affects wildcards and so the backslash must be written out in explicit parts of the pattern.

Unstable, and candidate for removal.

  my @v=("ab", "c\nd", "e\\\nf");
  tg_grep '*',@v,{ unchar=>"\n",backslash=>"\n" } # "ab",  "e\\\nf"
  tg_grep '???', @v, { backslash=>1 }             # "c\nd","e\\\nf"
  tg_grep "?\\\\\n?", @v, {backslash=>...}        # "e\\\nf"

The dequoting in pattern space is not changed in any way. Don't allow this option to confuse you.

ERRORS

The following error messages are defined and are thrown in the respective condition.

Useless call of ... in void context.

The function is called without having the possibility to return a result.

Unknown option ....

The given option isn't understood.

Error in option setting: Scope of use declaration not found.

You have loaded this module by something other than a normal use statement. In such a case a textglob_options call can trigger this error. Add an explicit use beforehand. Otherwise you are restricted to passing the options directly.

Too much (>...).

If break is set, and that limit is reached.

Step size too wide (>...).

If stepsize is greater than zero, and that limit is reached.

Can't load ...<!>
Unknown module ... requested.

textglob_foreign doesn't know the requested module or has trouble loading it.

PITFALLS

Because of the pragmata-style capability of lexical-scoped presetting options, the following incompatible constructs are not supported in these regards:

  • { use Text::Glob::DWIW ...; } ...   # outside the scope

  • use Text::Glob::DWIW ();            # preset feature also turned off

  • require Text::Glob::DWIW;           # not even turned on

  • eval "use Text::Glob::DWIW ...;" ...

If options are set in such situations, they are silently ignored. Furthermore  textglob_options  called in such context and without an existing upper scope declaration will throw an exception.

Note: In the case that some programmatic control over module loading is needed, you can use  use if $test, ... and use maybe ....

CAVEATS

It is assumed that only small patterns are typically used. No optimisation for speed or against memory exhaustion is considered.

Nearly no error handling and recovery is built in. If you feed garbage, you get garbage back - most of the time. This do-the-next-best-thing strategy also means that no forward compatibility exists. So most likely your code must be adapted for new releases.

Instead of a clear design, this module was developed in a more dirty and hackish way. So regexps and inbound signaling are heavily used. Mutual recursion is used in such excessivity, that the resulting code convolution is best called higher order spaghetti.

BUGS

Pretty sure (see design caveats ;-). Anyway, if you catch one, mail how to reproduce it, what you got and what you expected. And maybe also which behaviour you rely on not to change, as an action-result pair.

MISSING

This module may have some advantages over TGE and SGP. I wrote it to get a glob expander which possesses those particular features. The only reason why I hacked the matching features in was my dislike of such a longish name as Text::Glob::Expand::DWIW. So the non-expander functionality is a bit rudimentary.

Especially negative character classes would be useful.

Negative character classes [!ab]

Not yet implemented. Sorry for that.

Special Treatment of .. & .

tg_grep '.*',{unhead=>'.'} matches . and .. Most shells allow suppressing this behavior. You can add an extra layer to filter these out:

  tg_grep {invert=>1,unchar=>'/'},'**{.,..}**', tg_grep ....

Understanding of Path Syntax

repeating slash

/usr//tmp/* (cleanup input instead). Look at the ->cleanup-method which is offered by Path::Class. This can also help with the following points.

current directory .

Replace /./ with / and remove ./ at start and /. at end beforehand.

parent directory ..

Remove /*/.. repeatedly and also */.. from the start.
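
The cleanup steps for repeating slashes, . and .. can be sketched in core Perl as follows. The function name clean_path is hypothetical, and the sketch works purely on the text: it knows nothing about symbolic links or volumes.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: textual cleanup of a path before
# it is used as a pattern or matched against one.
sub clean_path {
    my ($path) = @_;
    my $absolute = $path =~ m{\A/};
    $path =~ s{/+}{/}g;                  # repeating slash
    1 while $path =~ s{/\./}{/};         # /./ in the middle
    $path =~ s{\A(?:\./)+}{};            # ./ at the start
    $path =~ s{/\.\z}{};                 # /. at the end
    # Remove "segment/.." pairs, inside the path and at its start,
    # but never treat ".." itself as the segment.
    1 while $path =~ s{/(?!\.\./)[^/]+/\.\.(?=/|\z)}{}
         || $path =~ s{\A(?!\.\./)[^/]+/\.\.(?:/|\z)}{};
    return $path if length $path;
    return $absolute ? '/' : '.';        # everything cancelled out
}

print clean_path('/usr//tmp/./x/../y/.'), "\n";   # /usr/tmp/y
print clean_path('./a/../b'), "\n";               # b
```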

volumes

Depending on your needs replace 'D:foobar' with  Cwd::getdcwd('D:').'\\foobar' or 'D:**\\foobar'

csh'ish empty {}

Nothing special; write \{\} explicitly.

Independent capturing support (  )

exists neither for matching nor for expanding. But an every-{}-and-every-[]-is-a-marker/selector is available. If the capture option is not enough or too cumbersome, use Regexp. This is what they are for. For expanding, only the following restricted possibilities exist: searching through the result sets of the tree option is one, and the Text::Glob::Expand-like tg_expand(...)->format(...) method is the other. But to emphasize: you have to know your pattern, because every {} and [] is marked or selected.

lazy generation

If you have to deal with patterns that produce big result sets (and you don't like to experiment with less stable parts of this module, like the half-baked for demonstration purposes only functions textglob_foreign and textglob_expand_lazy), then sorry this module is definitely not for you.

I thought about it, especially about doing it with index arithmetic, which allows random access without the need of holding anything of the result in memory. See textglob_foreign for a restricted example with the help of Set::CartesianProduct::Lazy.

Recursive patterns shouldn't be a problem. But for magic ranges basic arithmetic operations are needed. Also subtractive patterns and wildcards which match the actual expansion set are at least difficult, maybe even impossible to solve directly. Of course an extra layer of memory-friendly hole and insertion store is possible. I hope you understand that this sounds like too much work. So my decision fell on the side of let-it-be instead of do-it-right.

Nevertheless it would open cool opportunities:

  xx_expand('{1-*}[abc]{1-*}{1-*-2}')->[100*Inf**2+100]
  # result of this hypothetical routine would be: 33b1200  ;-)
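
For finite alternatives the index arithmetic is simple enough to sketch in core Perl. The function nth_combination is a hypothetical name and not part of this module; it picks the n-th element of a cross product by treating the index as a mixed-radix number, so nothing of the result set has to be held in memory.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: random access into the cross
# product of several finite sets, as in {a,b}{0,1}{x,y,z}.
sub nth_combination {
    my ($index, @sets) = @_;
    my @picked;
    for my $set (reverse @sets) {        # the last set varies fastest
        my $size = @$set;
        unshift @picked, $set->[$index % $size];
        $index = int($index / $size);
    }
    return join '', @picked;
}

my @sets = ([qw(a b)], [0, 1], [qw(x y z)]);
print nth_combination(0,  @sets), "\n";   # a0x
print nth_combination(7,  @sets), "\n";   # b0y
print nth_combination(11, @sets), "\n";   # b1z
```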

Anyway I have never used globs for more than generating 1000 elements. (hmmm, maybe even only 100. But I also never tried to backup all my files in my inbox along with all those mails with big attachments in the same place. So in this 'modern' world my computer usage appears to be untypical. Ok I'm wrong, the youngsters today store the data on dropbox or skydrive, make a youtube video about it and then use an url shortener for posting on facebook. This way they can be sure that a north-american suction agency makes a backup. But with backup generally: how do you get it back when you need it? (Ok, maybe an additional backup onto the gmail account helps.))

Sorting option

Sorting can be done afterwards, and is an independent functionality - at least as long as it is not depending on the pattern. A minimalistic version of partial sorting is added for compatibility reasons   {charclass=>'sort+'}. Details can be found in the options section.

Syntax switching

There are a few popular extensions, like the globbing of zsh or the VMS DCL syntax, e.g. triple dot ... instead of **. But this module has its own hard-wired syntax. (yet another. yuck.) Switching between syntaxes is offered by Regexp::Wildcards as its main feature.

Substitution

Something like  tg_grep('{**}/{*}.tar.gz',...)->format('%1/new/%2.tgz',{paired=>1}) would be nice, but is  not  implemented. You can use  textglob_re  with  capture=>1, and then perl's  s///.  If you have to handle files then maybe File::GlobMapper fulfills your needs.
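
The capture-and-substitute route looks roughly like this in plain Perl. The regex is hand-written for the sketch and is an assumption; it only stands in for what textglob_re with capture=>1 might give you for '{**}/{*}.tar.gz'.

```perl
use strict;
use warnings;

# Hand-written stand-in regex: $1 is the directory part, $2 the base
# name without the .tar.gz suffix.
my $re = qr{\A(.*)/([^/]*)\.tar\.gz\z}s;

my @files   = qw(src/a.tar.gz src/deep/b.tar.gz notes.txt);
# Keep only the matching names and rewrite them in one go.
my @renamed = map { /$re/ ? "$1/new/$2.tgz" : () } @files;

print "$_\n" for @renamed;   # src/new/a.tgz, src/deep/new/b.tgz
```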

SEE ALSO

file based and in the CORE

glob builtin, File::Glob

renowned, but matcher only

Regexp::Wildcards, Text::Glob, Regexp::Shellish, Regexp::SQL::LIKE

expander

Text::Glob::Expand, String::Glob::Permute, String::Range::Expand, Regexp::Genex (Regexp based),  Data::Generate (alienated)

glob-based filename substituter

File::GlobMapper which is part of IO::Compress, File::Wildcard

lightweight named capturing matcher

Routes::Tiny

list modules with fuze

List::Gen (swiss army knife), Set::CartesianProduct::Lazy (lightning), Set::CrossProduct (iterating), Iterator::Array::Jagged, Math::Cartesian::Product

list comprehension

List::Maker, List::Gen

and now for something completely different

File::HomeDir, File::Which, Path::Class, Cwd, Time::Out, Algorithm::Permute, Set::Scalar (a real member of Set::), Interpolation, Regexp::Common, HOP::Stream, Iterator::Simple (a worthy representative for all mentioned iterator packages), if core module, maybe (nearly a philosophy: just try it)

CONTRIBUTION

Some test files t/02-*.t are borrowed/adapted from the other CPAN modules mentioned above which also provide glob functionality.

COPYRIGHT

(c) 2013 Josef. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.