The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Regexp::Ignore - Let us ignore unwanted parts, while parsing text.

WARNING

This is an alpha code. Really. It was written in the end of 2001. It is not yet checked much. The only reason I submit it to CPAN that early is to get feedback about the idea, and hopefully to get some help in finding the many bugs that must still be in it. In our company we use this code, though, and for our needs it runs well.

SYNOPSIS

  use Regexp::IgnoreXXX;

  my $rei = new Regexp::IgnoreXXX($text, 
                                  "<!-- __INDEX__ -->");
  # split the wanted text from the unwanted text
  $rei->split();  

  # use substitution function
  $rei->s('(var)_(\d+)', '$2$1', 'gi');
  $rei->s('(\d+):(\d+)', '$2:$1');

  # merge back to get the resulted text
  my $changed_text = $rei->merge();

DESCRIPTION

Markup languages, like HTML, are difficult to parse. The reason is that you can have a line like:

  <font size=+1>H</font>ello <font size=+1>W</font>orld

How can we find the string "Hello World", in the above line, and replace it by "Hello Universe" (which is a lot deeper)? Or how can we run a speller on the text and replace the mistakes with suggestions for the correct spelling?

This module come to help you doing exactly that.

Actually the module let you first split the text to the parts you are interested in and the unwanted parts. For example, all the HTML tags can be taken as unwanted parts.

Then it let you parse the part you are interested in (while totally ignoring the unwanted parts).

In the end it let you merge back the unwanted parts with the possibly changed parts you were interested in.

There is just one catch. It uses the assumption that when you replace the above "Hello World" to "Hello Universe", all the unwanted parts between the start of the match to the end of the match, will be pushed after the text that will replace the match. This is not really understood right? Look at the example:

The text:

  <font size=+1>H</font>ello <font size=+1>W</font>orld

will be first split and we will get the "cleaned" text:

  Hello World

Then we can parse it using something like:

  s/Hello World/Hello Universe/;

This will give us the changed "cleaned" text:

  Hello Universe

When we will merge with the unwanted parts we will get

  <font size=+1>Hello Universe</font><font size=+1></font>

So, the unwanted parts in the match were pushed after the replacer.

Why this assumption?

Because. Actually, I could not find any better assumption. I can not guess what will be the unwanted parts in a match and the replacer of the match might be longer or shorter then the match itself. So, in fact, we have three reasonable possibilities: 1. Push the unwanted parts before the replacer. 2. Push the unwanted parts after the replacer. 3. Spread the unwanted parts in the replacer in the same proportions that they are spread in the match.

So I chose the second option. It is very similar to the first, and by far a lot simpler (to implement and to use) then the third.

As you see in the example above, usually it should not break the markup language. It might, though, give some surprises - in the example above, "Hello Universe" is all marked to be with bigger fonts.

All in all, I believe that it provides big help when parsing formatted texts.

So now, that we know what the module can give us, let's check how we use the module.

The class Regexp::Ignore is an abstract class: there is a method, get_tokens, in the class that is not implemented. So the user of this class must inherit it and implement the get_tokens method. The get_tokens method actually splits the text into tokens and mark them "wanted" or "unwanted".

Don't panic - it might sound very difficult, but it is not. Moreover, the module comes with some classes that already inherit from Regexp::Ignore, and you can use them. For more details about implementing the get_tokens method and an implementation example, see below.

After we have the inherited class that implements the get_tokens method, and we call split to split the text, we can go on with our parsing like the SYNOPSIS above. We can use the method s which is parallel to the perl s// operator, and if we need more complex text manipulation, we can replace text directly using the b<replace> method.

When we finish to change the text, we can call the merge method that will build the resulted text from the changed "cleaned" text and the unwanted parts.

HOW IT WORKS

OK, you don't have to read this part if you just want to use the class. However, if you are the curious type, you might find it interesting.

The get_tokens method splits the text to tokens that are kept in a list. It also creates other list that contains "wanted" flags. So actually we get a list of tokens and for each the information if it is wanted or unwanted.

The split method uses the get_tokens to create the CLEANED_TEXT and the DELIMITED_TEXT.

Let's take the example:

  <p><b>bla</b><b>_</b><b>123</b></p>
  <p><b>bLa</b><b>_</b><b>1234567</b></p>

And assuming our get_tokens mark all the HTML tags as unwanted, we will get:

      tokens list               flags list
  ---------------------      ---------------
   0:   <p>                         0
   1:   <b>                         0
   2:   bla                         1
   3:   </b>                        0   
   4:   <b>                         0
   5:   _                           1
   6:   </b>                        0
   7:   <b>                         0
   8:   123                         1
   9:   </b>                        0
  10:   </p>                        0
  11:   <p>                         0
  12:   <b>                         0
  13:   bLa                         1
  14:   </b>                        0   
  15:   <b>                         0
  16:   _                           1
  17:   </b>                        0
  18:   <b>                         0
  19:   123456                      1
  20:   </b>                        0
  21:   </p>                        0

The CLEANED_TEXT will be:

   bla_123bLa_1234567

And if the delimiter pattern is "<!-- __INDEX__ -->" the DELIMITED_TEXT will be:

   <!-- 000000000 --><!-- 000000001 -->bla
   <!-- 000000003 --><!-- 000000004 -->_
   <!-- 000000006 --><!-- 000000007 -->123
   <!-- 000000009 --><!-- 000000010 -->
   <!-- 000000011 --><!-- 000000012 -->bLa
   <!-- 000000014 --><!-- 000000015 -->_
   <!-- 000000017 --><!-- 000000018 -->1234567
   <!-- 000000020 --><!-- 000000021 -->

Now the split method generates an array that contains a translation of the positions between the cleaned text and the delimited text:

   CLEANED_TO_DELIMITED_POSITIONS array
   ------------------------------------
   0:     36
   1:     37 
   2:     38
   3:     75
   4:    112
   5:    113
   6:    114
   7:    187
   8:    188
   9:    189
  10:    226
  11:    263
  12:    264
  13:    265
  14:    266
  15:    267
  16:    268
  17:    269
 

The following rulers with the cleaned and delimited texts might help you understand this the translation table:

The CLEANED_TEXT:

             1
   012345678901234567
   bla_123bLa_1234567
   

The DELIMITED_TEXT:

   0         1         2         3        
   012345678901234567890123456789012345678
   <!-- 000000000 --><!-- 000000001 -->bla

    4         5         6         7       
   9012345678901234567890123456789012345  
   <!-- 000000003 --><!-- 000000004 -->_

       8         9         0         1    
   678901234567890123456789012345678901234
   <!-- 000000006 --><!-- 000000007 -->123

        2         3         4         5   
   567890123456789012345678901234567890   
   <!-- 000000009 --><!-- 000000010 -->

            6         7         8         
   123456789012345678901234567890123456789
   <!-- 000000011 --><!-- 000000012 -->bLa

   9         0         1         2
   0123456789012345678901234567890123456
   <!-- 000000014 --><!-- 000000015 -->_

      3         4         5         6 
   7890123456789012345678901234567890123456789
   <!-- 000000017 --><!-- 000000018 -->1234567

   7         8         9         0
   012345678901234567890123456789012345
   <!-- 000000020 --><!-- 000000021 -->

As an example, we call now the s method with something similar to:

   s/(bla)_(\d+)/<font color=red>$2</font>_$1/gi

which will be the call:

   $rei->s('(bla)_(\d+)','<font color=red>$2</font>_$1','gi');

the following will happen:

We will use the m// operator to have the match against the cleaned text:

   m/(bla)_(\d+)/i

This will match first with 'bla_123' in the cleaned text. Now we keep the matching variables $& and $1..$9. Then we create the replacer string by substituting those variables in the string:

   '<font color=red>$2</font>_$1'

We will also keep the exact position where the match happened in the cleaned text, and the length of the match.

Using the positions of the start and end of the match, we define a region in the clean text where the match happened, and where the replacer should be placed.

In our example this region is 0 to 6.

We can now use the translation array to translate this region to positions in the delimited text.

We will get the region 36 to 114 in the delimited text.

Now we can get deal with those two regions:

In the clean text it is simple to place the replacer instead of anything that was in that region.

In the delimited text, we will first put all the delimiters in that region together. Then we add the replacer before them, and we place all of this in the region.

Now the only thing we have to do is to fix the translation table - the translation table will not be correct from the start of the matched region, and if the replacer is different in size from the match, also after the matched region.

This is why we use the TRANSLATION_POSITION_FACTOR data member. It keeps the built up difference between the match regions and the replacers while we parse along the text.

The fix of the translation table is boring indexing manipulations. We first fix the region of the replacer to represent the new replacer, and then if there is a difference between the lengths of the match and the replacer, we fix all the indexes after the match.

After we finish to manipulate the text, we build back our text by replacing the delimiters in the delimited text by the tokens that those delimiters represent. This is done by the merge method.

And voila! We get back our text manipulated.

CONSTRUCTOR

new (TEXT, DELIMITER_PATTERN)

Constructs an object of the class. TEXT is the text that we want to parse. DELIMITER_PATTERN is a string that will be used to create delimiters while processing the text. It should contain the string '__INDEX__' that will be replaced by an index, for example: '000000073'.

That delimiter should be chosen to fit the text that should be parsed, and to the get_tokens results. For example for HTML text we can choose '<!-- __INDEX__ -->' or even <__INDEX__>. This might be a good delimiter if our get_tokens takes all the HTML tags as unwanted tokens.

So our choice for a delimiter should be anything that can be used as a delimiter for the "cleaned" text (after the unwanted parts were taken away from the text).

METHODS

get_tokens ( )

This is an abstract method. It should be implemented in a daughter class of this class. Moreover, you will never call this method directly in your code. The split method will call the get_tokens method that you implement.

The method should use the text method to get the text it takes as input. It should return a list of two array references. The first reference refers to a list of all the tokens, and the second reference refers to a list of flags (perl TRUE or FALSE, so one or zero for example). If the flag is FALSE, it means that the token in the other list in the same index is unwanted.

As one example is better then many words, here is an implementation of the get_tokens method that takes all the HTML tags as unwanted parts:

 sub get_tokens {
     my $self = shift;

     my $tokens = [];
     my $flags = [];
     my $index = 0;
     # we should create tokens from the TEXT.
     my $text = $self->text();
     while (defined($text) && 
         # the regular expression will try to match:
         #  - HTML remarks - all the remark will be matched. 
         #  - HTML other tags 
         $text =~ /(<\!\-\-[\s\S]+?\-\->)|(<\/?[^\>]*?>)/i) {
         if ($`) { # if there is text before, take it as clean
             $tokens->[$index] = $`;
             # the text before the match is clean. 
             $flags->[$index] = 1; 
             $index++; # increment the index
         }
         $tokens->[$index] = $&;
         $flags->[$index] = 0; # the match itself is unwanted.
         $index++; # increment the index again
        
         $text = $'; # update the original text to after the match.
     }
 
     # if we are done or we had no match at all, check if there is 
     # still something in the $text. this will be also a clean text.
     if (defined($text) && $text) {
         $tokens->[$index] = $text;
         $flags->[$index] = 1; 
     }
     # return the two list
     return ($tokens, $flags);
 } # of get_tokens

Classes that implement the get_tokens come with this module. Check first if one of them does not implement the get_tokens you need.

And if you feel you wrote a get_tokens that might be useful for the rest of us, please let me know about it.

split ( )

This method should be called before the s or replace methods are called. It will use the get_tokens method to split the text to unwanted tokens and the "cleaned" text. After this method is called the CLEANED_TEXT and the DELIMITED_TEXT data members are available.

s (PATTERN, REPLACEMENT, SWITCHES)

This method implements the perl s// operator while ignoring the unwanted tokens. See the INTRODUCTION section above, and perlop for more details.

You can call this method several times between a call to split and a call to merge.

Important Note: The 'e' and the double 'e' switches are not yet implemented. It is very difficult to implement and maybe impossible without a very sophisticated hack as the method s suppose to see the values of lexical variables in the code that calls that method. I do not know how to do that. If someone has ideas - please contact me or send the patch. Other problem is the way to correctly eval the REPLACEMENT. It is not totally clear to me how to do that correctly. Again - if someone can help - please! Meanwhile, though, you can use the replace method below.

replace ( BUFFER_REF, LAST_POSITION_REF, START_MATCH_POSITION, END_MATCH_POSITION, REPLACER )

The replace method is used by the s method, and usually should not be used directly. However, it might be that the advanced programmer will want to have special manipulation that is done better using the replace. It also gives us a way to by-pass my failure to implement the 'e' and double 'e' switches in the s method.

The replace builds a buffer every time it is called. This buffer is the manipulated cleaned text till the place of the last match and replace. It does not work directly on the CLEANED_TEXT data member in order not to change the cleaned text between the matches (so to gain in performance).

Before we call the replace, we suppose to zero the TANSLATION_POSITION_FACTOR, so previous replaces along the text will not affect the current replaces.

Then we should prepare an empty buffer, and a variable that will hold the position after the last match. This variable should be zero as well.

Now we should send to the replace method a reference to the buffer, a reference to the last position variable, the positions of the start and end of a match in the cleaned text, and a replacer.

The replace method will place the replacer instead of the match, and will build the buffer till the end of the replacer. It will also set the last position variable to the correct value.

Again, example might make it a lot simpler:

      my $name = "Rani";
      ...
      $rei->translation_position_factor(0);
      my $cleaned_text = $rei->cleaned_text();
      my $after_the_matach;
      my $buffer = "";
      my $last_position = 0;
      # for each word
      while ($cleaned_text =~ /$pattern/g) {
          my $match = $&;           
          my $end_match_position = pos($cleaned_text) - 1;
          my $match_length = length($match);
          my $start_match_position = 
              $end_match_position - $match_length + 1;
          # as an example we call a function 
          my $replacer = func($name, $2, $1);
          $rei->replace(\$buffer,
                        \$last_position,
                        $start_match_position,
                        $end_match_position,
                        $replacer);
      }
      $buffer .= substr($rei->cleaned_text(), $last_position);
      $rei->cleaned_text($buffer);

This will actually do the same as calling the s method like this:

      s/$pattern/&func($name,$2,$1)/ge;

Of course the replace method can be more useful in other cases. For example, if we change our regular expression in the above while block. Or if , before the while block, we copy part of the CLEANED_TEXT to the buffer and set the last position variable accordingly in order to start to match from the middle of the CLEANED_TEXT.

merge ( )

This method will build back our text from the manipulated CLEANED_TEXT and the unwanted tokens. It saves the resulted text in the TEXT data member and also returns it.

ACCESS METHODS

text ( TEXT )

Represents the text we input in order to manipulate, and the resulted text we get after we had the manipulations and merged.

delimited_text ( )

Represents the "cleaned" text after we called the split method, with delimiters that represent the unwanted tokens.

cleaned_text ( )

Represents the "cleaned" text after we called the split method and took out the unwanted parts.

delimiter_pattern ( DELIMITER_PATTERN )

Represents the DELIMITER_PATTERN data member. See the CONSTRUCTOR for more details.

BUGS AND OTHER PROBLEMS

Who knows?!? You should tell me. Please!

I guess there are bugs because this module is new - a baby module - that was created in the holidays of the end of 2001. And also because the algorithm that is implemented in it is not simple for me.

Besides, I am quite certain it does not perform as you expect. So, part of this problem is in your expectations ;-) This module come to kill a huge problem, that if you try to solve it other way, it will probably perform less good (and if not - tell me how you do it!). However, many parts in it can be for sure implemented differently to give better performances. Please - let me know what you think, send me patches or ideas.

AUTHOR

Rani Pinchuk, <rani@cpan.org>

COPYRIGHT

Copyright (c) 2002 Ockham Technology N.V. & Rani Pinchuk. All rights reserved. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

perl, perlop, perlre.