The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

String::REPartition - Generates a regex to partition a data set

SYNOPSIS

  use String::REPartition;
  use strict;

  my($regex) = make_partition_re(0.5, \@some_really_big_list_of_strings);
  if ($string =~ /$regex/) {
    # $string is in the first half of the data
  } else {
    # $string is in the second half
  }

DESCRIPTION

This module exports a single function -- make_partition_re. It takes as its first argument a number between 0 and 1, representing a percentage, and as its second argument a reference to a list of strings. It returns a regular expression which is guaranteed to match the percentage of the strings in the list represented by the number in the first argument. More importantly, the regex returned will *not* to match the rest of the string in the list. That is, if the inputs were '0.6' and a reference to a list of 100 strings, the returned regex would match 60 of the strings in the list and not match the other 40.

Keep in mind that, since only integer operations may be performed on these strings, (that is, there cannot be a regex which matches a fraction of a string), the target number is rounded down. If you have 4 strings in your list and a ratio of 0.4, the resulting regex will match 1 string, not 1.6 strings. More interestingly, with 4 strings and a ratio of 0.1, the resulting regex will almost certainly be /^()$/ -- matching exactly 0 of the strings in the list. Furthermore, because of this rounding, the returned regex may not match precisely the number expected by multiplying the size of the list by the ratio, but instead be off by a small number in either direction.

c<make_partition_re()> will return c<undef> on a failure, and print a warning to STDERR if $^W is true. Currently, the only errors that can occur relate directly to the validity of the inputs. Furthermore, if the strings in the list are not unique, the behaviour of this function is not defined. For a small amout of repetition the regex should still work, but it should be clear that a solution cannot be found if the input list consists only of many copies of the same string.

The function finds its solution in roughly O(N) time -- however, in worst cases, I think it can get as high as O(N^2). It's also true that certain types of pathologically constructed data sets can break things and cause it either to return an invalid regex or to enter an infinite loop. While I haven't run into any of this in my testing, I'm not confident that I've tested every possibility.

So why would you want to use this? Well, that's a question you'll have to answer for yourself, mostly. :) However, the situations I envisaged while developing the module were sort of like this: Imagine you have a large set of data, indexed by a correspondingly large set of keywords. Let's say you want to split this data into two partitions, perhaps in order to store the data in two separate locations. Maintaining a complete list of the remote keys could be expensive -- instead, you can simply store a regular expression which matches the keys you keep remotely and does not match the local ones.

Another interesting feature is that a regex generated from a sufficiently large subset of your data will, approximately, match the appropriate percentage of strings from the complete data set. This means that you do not need to have all of the data before you generate a regex to partition it. As an example, generating a regex from roughly 10% of the words in /usr/dict/words (selected randomly) gave me a regex that matched within .3% of the desired result for all of the words.

Methods

make_partition_re

See description for details.

Future Work

  • As I indicate above, this module does not perform properly when data with non-unique strings is given to it. I did not feel it was reasonable to take the time to check the list for uniqueness, so it will happily process data for which it may not be able to generate a valid solution. This may change in future versions.

  • Internal rounding is a problem in some cases. The sections of the code which are most responsible for this were rather hastily conceived and no doubt could be somewhat improved. However, they are also somewhat confusing and I need to make this release in order to fix a few bugs. The rounding problem will be addressed in the next release.

AUTHOR

Copyright 2007 Avi Finkel <avi@finkel.org>

This package is free software and is provided "as is" without express or implied warranty. It may be used, redistributed and/or modified under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html)