NAME

MCE::Grep - Parallel grep model similar to the native grep function

VERSION

This document describes MCE::Grep version 1.800

SYNOPSIS

   ## Exports mce_grep, mce_grep_f, and mce_grep_s
   use MCE::Grep;

   ## Array or array_ref
   my @a = mce_grep { $_ % 5 == 0 } 1..10000;
   my @b = mce_grep { $_ % 5 == 0 } [ 1..10000 ];

   ## File_path, glob_ref, or scalar_ref
   my @c = mce_grep_f { /pattern/ } "/path/to/file";
   my @d = mce_grep_f { /pattern/ } $file_handle;
   my @e = mce_grep_f { /pattern/ } \$scalar;

   ## Sequence of numbers (begin, end [, step, format])
   my @f = mce_grep_s { $_ % 3 == 0 } 1, 10000, 5;
   my @g = mce_grep_s { $_ % 3 == 0 } [ 1, 10000, 5 ];

   my @h = mce_grep_s { $_ % 3 == 0 } {
      begin => 1, end => 10000, step => 5, format => undef
   };

DESCRIPTION

This module provides a parallel grep implementation via Many-Core Engine (MCE). MCE incurs a small overhead from passing data between processes, so a fast code block will run faster natively. The overhead matters less as the code block becomes more expensive to run.

   my @m1 =     grep { $_ % 5 == 0 } 1..1000000;          ## 0.065 secs
   my @m2 = mce_grep { $_ % 5 == 0 } 1..1000000;          ## 0.194 secs

Chunking, enabled by default, greatly reduces the overhead behind the scenes. The time for mce_grep below also includes the time for data exchanges between the manager and worker processes. The benefit of parallelization grows as the code block consumes more CPU time.

   my @m1 =     grep { /[2357][1468][9]/ } 1..1000000;    ## 0.353 secs
   my @m2 = mce_grep { /[2357][1468][9]/ } 1..1000000;    ## 0.218 secs

Even faster is mce_grep_s; useful when input data is a range of numbers. Workers generate sequences mathematically among themselves without any interaction from the manager process. Two arguments are required for mce_grep_s (begin, end). Step defaults to 1 if begin is smaller than end, otherwise -1.

   my @m3 = mce_grep_s { /[2357][1468][9]/ } 1, 1000000;  ## 0.165 secs

Although this document is about MCE::Grep, note that the MCE::Stream module can write results immediately without waiting for all chunks to complete. This is made possible by passing a reference to an array (in this case @m4 and @m5).

   use MCE::Stream default_mode => 'grep';

   my @m4; mce_stream \@m4, sub { /[2357][1468][9]/ }, 1..1000000;

      ## Completed in 0.203 secs. This is amazing considering the
      ## overhead for passing data between the manager and workers.

   my @m5; mce_stream_s \@m5, sub { /[2357][1468][9]/ }, 1, 1000000;

      ## Completed in 0.120 secs. Like with mce_grep_s, specifying a
      ## sequence specification turns out to be faster due to lesser
      ## overhead for the manager process.

A common scenario is grepping for pattern(s) inside a massive log file. Notice how parallelism increases as complexity increases for the pattern. Testing was done against a 300 MB file containing 250k lines.

   use MCE::Grep;

   my @m; open my $LOG, "<", "/path/to/log/file" or die "$!\n";

   @m = grep { /pattern/ } <$LOG>;                      ##  0.756 secs
   @m = grep { /foobar|[2357][1468][9]/ } <$LOG>;       ## 24.681 secs

   ## Parallelism with mce_grep. This involves the manager process
   ## due to processing a file handle.

   @m = mce_grep { /pattern/ } <$LOG>;                  ##  0.997 secs
   @m = mce_grep { /foobar|[2357][1468][9]/ } <$LOG>;   ##  7.439 secs

   ## Even faster with mce_grep_f. Workers access the file directly
   ## with zero interaction from the manager process.

   my $file = "/path/to/file";
   @m = mce_grep_f { /pattern/ } $file;                 ##  0.112 secs
   @m = mce_grep_f { /foobar|[2357][1468][9]/ } $file;  ##  6.840 secs

PARSING HUGE FILES

MCE::Grep cannot see the pattern inside the code block, so it has no way to quickly check whether a chunk contains a match before processing it line by line. Use the following snippet as a template to achieve better performance. Also, take a look at examples/egrep.pl, included with the distribution.

   use MCE::Loop;

   MCE::Loop::init {
      max_workers => 8, use_slurpio => 1
   };

   my $pattern  = 'karl';
   my $hugefile = 'very_huge.file';

   my @result = mce_loop_f {
      my ($mce, $slurp_ref, $chunk_id) = @_;

      ## Quickly determine if a match is found.
      ## Process slurped chunk only if true.

      if ($$slurp_ref =~ /$pattern/m) {
         my @matches;

         if ($^O ne 'MSWin32') {
            ## The following is fast on Unix. Performance degrades
            ## drastically on Windows beyond 4 workers.

            open my $MEM_FH, '<', $slurp_ref or die "$!\n";
            binmode $MEM_FH, ':raw';
            while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
            close   $MEM_FH;
         }
         else {
            ## Therefore, use the following construct on Windows.

            while ( $$slurp_ref =~ /([^\n]+\n)/mg ) {
               my $line = $1; # save $1 to not lose the value
               push @matches, $line if ($line =~ /$pattern/);
            }
         }

         ## Gather matched lines.

         MCE->gather(@matches);
      }

   } $hugefile;

   print join('', @result);
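
The pre-check in the template can be tried in isolation, without MCE. The following is a minimal sketch using core Perl only; the @chunks array is a hypothetical stand-in for the slurped chunks that workers would receive, and the $scanned counter exists only to show how many chunks reach the line-by-line scan.

```perl
use strict;
use warnings;

my $pattern = 'karl';

## Hypothetical stand-ins for slurped chunks (assumption, not MCE output).
my @chunks = (
   "alice\nbob\ncarol\n",
   "dave\nkarl marx\nerin\nkarlsruhe\n",
   "frank\ngrace\n",
);

my @result;
my $scanned = 0;

for my $chunk (@chunks) {
   ## One regex pass over the whole chunk; skip it if nothing matches.
   next unless $chunk =~ /$pattern/;

   ## Only now pay for the line-by-line scan.
   $scanned++;

   while ($chunk =~ /([^\n]+\n)/g) {
      my $line = $1; # save $1 to not lose the value
      push @result, $line if $line =~ /$pattern/;
   }
}

print join('', @result);
print "chunks scanned line by line: $scanned of ", scalar(@chunks), "\n";
```

Here only the second chunk is scanned line by line; the other two are rejected by a single match against the whole chunk.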

OVERRIDING DEFAULTS

The following lists options that may be overridden when loading the module.

   use Sereal qw( encode_sereal decode_sereal );
   use CBOR::XS qw( encode_cbor decode_cbor );
   use JSON::XS qw( encode_json decode_json );

   use MCE::Grep
       max_workers => 4,                # Default 'auto'
       chunk_size => 100,               # Default 'auto'
       tmp_dir => "/path/to/app/tmp",   # $MCE::Signal::tmp_dir
       freeze => \&encode_sereal,       # \&Storable::freeze
       thaw => \&decode_sereal          # \&Storable::thaw
   ;

There is a simpler way to enable Sereal. The following will attempt to use Sereal if available, otherwise defaults to Storable for serialization.

   use MCE::Grep Sereal => 1;

From MCE 1.800 onwards, this is done automatically if Sereal 3.008 or later is installed. Specify Sereal => 0 if Storable is desired.

CUSTOMIZING MCE

MCE::Grep->init ( options )
MCE::Grep::init { options }

The init function accepts a hash of MCE options. The gather option, if specified, is ignored due to being used internally by the module.

   use MCE::Grep;

   MCE::Grep::init {
      chunk_size => 1, max_workers => 4,

      user_begin => sub {
         print "## ", MCE->wid, " started\n";
      },

      user_end => sub {
         print "## ", MCE->wid, " completed\n";
      }
   };

   my @a = mce_grep { $_ % 5 == 0 } 1..100;

   print "\n", "@a", "\n";

   -- Output

   ## 2 started
   ## 3 started
   ## 1 started
   ## 4 started
   ## 3 completed
   ## 4 completed
   ## 1 completed
   ## 2 completed

   5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

API DOCUMENTATION

MCE::Grep->run ( sub { code }, iterator )
mce_grep { code } iterator

An iterator reference can be specified for input_data. Iterators are described under "SYNTAX for INPUT_DATA" in MCE::Core.

   my @a = mce_grep { $_ % 3 == 0 } make_iterator(10, 30, 2);
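
The make_iterator function used above is not defined in this document. The following is one possible sketch, assuming an iterator is a closure that returns the next item on each call and an empty list once the input is exhausted (see MCE::Core for the exact contract). The helper and its argument order (first, last, step) are illustrative assumptions. The iterator is drained serially here to show the values a code block would see.

```perl
use strict;
use warnings;

## Hypothetical helper: returns a closure yielding $first, $first + $step,
## ... up to $last, then an empty list once exhausted.
sub make_iterator {
   my ($first, $last, $step) = @_;
   my $n = $first;

   return sub {
      return if $n > $last;   # exhausted: nothing more to hand out

      my $current = $n;
      $n += $step;

      return $current;
   };
}

## Drain the iterator serially, then apply the same test as the code block.
my $iter = make_iterator(10, 30, 2);
my @seen;

while (defined(my $item = $iter->())) {
   push @seen, $item;
}

my @multiples_of_3 = grep { $_ % 3 == 0 } @seen;

print "@multiples_of_3\n";
```

The even numbers from 10 through 30 are produced, of which 12, 18, 24, and 30 pass the test.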

MCE::Grep->run ( sub { code }, list )
mce_grep { code } list

Input data can be defined using a list.

   my @a = mce_grep { /[2357]/ } 1..1000;
   my @b = mce_grep { /[2357]/ } [ 1..1000 ];

MCE::Grep->run_file ( sub { code }, file )
mce_grep_f { code } file

Input can be a file path, a file handle (glob reference), or a scalar reference. A file path is the fastest of the three: workers communicate the next offset position among themselves without any interaction from the manager process.

   my @c = mce_grep_f { /pattern/ } "/path/to/file";
   my @d = mce_grep_f { /pattern/ } $file_handle;
   my @e = mce_grep_f { /pattern/ } \$scalar;

MCE::Grep->run_seq ( sub { code }, $beg, $end [, $step, $fmt ] )
mce_grep_s { code } $beg, $end [, $step, $fmt ]

Sequence can be defined as a list, an array reference, or a hash reference. The functions require both begin and end values to run. Step and format are optional. The format is passed to sprintf (% may be omitted below).

   my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

   my @f = mce_grep_s { /[1234]\.[5678]/ } $beg, $end, $step, $fmt;
   my @g = mce_grep_s { /[1234]\.[5678]/ } [ $beg, $end, $step, $fmt ];

   my @h = mce_grep_s { /[1234]\.[5678]/ } {
      begin => $beg, end => $end, step => $step, format => $fmt
   };
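
To see which values the code block receives for this sequence, the same begin, end, step, and format can be expanded serially in plain Perl. This is only an approximation for illustration: it assumes each value is computed as begin plus a multiple of step and then passed through sprintf, and it does not reflect how MCE workers divide the sequence among themselves.

```perl
use strict;
use warnings;

my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

## Number of steps, rounded to guard against floating-point error.
my $steps = int(($end - $beg) / $step + 0.5);

## Format each value with sprintf, as the format option describes.
my @seq = map { sprintf($fmt, $beg + $_ * $step) } 0 .. $steps;

## Apply the same pattern as the mce_grep_s examples.
my @f = grep { /[1234]\.[5678]/ } @seq;

print scalar(@seq), " values, ", scalar(@f), " matches\n";
print "first: $f[0], last: $f[-1]\n";
```

Because the pattern is applied to the formatted string, the choice of format decides what can match: with "%4.1f" the values look like 11.5, so a pattern that expects one digit after the decimal point works as intended.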

MANUAL SHUTDOWN

MCE::Grep->finish
MCE::Grep::finish

Workers remain persistent as much as possible after running. Shutdown occurs automatically when the script terminates. Call finish when workers are no longer needed.

   use MCE::Grep;

   MCE::Grep::init {
      chunk_size => 20, max_workers => 'auto'
   };

   my @a = mce_grep { ... } 1..100;

   MCE::Grep::finish;

INDEX

MCE, MCE::Core

AUTHOR

Mario E. Roy, <marioeroy AT gmail DOT com>