NAME

Text::ASCIIPipe - helper for processing multiple text files in a stream (through a pipe, usually)

SYNOPSIS

        use Text::ASCIIPipe;

        # The hooks get the current line as argument $_[0].
        # This is printed by Text::ASCIIPipe::process after the hook returns.
        # Change it to your liking, including setting to '' for suppressing output.
        sub line_hook
        {
                $_[0] = "A line: ".$_[0];
        }

        # For the delimiter hooks, the line consists of the control character,
        # so the final output may want to suppress it -- but not so if the output
        # goes to another pipe processor.
        sub begin_hook  { $_[0] = "New file began here!\n"; }
        sub end_hook    { $_[0] = "A file ended.\n";        }
        sub allend_hook { $_[0] = "End of transmission\n";  }

        my $line;
        # Bare usage without callback hooks, using STDIN.
        while(defined (my $state = Text::ASCIIPipe::fetch(undef, $line)))
        {
                line_hook($line)   if ($state == $Text::ASCIIPipe::line);
                begin_hook($line)  if ($state == $Text::ASCIIPipe::begin);
                end_hook($line)    if ($state == $Text::ASCIIPipe::end);
                allend_hook($line) if ($state == $Text::ASCIIPipe::allend);
                print $line;
                # End of transmission is not exactly the same as stream end.
                # But mostly so.
                last if ($state == $Text::ASCIIPipe::allend);
        }

        # Processes STDIN to STDOUT, very similar to the above code.
        # You can set any hook to undef to disable it.
        Text::ASCIIPipe::process
        (
                 begin  => \&begin_hook
                ,line   => \&line_hook
                ,end    => \&end_hook
                ,allend => \&allend_hook
        );

        # Processes given file handle.
        my $fh;
        open($fh, '<', $dump_of_a_text_data_stream);
        Text::ASCIIPipe::process
        (
                 in     => $fh
                ,out    => \*STDOUT # or undef, or some other file
                ,begin  => \&begin_hook
                ,line   => \&line_hook
                ,end    => \&end_hook
                ,allend => \&allend_hook
                ,flush  => 0  # Default is 1 (see below).
        );

        # The other side of the pipe can push several files...
        # Per default to STDOUT.

        # Just shove one whole file through (default: STDIN -> STDOUT).
        my $from;
        my $to;
        open($from, '<', $some_filename);
        open($to, '|-', $some_command); # A pipe is what makes most sense...
        # Remember: $to can always be undef for STDOUT.
        Text::ASCIIPipe::push_file($from, $to);

        # Pull a file from Pipe (STDIN in this case) into given handle.
        open(my $out_fh, '>', $some_filename);
        my $fetch_count = Text::ASCIIPipe::pull_file(undef, $out_fh);
        print "Seems like something came through.\n" if($fetch_count > 0);

        # Detailed API.
        Text::ASCIIPipe::file_begin($to); # Send begin marker.
        # Send line(s) of file.
        Text::ASCIIPipe::file_lines($to, "#header\n", "1 2 3\n");
        Text::ASCIIPipe::file_end($to);   # Send end marker.

        # After sending all files, send total end marker (allend).
        # Just closing the sink does the trick, too.
        Text::ASCIIPipe::done($to);

DESCRIPTION

A lot of the speed penalty of Perl when processing multiple smallish data sets from/to text form in a shell loop consists of the repeated perl compiler startup / script compilation, which accumulates when looping over a set of files. This process can be sped up a lot by keeping the pipe alive and streaming the whole file set through it once. This module helps you with that. Of course, a pipe of several scripts parsing/producing text will still be slower than a custom C program that does the job, but with this trick of avoiding repeated script interpretation/compilation, the margin is a lot smaller.

When dealing with ASCII-based text files (or UTF-8, if you please), there are some control characters that just make sense for pushing several files as a stream, separated by these characters. These are character codes 2 (STX, start of text), 3 (ETX, end of text) and 4 (EOT, end of transmission). All this module does is provide a wrapper that inserts these control characters on the sender side and parses them on the receiver side. Nothing fancy, really. I just got fed up writing the same loop over and over again. It works with all textual data that does not contain control characters below decimal code 5.
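To make the framing idea concrete, here is a minimal sketch in core Perl that does not use Text::ASCIIPipe at all. It assumes each marker sits on a line of its own, which may differ from the module's actual wire format, and the helpers frame_files and unframe_stream are hypothetical names invented for this illustration.

```perl
use strict;
use warnings;

# Control characters named in the text above (ASCII codes 2, 3 and 4).
my ($STX, $ETX, $EOT) = ("\x02", "\x03", "\x04");

# Hypothetical helper: frame several in-memory "files" into one stream.
# Assumption: every marker is written as a line of its own.
sub frame_files {
    my @files  = @_;             # each element: array ref of lines
    my $stream = '';
    for my $lines (@files) {
        $stream .= "$STX\n" . join('', @$lines) . "$ETX\n";
    }
    return $stream . "$EOT\n";
}

# Hypothetical helper: split such a stream back into separate files.
sub unframe_stream {
    my ($stream) = @_;
    my (@files, $current);
    open(my $in, '<', \$stream) or die "cannot open in-memory stream: $!";
    while (defined(my $line = <$in>)) {
        my $first = substr($line, 0, 1);
        if    ($first eq $STX) { $current = []; }
        elsif ($first eq $ETX) { push @files, $current if $current; undef $current; }
        elsif ($first eq $EOT) { last; }
        elsif ($current)       { push @$current, $line; }
    }
    close($in);
    return @files;
}

my @in  = (["#header\n", "1 2 3\n"], ["4 5 6\n"]);
my @out = unframe_stream(frame_files(@in));
printf "%d files, first has %d lines\n", scalar(@out), scalar(@{ $out[0] });
```

The sketch reads from an in-memory handle only to stay self-contained; with a real pipe the same loop would read from STDIN.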

The process() function itself tries to employ a bit of smartness regarding buffering of the output. Since the actual operation of multiple ASCIIPipe-using programs in a, well, pipe, might conflict with the default buffering of the output stream (STDOUT), process() disables buffering on the output as soon as it encounters the first STX. This mirrors the code this module has been pulled from: it made sense there, enabling the last consumer in the pipe to get the end of a file in time and act on that information. This behaviour can be turned off by passing flush => 0 as a parameter.
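The buffering behaviour described above could look roughly like the following sketch. It is not the module's code: copy_with_flush is a hypothetical helper, and the sketch assumes that "disabling buffering" amounts to turning on autoflush for the output handle via the core IO::Handle module.

```perl
use strict;
use warnings;
use IO::Handle;

my $STX = "\x02";

# Hypothetical helper: copy lines from $in to $out and switch $out to
# autoflush mode once the first begin marker (STX) shows up.
# Returns true if autoflush was switched on.
sub copy_with_flush {
    my ($in, $out, $flush) = @_;
    my $seen_begin = 0;
    while (defined(my $line = <$in>)) {
        if ($flush and not $seen_begin and substr($line, 0, 1) eq $STX) {
            $out->autoflush(1);
            $seen_begin = 1;
        }
        print {$out} $line;
    }
    return $seen_begin;
}

# Demonstration with in-memory handles instead of a real pipe.
my $data = "preamble\n$STX\nline\n";
open(my $in, '<', \$data) or die "cannot open input: $!";
my $copy = '';
open(my $out, '>', \$copy) or die "cannot open output: $!";
my $flushed = copy_with_flush($in, $out, 1);
close($out);
close($in);
print "autoflush was enabled\n" if $flushed;
```

Passing a false third argument leaves the handle's buffering untouched, which is the analogue of flush => 0 in this sketch.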

FUNCTIONS

This module offers a simple procedural interface consisting of the following stateless functions:

fetch
        $state = Text::ASCIIPipe::fetch($in_handle, $line);

Tries to fetch a line of text from the given input handle (STDIN if undef), storing the data in $line. The return value corresponds to one of these states: undef if no data is available (unannounced EOF), $Text::ASCIIPipe::begin for a file begin marker, $Text::ASCIIPipe::end for a file end marker, $Text::ASCIIPipe::allend for the final end marker and, finally, $Text::ASCIIPipe::line if you actually fetched a line of content.

plaintext
        $not_special = Text::ASCIIPipe::plaintext($line);

Returns 1 if the given data does not start with one of the control codes that Text::ASCIIPipe interprets (could contain other control codes, though).
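A check of this kind can be sketched with a single regular expression. The helper looks_plain below is a hypothetical stand-in, not the module's implementation, and returning 0 for marker lines is an assumption, since the text above only states the return value for the plain case.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a plaintext-style check: returns 1 unless the
# data starts with one of the control codes 2, 3 or 4.
# Assumption: 0 is returned for marker lines.
sub looks_plain {
    my ($line) = @_;
    return ($line =~ /^[\x02-\x04]/) ? 0 : 1;
}

print "content line\n" if looks_plain("1 2 3\n");
print "marker line\n"  unless looks_plain("\x02\n");
```

Only the first character is inspected, so a line with a control code further in still counts as plain text here.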

process

Process a text file pipe, slurping through a stream of files. See SYNOPSIS for usage.

pull_file

Pull a single file from the pipe. See SYNOPSIS for usage.

push_file

Push a single file to the pipe. See SYNOPSIS for usage.

file_begin

Send file begin marker. See SYNOPSIS for usage.

file_lines

Send file contents. See SYNOPSIS for usage.

file_end

Send file end marker. See SYNOPSIS for usage.

done

Send overall end marker. See SYNOPSIS for usage.

TODO

Got to figure out if the business about autoflushing is right, and improve it.

SEE ALSO

This idea is too obvious. This must have been implemented a number of times already. Yet, I did not find an instance of this on CPAN.

AUTHOR

Thomas Orgis <thomas@orgis.org>

COPYRIGHT AND LICENSE

Copyright (C) 2011-2012, Thomas Orgis.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0. For more details, see the full text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.