The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

DataFlow - A framework for dataflow processing

VERSION

version 1.121830

SYNOPSIS

        use DataFlow;

        my $flow = DataFlow->new(
                procs => [
                    DataFlow::Proc->new( p => sub { do this thing } ), # a Proc
                        sub { ... do something },   # a code ref
                        'UC',                       # named Proc
                        [                           # named Proc, with parameters
                          CSV => {
                                direction     => 'CONVERT_TO',
                                text_csv_opts => { binary => 1 },
                          }
                        ],
                        # named Proc, named "Proc"
                        [ Proc => { p => sub { do this other thing }, deref => 1 } ],
                        DataFlow->new( ... ),       # another flow
                ]
        );

        $flow->input( <some input> );
        my $output = $flow->output();

        my $output = $flow->output( <some other input> );

        # other ways to invoke the constructor
        my $flow = DataFlow->new( sub { .. do something } );   # pass a sub
        my $flow = DataFlow->new( [                            # pass an array
                sub { ... do this },
                'UC',
                [
                  HTMLFilter => (
                    search_xpath => '//td',
                        result_type  => 'VALUE'
                  )
                ]
        ] );
        my $flow = DataFlow->new( $another_flow ); # pass another DataFlow or Proc

        # other way to pass the data through
        my $output = $flow->process( qw/long list of data/ );

DESCRIPTION

A DataFlow object is able to accept data, feed it into an array of processors (DataFlow::Proc objects), and return the result(s) back to the caller.

ATTRIBUTES

name

(Str) A descriptive name for the dataflow. (OPTIONAL)

default_channel

(Str) The name of the default communication channel. (DEFAULT: 'default')

auto_process

(Bool) If there is data available in the output queue, and one calls the output() method, this attribute will flag whether the dataflow should attempt to automatically process queued data. (DEFAULT: true)

procs

(ArrayRef[DataFlow::Role::Processor]) The list of processors that make this DataFlow. Optionally, you may pass CodeRefs that will be automatically converted to DataFlow::Proc objects. (REQUIRED)

The procs parameter will accept some variations in its value. Any ArrayRef passed will be parsed, and additionaly to plain DataFlow::Proc objects, it will accept: DataFlow objects (so one can nest flows), code references (sub{} blocks), array references and plain text strings. Refer to DataFlow::Types for more information on these different forms of passing the procs parameter.

Additionally, one may pass any of these forms as a single argument to the constructor new, plus a single DataFlow, or DataFlow:Proc or string.

METHODS

has_queued_data

Returns true if the dataflow contains any queued data within.

clone

Returns another instance of a DataFlow using the same array of processors.

input

Accepts input data for the data flow. It will gladly accept anything passed as parameters. However, it must be noticed that it will not be able to make a distinction between arrays and hashes. Both forms below will render the exact same results:

        $flow->input( qw/all the simple things/ );
        $flow->input( all => 'the', simple => 'things' );

If you do want to handle arrays and hashes differently, we strongly suggest that you use references:

        $flow->input( [ qw/all the simple things/ ] );
        $flow->input( { all => the, simple => 'things' } );

Processors using the DataFlow::Policy::ProcessInto policy (default) will process the items inside an array reference, and the values (not the keys) inside a hash reference.

channel_input

Accepts input data into a specific channel for the data flow:

        $flow->channel_input( 'mydatachannel', qw/all the simple things/ );

process_input

Processes items in the array of queues and place at least one item in the output (last) queue. One will typically call this to flush out some unwanted data and/or if auto_process has been disabled.

output_items

Fetches items, more specifically objects of the type DataFlow::Item, from the data flow. If called in scalar context it will return one processed item from the flow. If called in list context it will return all the items from the last queue.

output

Fetches data from the data flow. It accepts a parameter that points from which data channel the data must be fetched. If no channel is specified, it will default to the 'default' channel. If called in scalar context it will return one processed item from the flow. If called in list context it will return all the elements in the last queue.

reset

Clears all data in the dataflow and makes it ready for a new run.

flush

Flushes all the data through the dataflow, and returns the complete result set.

process

Immediately processes a bunch of data, without touching the object queues. It will process all the provided data and return the complete result set for it.

proc_by_index

Expects an index (Num) as parameter. Returns the index-th processor in this data flow, or undef otherwise.

proc_by_name

Expects a name (Str) as parameter. Returns the first processor in this data flow, for which the name attribute has the same value of the name parameter, or undef otherwise.

FUNCTIONS

dataflow

Syntax sugar function that can be used to instantiate a new flow. It can be used like this:

        my $flow = dataflow
                [ 'Proc' => p => sub { ... } ],
                ...
                [ 'CSV' => direction => 'CONVERT_TO' ];

        $flow->process('bananas');

HISTORY

This is a framework for data flow processing. It started as a spin-off project from the OpenData-BR initiative.

As of now (Mar, 2011) it is still a 'work in progress', and there is a lot of progress to make. It is highly recommended that you read the tests, and the documentation of DataFlow::Proc, to start with.

An article has been recently written in Brazilian Portuguese about this framework, per the São Paulo Perl Mongers "Equinócio" (Equinox) virtual event. Although an English version of the article in in the plans, you can figure a good deal out of the original one at

http://sao-paulo.pm.org/equinocio/2011/mar/5

UPDATE: DataFlow is a fast-evolving project, and this article, as it was published there, refers to versions 0.91.x of the framework. There has been a big refactor since then and, although the concept remains the same, since version 0.950000 the programming interface has been changed violently.

Any doubts, feel free to get in touch.

SUPPORT

Perldoc

You can find documentation for this module with the perldoc command.

  perldoc DataFlow

Websites

The following websites have more information about this module, and may be of help to you. As always, in addition to those websites please use your favorite search engine to discover more resources.

Email

You can email the author of this module at RUSSOZ at cpan.org asking for help with any problems you have.

Internet Relay Chat

You can get live help by using IRC ( Internet Relay Chat ). If you don't know what IRC is, please read this excellent guide: http://en.wikipedia.org/wiki/Internet_Relay_Chat. Please be courteous and patient when talking to us, as we might be busy or sleeping! You can join those networks/channels and get help:

  • irc.perl.org

    You can connect to the server at 'irc.perl.org' and join this channel: #sao-paulo.pm then talk to this person for help: russoz.

Bugs / Feature Requests

Please report any bugs or feature requests by email to bug-dataflow at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=DataFlow. You will be automatically notified of any progress on the request by the system.

Source Code

The code is open to the world, and available for you to hack on. Please feel free to browse it and play with it, or whatever. If you want to contribute patches, please send me a diff or prod me to pull from your repository :)

https://github.com/russoz/DataFlow

  git clone https://github.com/russoz/DataFlow.git

AUTHOR

Alexei Znamensky <russoz@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2011 by Alexei Znamensky.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

BUGS AND LIMITATIONS

You can make new bug reports, and view existing ones, through the web interface at http://rt.cpan.org.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.