
NAME

Iterator::Flex::Manual::Authoring - How to write an iterator

VERSION

version 0.18

DESCRIPTION

Iterator Phases

Iterators must manage the four different phases that the iterator might be in:

  • initialization

  • iteration

  • exhaustion

  • error

For more details, see "Iterator life-cycle" in Iterator::Flex::Manual::Overview.

Initialization

When an iterator is constructed it is typically passed some state information; it may be an array or a hash, a database handle, or a file pointer.

The constructor must save the relevant pieces of information (typically through closed-over variables; see below) and initialize variables which keep track of where the iterator is in the data stream.

For example, if the iterator operates on an array, it will need to keep track of the index of the element it must return next.

Some iterators don't need or have access to that information. If an iterator operates on a file handle, returning the next line in the file, the file handle keeps track of the next line in the file, so the iterator doesn't need to. Similarly, if an iterator is retrieving data from a database via a cursor, the database will keep track of where it is in the data stream.
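As a rough illustration of these two cases, the following sketch builds iterators as plain closures, without using Iterator::Flex at all. The names array_iter and handle_iter are invented for this example. The first closure must track an index itself; the second relies on the file handle to remember its position.

```perl
use strict;
use warnings;

# Hypothetical helper: the closure must remember where it is in the array.
sub array_iter {
    my ($arr) = @_;
    my $next = 0;    # index of the element to return next
    return sub {
        return undef if $next >= @$arr;
        return $arr->[ $next++ ];
    };
}

# Hypothetical helper: the file handle tracks the position for us.
sub handle_iter {
    my ($fh) = @_;
    return sub { return scalar readline($fh) };
}

my $from_array = array_iter( [ 'a', 'b' ] );

# An in-memory file handle stands in for a real file.
my $text = "x\ny\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my $from_handle = handle_iter($fh);

my @got = (
    $from_array->(),  $from_array->(),
    $from_handle->(), $from_handle->(),
);
```

Here the only state closed over by handle_iter is the handle itself; there is no counter to maintain.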

Iteration

In the iteration phase, the iterator identifies the data to return, updates its internal state (if necessary) so that it will return the correct data on the next iteration, and returns the data.

If the data stream has been exhausted, then the iterator must indicate this by calling the signal_exhaustion method. This method implements the exhaustion policy requested by the user who set up the iterator (either returning a sentinel value or throwing an exception).

After this the iterator enters the "Exhaustion" phase.

If there is an error (e.g. if a database connection is dropped), the iterator must signal this by calling the signal_error method. Not all iterators have an error phase.

Exhaustion

Unlike other iteration implementations, it is legal to call an iterator's next method after the iterator is exhausted. In the exhaustion phase, the iterator simply invokes the signal_exhaustion method.
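To make the two exhaustion policies concrete, here is a stand-alone sketch that mimics them with plain closures. It is not how Iterator::Flex implements signal_exhaustion; make_counter, its policy argument, and the string exception are all inventions for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an iterator with a user-selected policy.
# $policy is either 'throw' or [ return => $sentinel ].
sub make_counter {
    my ( $max, $policy ) = @_;
    my $n = 0;

    # Stand-in for signal_exhaustion: apply the chosen policy.
    my $signal_exhaustion = sub {
        die "exhausted\n" if !ref $policy && $policy eq 'throw';
        return $policy->[1];
    };

    return sub {
        return $signal_exhaustion->() if $n >= $max;
        return ++$n;
    };
}

# Sentinel policy: calling again after exhaustion is harmless and
# keeps yielding the sentinel.
my $sentinel_iter = make_counter( 2, [ return => -1 ] );
my @seen;
push @seen, $sentinel_iter->() for 1 .. 4;

# Throwing policy: the second call dies, which eval traps.
my $throwing_iter = make_counter( 1, 'throw' );
$throwing_iter->();
my $ok  = eval { $throwing_iter->(); 1 };
my $err = $@;
```

Note that the iterating closure is the same under either policy; only the signalling routine differs.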

Error

Not all iterators have an error phase, but if they are in one, they simply call signal_error.

Capabilities

An iterator must do at least one thing: return the next datum from the data stream. This is the next capability. Iterator::Flex iterators can support a number of other capabilities; see Iterator::Flex::Manual::Overview.

Building an Iterator

Iterators are constructed by passing an attribute hash, %AttrHash, to the Iterator::Flex factory, which uses it to construct an appropriate iterator class, instantiate it, and return it to the user.

The attribute hash (whose contents are documented in much greater detail in "Iterator Parameters" in Iterator::Flex::Manual::Overview) describes the iterator's capabilities and provides implementations.

The main iterator routine (next) must be a closure, with state contained in closed over variables. Every time a new iterator is constructed, a new closure is generated.

Writing an iterator generally involves writing a subroutine which returns the %AttrHash containing the closures. As an example, we will construct an iterator which operates on arrays, providing a number of capabilities.

For simplicity, we'll write a construct subroutine which is passed a reference to the array to iterate over, and returns the %AttrHash. Later we'll see how to create a one-off iterator and a standalone iterator class using the concepts we've explored.

Our construct subroutine will be called as

 $ref_AttrHash = construct( \@array );

Creating the next capability

First, let's concentrate on the heart of the iterator, the next capability, which must be implemented as a closure.

next has three responsibilities:

  • return the next data element

  • signal exhaustion

  • (optionally) signal an error.

It usually also ensures that the current and prev capabilities return the proper values. Because it is called most often, it should be as efficient as possible.

next cannot keep state internally. Our construct subroutine will store the state in lexical variables which only our instance of next will have access to.

To illustrate, here's an implementation of next for iteration over an array:

 my $next = sub {
     if ( $next == $len ) {
         # if first time through, set prev
         $prev = $current
           if ! $self->is_exhausted;
         return $current = $self->signal_exhaustion;
     }
     $prev    = $current;
     $current = $next++;

     return $arr->[$current];
 };

Notice that the subroutine doesn't take any parameters. Also notice that it uses a number of variables that are not defined in the subroutine, e.g. $arr, $next, etc. These are lexical variables in construct and are initialized outside of the $next closure.

$arr is the array we're operating on, $len is its length (so we don't have to look it up every time). Because it's cheap to retain the state of an array (it's just an index), we can easily keep track of what is needed to implement the prev and current capabilities; those are stored in $prev and $current.

Finally, there's $self, which is a handle for our iterator. It's not used for any performance critical work.

These must all be properly initialized by construct before $next is created; we'll go over that later. Let's first look at the code for the $next closure.

The code is divided into two sections; the first deals with data exhaustion:

     if ( $next == $len ) {
         # if first time through, set prev
         $prev = $current
           if ! $self->is_exhausted;
         return $current = $self->signal_exhaustion;
     }

Every time the iterator is invoked, it checks if it has run out of data. If it has (i.e. $next == $len) then the iterator sets up the exhaustion phase. The is_exhausted predicate will be true if the iterator is already in the exhaustion phase. If it is, it doesn't need to perform work required to handle other capabilities. In our case, the first time the iterator is in the exhausted state it must set $prev so that it correctly returns the last element in the array (which will be $current from the last successful iteration).

Then, it signals exhaustion by returning the result of its signal_exhaustion method (and setting $current to that value, so the current capability will return the correct value).

Recall that it is the client code that determines how the iterator will signal exhaustion (i.e., via a sentinel value or an exception). The iterator itself doesn't care; it simply returns the result of the signal_exhaustion method, which will set the is_exhausted object predicate and then either return a sentinel value or throw an exception.

In other iterator implementations (e.g. C++, Raku), calling next (or other methods) on an exhausted iterator is undefined behavior. This is not true for Iterator::Flex iterators. An exhausted iterator must always respond, identically, to a call to next, so must always return the result of the signal_exhaustion method.

The second part of the code takes care of returning the correct data and setting the iterator up for the succeeding call to next. It also ensures that the current and prev capabilities will return the proper values:

     $prev    = $current;
     $current = $next++;

     return $arr->[$current];
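The two sections can be exercised in isolation. The sketch below wraps the same logic in a closure and substitutes a tiny stub object for $self. The My::StubSelf package is hypothetical: it always uses undef as its sentinel, whereas a real iterator receives its $self (and its exhaustion policy) from Iterator::Flex. The closure is named $next_sub here only to keep it distinct from the $next index.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the iterator object: it returns an undef
# sentinel and records that exhaustion has been signalled.
{
    package My::StubSelf;
    sub new               { return bless { exhausted => 0 }, shift }
    sub is_exhausted      { return $_[0]{exhausted} }
    sub signal_exhaustion { $_[0]{exhausted} = 1; return undef }
}

my $arr  = [ 10, 20 ];
my $len  = @$arr;
my $self = My::StubSelf->new;
my ( $next, $prev, $current ) = ( 0, undef, undef );

my $next_sub = sub {
    if ( $next == $len ) {
        # first time through, remember the last element returned
        $prev = $current if !$self->is_exhausted;
        return $current = $self->signal_exhaustion;
    }
    $prev    = $current;
    $current = $next++;
    return $arr->[$current];
};

# Call twice more than there are elements to show that an exhausted
# iterator keeps answering in the same way.
my @values;
push @values, scalar $next_sub->() for 1 .. 4;
```

After these four calls @values holds 10, 20 and two undefined values, $prev is the index of the last element, and $current is undefined.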

Other capabilities

For completeness, here's the implementation of the rest of the iterator's capabilities:

 my $reset   = sub { $prev = $current = undef;  $next = 0; };
 my $rewind  = sub { $next = 0; };
 my $prev    = sub { return defined $prev ? $arr->[$prev] : undef; };
 my $current = sub { return defined $current ? $arr->[$current] : undef; };

They have been written as closures accessing the lexical variables, but they could also have been written as methods if the iterator chose to store its state in some other fashion. Only next must be a closure.
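A small stand-alone sketch may help show how rewind and reset differ. It uses plain closures only, and the names $advance, $get_prev and $get_current are chosen for this example so that the closures do not clash with the state variables. The $advance closure is a simplified stand-in for next that omits exhaustion handling.

```perl
use strict;
use warnings;

my $arr = [ 'a', 'b', 'c' ];
my ( $next, $prev, $current ) = ( 0, undef, undef );

# Simplified advance step (no exhaustion handling) to drive the demo.
my $advance = sub {
    $prev    = $current;
    $current = $next++;
    return $arr->[$current];
};

my $get_prev    = sub { return defined $prev    ? $arr->[$prev]    : undef };
my $get_current = sub { return defined $current ? $arr->[$current] : undef };
my $rewind      = sub { $next = 0 };
my $reset       = sub { $prev = $current = undef; $next = 0 };

$advance->() for 1 .. 2;    # current is now 'b', prev is 'a'

# rewind restarts the stream but keeps prev and current.
$rewind->();
my @after_rewind = ( $get_prev->(), $get_current->(), $advance->() );

# reset restarts the stream and forgets prev and current.
$reset->();
my @after_reset = ( $get_prev->(), $get_current->() );
```

After the rewind, prev and current still report 'a' and 'b' while the next element is 'a' again; after the reset both are undefined.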

Initialization Phase

Finally, we'll get to the iterator initialization phase, which may make more sense now that we've gone through the other phases. Recall that we are using closed over variables to keep track of state.

Our code should look something like this:

  sub construct ( $array ) {

    # initialize lexical variables here
    my $next = ...;
    my $prev = ...;
    my $current = ...;
    my $arr = ...;
    my $len = ...;

    my $self = ...;

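    # create our closures
    # (illustrative only: in real code give these distinct names, or place
    # them directly in the hash, so they don't mask the variables above)
    my $next = sub { ... };
    my $prev = sub { ... };
    ...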

    # return our %AttrHash:
    return {
             _self => \$self,
              next => $next,
              prev => $prev,
           current => $current,
             reset => $reset,
            rewind => $rewind,
    };
  }

The first five lexical variables are easy:

  my $next = 0;
  my $prev = undef;
  my $current = undef;
  my $arr = $array ;
  my $len = $array->@*;

Now, what about $self? It is a reference to our iterator object, but the object hasn't been created yet; that's done when %AttrHash is passed to Iterator::Flex::Factory. So where does $self get initialized? The answer lies in the _self entry in %AttrHash, which holds a reference to $self. When Iterator::Flex::Factory creates the iterator object it uses the _self entry to initialize $self. (Note that $self is not a reference to a hash. You cannot store data in it.)
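The mechanism relies only on ordinary Perl references, and can be sketched without Iterator::Flex. In the example below, My::FakeIterator and its name method are hypothetical stand-ins for the real iterator object, and the last few lines only imitate, in miniature, what a factory might do with the _self entry.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the object the factory would create.
{
    package My::FakeIterator;
    sub new  { return bless {}, shift }
    sub name { return 'fake iterator' }
}

sub construct {
    my $self;    # not yet known; filled in later through the reference
    return {
        _self => \$self,
        next  => sub { return $self->name },
    };
}

# In miniature, what a factory might do: build the object, then store
# it in the closed-over variable via the _self reference.
my $attr = construct();
my $obj  = My::FakeIterator->new;
${ $attr->{_self} } = $obj;

# The closure now sees the object, although it was created afterwards.
my $result = $attr->{next}->();
```

Because the closure and the _self entry refer to the same lexical variable, assigning through the reference is enough to make the object visible inside next.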

Wrapping up

At this point construct is functionally complete; given an array it'll return a hash that can be fed to the iterator factory.

Passing the %AttrHash to the factory

Iterators may be constructed on-the-fly, or may be formalized as classes.

A one-off iterator

This approach uses "construct_from_attrs" in Iterator::Flex::Factory to create an iterator object from our %AttrHash:

  my @array = ( 1..100 );
  my $AttrHash = construct( \@array );
  $iter = Iterator::Flex::Factory->construct_from_attrs( $AttrHash, \%opts );

In addition to %AttrHash, construct_from_attrs takes another options hash, which is where the exhaustion policy is set.

In this case, we can choose one of the following entries:

  • exhaustion => 'throw';

    On exhaustion, throw an exception object of class Iterator::Flex::Failure::Exhausted.

  • exhaustion => [ return => $sentinel ];

    On exhaustion, return the specified sentinel value.

The default is

  exhaustion => [ return => undef ];

At this point $iter is initialized and ready for use.

An iterator class

Creating a class requires a few steps more, and gives the following benefits:

  • A much cleaner interface, e.g.

      $iter = Iterator::Flex::Array->new( \@array );

    vs. the multi-liner above.

  • The ability to freeze and thaw the iterator.

  • Some of the construction costs can be moved from run time to compile time.

An iterator class must

  • subclass Iterator::Flex::Base;

  • provide two class methods, new and construct; and

  • register its capabilities.

new

The new method converts from the API most comfortable to your usage to the internal API used by Iterator::Flex::Base. By convention, the last argument should be reserved for a hashref containing general iterator arguments (such as the exhaustion key). This hashref is documented in "new_from_attrs" in Iterator::Flex::Base.

The super class' constructor takes two arguments: a variable containing iterator specific data (state), and the above-mentioned general argument hash. The state variable can take any form; it is not interpreted by the Iterator::Flex framework.

Here's the code for "new" in Iterator::Flex::Array:

  sub new ( $class, $array, $pars={} ) {
      $class->_throw( parameter => "argument must be an ARRAY reference" )
        unless Ref::Util::is_arrayref( $array );
      $class->SUPER::new( { array => $array }, $pars );
  }

It's pretty simple. It checks that it was given an array reference, stores the passed array (the state) in a hash, and passes that hash and the general options hash (empty if none was given) to the super class' constructor. (A hash is used here because Iterator::Flex::Array can be serialized, and extra state is required to do so.)

construct

The construct class method's duty is to return a %AttrHash. It's called as

  $AttrHash = $class->construct( $state );

where $state is the state variable passed to "new" in Iterator::Flex::Base. Unsurprisingly, it is remarkably similar to the construct subroutine developed earlier.

There are a few differences:

  • The signature changes, as this is a class method, rather than a subroutine.

  • There are additional %AttrHash entries available: _roles, which supports run-time enabling of capabilities, and freeze, which supports serialization.

  • Capabilities other than next can be implemented as actual class methods, rather than closures. This decreases the cost of creating iterators (because they only need to be compiled once, rather than for every instance of the iterator) but increases run time costs, as they cannot use closed over variables to access state information.

Registering Capabilities

"construct_from_attrs" in Iterator::Flex::Factory helpfully looks at %AttrHash to determine which capabilities are provided, but it does so at run time. Classes, in contrast, are encouraged to register their capabilities at compile time via the _add_roles method. For the example iterator class, this would be done via

  __PACKAGE__->_add_roles( qw[
        State::Registry
        Next::ClosedSelf
        Rewind::Closure
        Reset::Closure
        Prev::Closure
        Current::Closure
  ] );

(These are all accepted shorthand for roles in the Iterator::Flex::Role namespace.)

If capabilities must be added at run time, use the _roles entry in %AttrHash.

The specific roles used here are:

Next::ClosedSelf

This indicates that the next capability uses a closed over $self variable, and that Iterator::Flex should use the _self hash entry to initialize it.

State::Registry

This indicates that the exhaustion state should be stored in the central iterator Registry. Another implementation uses a closed over variable (and the role State::Closure). See "Exhaustion" in Iterator::Flex::Manual::Internals.

Reset::Closure
Prev::Closure
Current::Closure
Rewind::Closure

These indicate that the named capability is present and implemented as a closure.

All together

  package My::Array;

  use strict;
  use warnings;

  use Ref::Util ();    # provides is_hashref and is_arrayref, used in new
  use parent 'Iterator::Flex::Base';

  sub new {
      my $class = shift;
      my $gpar = Ref::Util::is_hashref( $_[-1] ) ? pop : {};

      $class->_throw( parameter => "argument must be an ARRAY reference" )
        unless Ref::Util::is_arrayref( $_[0] );

      $class->SUPER::new( { array => $_[0] }, $gpar );
  }

  sub construct {
      my ( $class, $state ) = @_;

      # initialize lexical variables here
      ...
      my $arr = $state->{array};

      my %AttrHash = ( ... );
      return \%AttrHash;
  }

  __PACKAGE__->_add_roles( qw[
        State::Registry
        Next::ClosedSelf
        Rewind::Closure
        Reset::Closure
        Prev::Closure
        Current::Closure
  ] );

  1;

INTERNALS

SUPPORT

Bugs

Please report any bugs or feature requests to bug-iterator-flex@rt.cpan.org or through the web interface at: https://rt.cpan.org/Public/Dist/Display.html?Name=Iterator-Flex

Source

Source is available at

  https://gitlab.com/djerius/iterator-flex

and may be cloned from

  https://gitlab.com/djerius/iterator-flex.git

SEE ALSO

Please see those modules/websites for more information related to this module.

AUTHOR

Diab Jerius <djerius@cpan.org>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018 by Smithsonian Astrophysical Observatory.

This is free software, licensed under:

  The GNU General Public License, Version 3, June 2007