package XML::SAX::ByRecord;
=head1 NAME
XML::SAX::ByRecord - Record oriented processing of (data) documents
=head1 SYNOPSIS
    use XML::SAX::Machines qw( ByRecord ) ;

    my $m = ByRecord(
        "My::RecordFilter1",
        "My::RecordFilter2",
        ...
        {
            Handler => $h, ## optional
        }
    );

    $m->parse_uri( "foo.xml" );
=head1 DESCRIPTION
XML::SAX::ByRecord is a SAX machine that treats a document as a series
of records. Everything before and after the records is emitted as-is
while the records are excerpted into mini-documents and run one
at a time through the filter pipeline contained in ByRecord.
The output document has exactly the same content before, after, and
between the records as the input document, but each record has been run
through the filter pipeline. So if a document has 10 records in it,
the per-record filter pipeline will see 10 sets of ( start_document,
body of record, end_document ) events. An example is below.
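The routing described above can be sketched in plain Perl, without any SAX
modules. This is only an illustration, not the splitter's real code: the
event list and the C<split_by_record> helper are hypothetical, and it
assumes that a record is any element that is a direct child of the root.

```perl
use strict;
use warnings;

# Hypothetical event list for: <root> a <rec>b</rec> c <rec>d</rec> e </root>
my @events = (
    [ start_element => 'root' ],
    [ characters    => ' a '  ],
    [ start_element => 'rec'  ],
    [ characters    => 'b'    ],
    [ end_element   => 'rec'  ],
    [ characters    => ' c '  ],
    [ start_element => 'rec'  ],
    [ characters    => 'd'    ],
    [ end_element   => 'rec'  ],
    [ characters    => ' e '  ],
    [ end_element   => 'root' ],
);

# Route each event to the per-record pipeline or to the bypass.
# Assumption: a record is any element at depth 2 (a child of the root).
sub split_by_record {
    my @input = @_;
    my $depth = 0;
    my ( @pipeline, @bypass );

    for my $event ( @input ) {
        my ( $type, $value ) = @$event;

        if ( $type eq 'start_element' ) {
            $depth++;
            # Each record is wrapped in its own mini-document.
            push @pipeline, 'start_document' if $depth == 2;
        }

        my $target = $depth >= 2 ? \@pipeline : \@bypass;
        push @$target, "$type: $value";

        if ( $type eq 'end_element' ) {
            push @pipeline, 'end_document' if $depth == 2;
            $depth--;
        }
    }

    return ( \@pipeline, \@bypass );
}

my ( $pipeline, $bypass ) = split_by_record( @events );
print "pipeline: $_\n" for @$pipeline;
print "bypass:   $_\n" for @$bypass;
```

With two records in the input, the pipeline list holds two
( start_document, record events, end_document ) groups, and everything
else lands in the bypass list.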
This has several use cases:
=over
=item *
Big, record oriented documents
Big documents can be treated a record at a time with various DOM oriented
processors like L<XML::Filter::XSLT>.
=item *
Streaming XML
Small sections of an XML stream can be run through a document processor
without holding up the stream.
=item *
Record oriented style sheets / processors
Sometimes it's just plain easier to write a style sheet or SAX filter that
applies to a single record at a time, rather than having to run through a
series of records.
=back
=head2 Topology
Here's how the innards look:
    +-----------------------------------------------------------+
    |                   An XML::SAX::ByRecord                   |
    |    Intake                                                 |
    |   +----------+    +---------+         +--------+  Exhaust |
  --+-->| Splitter |--->| Stage_1 |-->...-->| Merger |----------+----->
    |   +----------+    +---------+         +--------+          |
    |               \                            ^              |
    |                \                           |              |
    |                 +---------->---------------+              |
    |                   Events not in any records               |
    |                                                           |
    +-----------------------------------------------------------+
The C<Splitter> is an L<XML::Filter::DocSplitter> by default, and the
C<Merger> is an L<XML::Filter::Merger> by default. The line that
bypasses the "Stage_1 ..." filter pipeline is used for all events that
do not occur in a record. All events that occur in a record pass
through the filter pipeline.
=head2 Example
Here's a quick little filter to uppercase text content:
    package My::Filter::Uc;

    use vars qw( @ISA );
    @ISA = qw( XML::SAX::Base );

    use XML::SAX::Base;

    sub characters {
        my $self = shift;
        my ( $data ) = @_;
        $data->{Data} = uc $data->{Data};
        $self->SUPER::characters( @_ );
    }
And here's a little machine that uses it:
    $m = Pipeline(
        ByRecord( "My::Filter::Uc" ),
        \$out,
    );
When fed a document like:
    <root> a
        <rec>b</rec> c
        <rec>d</rec> e
        <rec>f</rec> g
    </root>
the output looks like:
    <root> a
        <rec>B</rec> c
        <rec>D</rec> e
        <rec>F</rec> g
    </root>
and the My::Filter::Uc got three sets of events like:
    start_document
    start_element: <rec>
    characters:    'b'
    end_element:   </rec>
    end_document

    start_document
    start_element: <rec>
    characters:    'd'
    end_element:   </rec>
    end_document

    start_document
    start_element: <rec>
    characters:    'f'
    end_element:   </rec>
    end_document
=cut
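## new() calls SUPER::new, so the machine base class must be inherited.
use base qw( XML::SAX::Machine );

$VERSION = 0.1;

use strict;
use Carp;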
=head1 METHODS
=over
=item new
my $d = XML::SAX::ByRecord->new( @channels, \%options );
Longhand for calling the ByRecord function exported by XML::SAX::Machines.
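As a rough illustration of what the constructor assembles, the sketch below
rebuilds the same kind of machine spec in plain Perl and prints it. The
C<build_machine_spec> helper is hypothetical and exists only for this
example; it does not load any SAX modules or build a working machine.

```perl
use strict;
use warnings;

# Hypothetical helper: mirrors the spec-building steps in new().
# Each entry is [ part name, processor class, downstream part name ].
sub build_machine_spec {
    my @filters   = @_;
    my $stage_num = 0;

    my @spec = (
        [ Intake => "XML::Filter::DocSplitter" ],
        map( [ "Stage_" . $stage_num++ => $_ ], @filters ),
        [ Merger => "XML::Filter::Merger" => qw( Exhaust ) ],
    );

    # Point every part before the last stage at the stage that follows it.
    push @{ $spec[$_] }, "Stage_" . $_ for 0 .. $#spec - 2;

    # The last stage (or the Intake, when there are no stages) feeds the Merger.
    push @{ $spec[-2] }, "Merger";

    return @spec;
}

print join( " => ", @$_ ), "\n"
    for build_machine_spec( "My::RecordFilter1", "My::RecordFilter2" );
```

Note that stage names are numbered from 0 here, as in the constructor, so
two filters become C<Stage_0> and C<Stage_1>.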
=cut
sub new {
    my $proto = shift;
    my $class = ref $proto || $proto;

    ## A trailing hash ref, if present, holds the options.
    my @options_hash_if_present = @_ && ref $_[-1] eq "HASH" ? pop : ();

    my $stage_num = 0;

    my @machine_spec = (
        [ Intake => "XML::Filter::DocSplitter" ],
        map( [ "Stage_" . $stage_num++ => $_ ], @_ ),
        [ Merger => "XML::Filter::Merger" => qw( Exhaust ) ],
    );

    ## Connect each part to the stage that follows it.
    push @{$machine_spec[$_]}, "Stage_" . $_
        for 0..$#machine_spec-2;

    ## Connect the last part before the Merger to the Merger.
    push @{$machine_spec[-2]}, "Merger"
        if @machine_spec;

    my $self = $proto->SUPER::new( @machine_spec, @options_hash_if_present );

    my $distributor = $self->find_part( 0 );
    $distributor->set_aggregator( $self->find_part( -1 ) )
        if $distributor->can( "set_aggregator" );

    return $self;
}
=back
=head1 CREDIT
Proposed by Matt Sergeant, with advice by Kip Hampton and Robin Berjon.
=head1 Writing an aggregator.
To be written. In short, an aggregator needs to provide
C<start_manifold_processing> and C<end_manifold_processing>. See
L<XML::Filter::Merger> and its source code for a starting point.
=cut
1;