PICA::Parser - Parse PICA+ data
use PICA::Parser;

PICA::Parser->parsefile( $filename_or_handle,
    Field  => \&field_handler,
    Record => \&record_handler
);

PICA::Parser->parsedata( $string_or_function,
    Field  => \&field_handler,
    Record => \&record_handler,
    Limit  => 5
);

$parser = PICA::Parser->new(
    Record  => \&record_handler,
    Proceed => 1
);
$parser->parsefile( $filename );
$parser->parsedata( $picadata );
print $parser->counter() . " records read.\n";
You can also export parsedata and parsefile:
use PICA::Parser qw(parsefile);
parsefile( $filename, Record => sub {
    my $record = shift;
    print $record->string . "\n";
});
Both functions return the parser, so you can use constructs like
my @records = parsefile($filename)->records();
To parse just one record you can use the special function readpicarecord, which can be exported by PICA::Record:
use PICA::Record qw(readpicarecord);
my $record = readpicarecord( $file );
Another method is to limit the parser to one record:
my ($record) = PICA::Parser->parsefile( $file, Limit => 1 )->records();
A PICA::Parser may emit some error messages to STDOUT but ignores most errors. If you do not want broken fields to be ignored, add an error handler with FieldError:
my $parser = PICA::Parser->new( FieldError => sub { my $msg = shift; return $msg; } );
Broken records are then passed to a second error handler, RecordError. To suppress all error messages and just ignore records with errors:
my $parser = PICA::Parser->new(
    FieldError  => sub { return; },
    RecordError => sub { return; }
);
This module can be used to parse normalized PICA+ and PICA+ XML. The concrete parsers are implemented in PICA::PlainParser and PICA::XMLParser.
The constructor new creates a parser that stores common parameters (see below). These parameters are used as defaults when calling parsefile or parsedata. Note that you do not have to call the constructor to use PICA::Parser. These two calls do the same:
PICA::Parser->new( %params )->parsefile( $file );
PICA::Parser->parsefile( $file, %params );
And for parsing plain data:
PICA::Parser->new( %params )->parsedata( $data );
PICA::Parser->parsedata( $data, %params );
Common parameters that are passed to the specific parser are:
Field is a reference to a handler function for parsed PICA+ fields. The function is passed a PICA::Field object and should return it to the parser. You can use this function as a simple filter by returning a modified field. If undef is returned, the field is skipped. If a value that is not a PICA::Field is returned, it is used as an error message and the record is marked as broken.
Record is a reference to a handler function for parsed PICA+ records. The function is passed a PICA::Record. If the function returns a record, this record is stored in an array that is passed to Collection. You can use this handler as a filter by returning a (modified) record, undef, or an integer. If any other defined value is returned, it is used as an error message (broken record) and the record error handler is called.
Offset gives the number of records to skip. Default is zero.
Limit stops the parser after a given number of records. A non-positive number means no limit.
The FieldError handler is called with the character data of a line and an error message when an input line could not be parsed into a PICA::Field object. By default such lines produce an error message on STDOUT and are otherwise ignored. You can provide an error handler that either fixes the line by returning a PICA::Field, returns undef to ignore the error, or returns a true value to mark the whole record as broken, so that the RecordError handler is called afterwards.
The RecordError handler is called with a record object or undef and an error message when a broken record was parsed. By default only empty records are marked as broken.
By default the internal counters are reset and all read records are forgotten before each call of parsefile and parsedata. If you set the Proceed parameter to a true value, the same parser is reused without resetting the counters and the read records.
Error handling is only implemented in PICA::PlainParser so far!
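As a stand-alone illustration of the Record handler rules described above, the following sketch classifies a handler's return value. It does not use or reproduce the module's code: classify_record_result is a hypothetical helper, the object is only a stand-in blessed into PICA::Record, and treating an integer like undef (record not stored, no error) is an assumption, since the text only says that integers are allowed.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper (not part of PICA::Parser): classify what a
# Record handler returned, following the rules described above.
sub classify_record_result {
    my ($result) = @_;

    # undef: the record is filtered out.
    return 'skip' unless defined $result;

    # A PICA::Record (possibly modified) is kept.
    return 'keep' if blessed($result) && $result->isa('PICA::Record');

    # A plain integer is allowed; assumed here to be handled like undef.
    return 'skip' if !ref($result) && $result =~ /^-?\d+$/;

    # Any other defined value is used as an error message (broken record).
    return 'error';
}

# Stand-in object, so the sketch runs without the real module installed.
my $fake_record = bless({}, 'PICA::Record');

print classify_record_result($fake_record), "\n";      # keep
print classify_record_result(undef), "\n";             # skip
print classify_record_result(0), "\n";                 # skip
print classify_record_result('empty record'), "\n";    # error
```

A real handler would simply return one of these kinds of values; the parser decides what to do with it.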
parsefile parses PICA+ data from a file, specified by a filename or filehandle. The default parser is PICA::PlainParser. If the filename extension is .xml or .xml.gz, or the Format parameter is set to xml, then PICA::XMLParser is used instead.
PICA::Parser->parsefile( "data.picaplus", Field => \&field_handler );
PICA::Parser->parsefile( \*STDIN, Field => \&field_handler, Format => 'XML' );
PICA::Parser->parsefile( "data.xml", Record => sub { ... } );
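The choice between the two concrete parsers can be sketched as a small stand-alone function. This is not the module's code: guess_parser_class is a hypothetical helper, and matching the Format value case-insensitively is an assumption based on the example that passes 'XML'.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PICA::Parser): decide which concrete
# parser class the rules above would select for a file.
sub guess_parser_class {
    my ($file, %params) = @_;

    # Format set to xml selects the XML parser (case-insensitive
    # matching is an assumption of this sketch).
    return 'PICA::XMLParser'
        if defined $params{Format} && lc($params{Format}) eq 'xml';

    # Only plain filenames have an extension to inspect; filehandles
    # passed as references fall through to the default.
    return 'PICA::XMLParser'
        if defined $file && !ref($file) && $file =~ /\.xml(?:\.gz)?$/;

    return 'PICA::PlainParser';
}

print guess_parser_class('data.picaplus'), "\n";            # PICA::PlainParser
print guess_parser_class('data.xml.gz'), "\n";              # PICA::XMLParser
print guess_parser_class(\*STDIN, Format => 'XML'), "\n";   # PICA::XMLParser
```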
See the constructor new for a description of parameters.
parsedata parses data from a string, array reference, function, or PICA::Record object and returns the PICA::Parser that was used. See parsefile and the parsedata method of PICA::PlainParser and PICA::XMLParser for a description of parameters. PICA::PlainParser is used unless the Format parameter is set to xml.
PICA::Parser->parsedata( $picastring, Field => \&field_handler );
PICA::Parser->parsedata( \@picalines, Field => \&field_handler );

# called as a function
my @records = parsedata( $picastring )->records();
records returns an array of the read records (as returned by the record handler, which can thus be used as a filter). If no record handler was specified, records are collected unmodified. For large record sets it is recommended not to collect the records but to process them directly in a record handler.
counter returns the number of records read so far. Note that the number of records returned by the records method may be lower, because some records may have been filtered out.
Enable :utf8 layer for a given filehandle unless it or some other encoding has already been enabled. You should not need this method.
Internal method to get a new concrete parser for this object. By default it returns a PICA::PlainParser unless you specify the Format parameter. Single parameters override the default parameters specified in the constructor (except the Proceed parameter).
Better logging needs to be added, for instance a status message every n records. This may be implemented with multiple (piped?) handlers per record. Error handling of broken records should also be improved.
Jakob Voss <jakob.voss@gbv.de>
Copyright (C) 2007-2009 by Verbundzentrale Goettingen (VZG) and Jakob Voss
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.