NAME

Text::Parser - Simplifies text parsing. Easily extensible to parse any text format.

VERSION

version 0.918

SYNOPSIS

    use Text::Parser;

    my $parser = Text::Parser->new();
    $parser->read(shift);
    print $parser->get_records, "\n";

The above code treats the first command-line argument as the name of a text file and prints the content of that file to STDOUT. If the argument is not the name of a text file, the code throws an exception and exits.

    package MyParser;

    use parent 'Text::Parser';
    ## or use Moose; extends 'Text::Parser';

    sub save_record {
        my $self = shift;
        ## ...
    }

    package main;

    my $parser = MyParser->new(auto_split => 1, auto_chomp => 1, auto_trim => 'b');
    $parser->read(shift);
    foreach my $rec ($parser->get_records) {
        ## ...
    }

The above example shows how Text::Parser could be easily extended to parse a specific text format.

RATIONALE

Text parsing is perhaps the single most common thing that almost every Perl program does. Yet we don't have a lean, flexible text parsing utility. Ideally, the developer should only have to specify the "grammar" of the text file she intends to parse. Everything else, like opening a file handle, closing the file handle, tracking line-count, joining continued lines into one, reporting any errors in line continuation, trimming white space, splitting each line into fields, etc., should be automatic.

Unfortunately, most file parsing code looks like this:

    open FH, "<$fname";
    my $line_count = 0;
    while (<FH>) {
        $line_count++;
        chomp;
        $_ = trim $_;  ## From String::Util
        my (@fields) = split /\s+/;
        # do something for each line ...
    }
    close FH;

Note that a developer may have to repeat all of the above if she has to read another file with different content or format. And if the target text format allows line-wrapping with a continuation character, it isn't easy to implement it well with this while loop.

With Text::Parser, developers can focus on specifying the grammar and simply use the read method. Just extend (inherit) this class and override one method (save_record). Voila! You have a parser. The EXAMPLES section illustrates how easy this can be.

DESCRIPTION

Text::Parser is a format-agnostic text parsing base class. Derived classes can specify the format-specific syntax they intend to parse.

Future versions are expected to include progress-bar support, parsing text from sockets, UTF support, or parsing from a chunk of memory.

CONSTRUCTOR

new

Takes optional attributes, as in the example below. See the ATTRIBUTES section for a list of the attributes and their descriptions.

    my $parser = Text::Parser->new(
        auto_chomp      => 0,
        multiline_type  => 'join_last',
        auto_trim       => 'b',
        auto_split      => 1,
        FS              => qr/\s+/,
    );

ATTRIBUTES

The attributes below can be used as options to the new constructor. Each attribute has an accessor with the same name.

auto_chomp

Read-write attribute. Takes a boolean value as parameter. Defaults to 0.

    print "Parser will chomp lines automatically\n" if $parser->auto_chomp;

auto_split

A set-once-only attribute that can be set only during object construction. Defaults to 0. This attribute indicates if the parser will automatically split every line into fields.

If it is set to a true value, each line will be split into fields, and six methods (like field, find_field, etc.) become accessible within the save_record method. These methods are documented in Text::Parser::AutoSplit.
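
As a minimal sketch of auto_split in use, the hypothetical subclass below keeps only the first field of each line. The package name FirstFieldParser and the temporary input file are made up for illustration. The field method is the one named above; the guard for an undefined first field is a defensive assumption, so check Text::Parser::AutoSplit for the exact behavior.

    package FirstFieldParser;    # hypothetical name, for illustration only
    use strict;
    use warnings;
    use parent 'Text::Parser';

    sub save_record {
        my $self  = shift;
        my $first = $self->field(0);     # available because auto_split is on
        return if not defined $first;    # defensive: skip lines without fields
        $self->SUPER::save_record($first);
    }

    package main;
    use File::Temp qw(tempfile);

    # Write a small sample file so that the sketch is self-contained.
    my ( $fh, $fname ) = tempfile( UNLINK => 1 );
    print {$fh} "alpha 1\nbeta 2\n";
    close $fh;

    my $parser = FirstFieldParser->new( auto_split => 1, auto_chomp => 1 );
    $parser->read($fname);
    my @first_fields = $parser->get_records;
    print join( ',', @first_fields ), "\n";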

auto_trim

Read-write attribute. Accepts one of 'l', 'r', 'b', or 'n', as shown in the example below. Defaults to 'n' (neither side is trimmed).

    $parser->auto_trim('l');       # 'l' (left), 'r' (right), 'b' (both), 'n' (neither) (Default)

FS

Read-write attribute that can be used to specify the field separator along with auto_split attribute. It must be a regular expression reference enclosed in the qr function, like qr/\s+|[,]/ which will split across either spaces or commas. The default value for this argument is qr/\s+/.

The name for this attribute comes from the built-in FS variable in the popular GNU Awk program.

    $parser->FS( qr/\s+\(*|\s*\)/ );

FS can be changed in your implementation of save_record. But the changes would take effect only on the next line.
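
To get a feel for a separator before handing it to the parser, it can be tried with Perl's built-in split. The sketch below uses plain Perl only and assumes that the parser's field splitting behaves like split with the same pattern. The sample strings and the alternative pattern in $fs2 are made up for illustration.

    use strict;
    use warnings;

    # The pattern from the text: split on whitespace or on a single comma.
    my $fs     = qr/\s+|[,]/;
    my @simple = split $fs, 'alpha,beta gamma';

    # A comma followed by a space matches twice (once for the comma, once
    # for the space), which leaves an empty field in between.
    my @gappy = split $fs, 'alpha, beta';

    # One way around that: let the comma absorb the surrounding whitespace.
    my $fs2   = qr/\s*,\s*|\s+/;
    my @tight = split $fs2, 'alpha, beta gamma';

    print join( '|', @simple ), "\n";
    print join( '|', @gappy ),  "\n";
    print join( '|', @tight ),  "\n";

The second line of output shows two adjacent separators, which is a sign that the pattern needs tightening for that input.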

multiline_type

If the target text format allows line-wrapping with a continuation character, the multiline_type option tells the parser to join the wrapped lines into a single line. When setting this attribute, one must usually re-define two more methods: is_line_continued and join_last_line. See Example 4 under EXAMPLES.

By default, the multiline_type attribute has a value of undef, i.e., the target text format will not have wrapped lines. It can be set to either 'join_next' or 'join_last'. Once set, it cannot be set back to undef again.

    $parser->multiline_type(undef);
    $parser->multiline_type('join_next');

    my $mult = $parser->multiline_type;
    print "Parser is a multi-line parser of type: $mult\n" if defined $mult;

  • If the target format allows line-wrapping to the next line, set multiline_type to join_next. The "Continue with character" example under Example 4 illustrates this case.

  • If the target format allows line-wrapping from the last line, set multiline_type to join_last. The "Simple SPICE line joiner" example under Example 4 illustrates this case.

  • To "slurp" a file into a single string, set multiline_type to join_last. In this special case, you don't need to re-define the is_line_continued and join_last_line methods. See the "Trivial line-joiner" example under Example 4.
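
The difference between the two styles can be sketched in plain Perl. This is a conceptual model only, not the module's implementation. The helper names join_next_style and join_last_style are hypothetical, and the continuation characters (a trailing backslash and a leading +) are borrowed from the examples later in this document.

    use strict;
    use warnings;

    # join_next style: a trailing backslash says "the next line belongs to me".
    sub join_next_style {
        my @lines = @_;
        my ( @records, $pending );
        for my $line (@lines) {
            $line = $pending . $line if defined $pending;
            if ( $line =~ s/\\\s*$/ / ) {
                $pending = $line;    # still waiting for the next line
            }
            else {
                push @records, $line;
                undef $pending;
            }
        }
        return @records;    # a dangling continuation is dropped in this sketch
    }

    # join_last style: a leading + says "I belong to the previous line".
    sub join_last_style {
        my @lines = @_;
        my @records;
        for my $line (@lines) {
            if ( @records and $line =~ s/^[+]\s*/ / ) {
                $records[-1] .= $line;
            }
            else {
                push @records, $line;
            }
        }
        return @records;
    }

    my @joined = join_next_style( 'Garbage In.\\', 'Garbage Out!' );
    my @spice  = join_last_style( 'R1 a b', '+ 10k', 'C1 b 0' );
    print "$_\n" for @joined, @spice;

With join_next a record is complete as soon as a line without the continuation mark is seen. With join_last a record is only known to be complete once the following line has been inspected.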

METHODS

These are meant to be called from the main program or within subclasses. In general, don't override them; just use them.

read

Takes an optional argument, either a string containing the name of the file, or a filehandle reference (a GLOB) like \*STDIN or an object of the FileHandle class.

    $parser->read($filename);

    # The above is equivalent to the following
    $parser->filename($filename);
    $parser->read();

    # You can also read from a previously opened file handle directly
    $parser->filehandle(\*STDIN);
    $parser->read();

Returns once all records have been read, when an exception is thrown, or when reading has been aborted with the abort_reading method.

If you provide a filename as input, the function will handle all open and close operations on files even if any exception is thrown, or if the reading has been aborted. But if you pass a file handle GLOB or FileHandle object instead, then the file handle won't be closed and it will be the responsibility of the calling program to close the filehandle.

    $parser->read('myfile.txt');
    # Will handle open, parsing, and closing of file automatically.

    open MYFH, "<myfile.txt" or die "Can't open file myfile.txt: $!";
    $parser->read(\*MYFH);
    # Will not close MYFH; closing it is the responsibility of the calling program.
    close MYFH;

Note: To extend the class to other file formats, override save_record.

filename

Takes an optional string argument containing the name of a file. Returns the name of the file that was last opened if any. Returns undef if no file has been opened.

    print "Last read ", $parser->filename, "\n";

The file name is "persistent" in the object. This means that the filename method remembers the last file that was read.

    $parser->read(shift @ARGV);
    print $parser->filename(), ":\n",
          "=" x (length($parser->filename())+1),
          "\n",
          $parser->get_records(),
          "\n";

A read call with a filehandle resets the last file name.

    $parser->read(\*MYFH);
    print "Last file name is lost\n" if not defined $parser->filename();

filehandle

Takes an optional argument that is a filehandle GLOB (such as \*STDIN) or an object of the FileHandle class. Returns the filehandle last saved, or undef if none was saved.

    my $fh = $parser->filehandle();

As with the filename method, filehandle is also "persistent" and remembers its previous state even after read.

    my $lastfh = $parser->filehandle();
    ## Returns the filehandle saved earlier, e.g. \*STDIN
    
    $parser->read('another.txt');
    print "No filehandle saved any more\n" if
                        not defined $parser->filehandle();

lines_parsed

Takes no arguments. Returns the number of lines last parsed.

    print $parser->lines_parsed, " lines were parsed\n";

Every call to read resets this value automatically before parsing a new file.

has_aborted

Takes no arguments, returns a boolean to indicate if text reading was aborted in the middle.

    print "Aborted\n" if $parser->has_aborted();

get_records

Takes no arguments. Returns an array containing all the records saved by the parser.

    my $i = 0;
    foreach my $record ( $parser->get_records ) {
        $i++;
        print "Record: $i: ", $record, "\n";
    }

pop_record

Takes no arguments and pops the last saved record.

    my $last_rec = $parser->pop_record;
    my $uc_last = uc $last_rec;
    $parser->save_record($uc_last);

last_record

Takes no arguments and returns the last saved record. Leaves the saved records untouched.

    my $last_rec = $parser->last_record;

FOR USE IN SUBCLASS ONLY

Do NOT override these methods. They are valid only within a subclass, inside the user-implementation of methods described under OVERRIDE IN SUBCLASS.

this_line

Takes no arguments, and returns the current line being parsed. For example:

    sub save_record {
        my $self = shift;
        # ...
        do_something($self->this_line);
        # ...
    }

abort_reading

Takes no arguments. Returns 1. To be used only in the derived class to abort read in the middle. See Example 3 under EXAMPLES.

    sub save_record {
        my $self = shift;
        # ...
        $self->abort_reading if some_condition($self->this_line);
        # ...
    }

push_records

Takes a list of records and adds them to the records already saved. This is useful if one needs to implement the parsing of an include-like command in the parsed text format. The example below illustrates this.

    package OneParser;
    use parent 'Text::Parser';

    sub save_record {
        my $self = shift;
        # ...
        # Under some condition:
        my $parser = AnotherParser->new();
        $parser->read($some_file);
        $self->push_records($parser->get_records);
        # ...
    }

OVERRIDE IN SUBCLASS

The following methods should never be called in the main program. They are meant to be overridden (or re-defined) in a subclass.

save_record

This method should be re-defined in a subclass to parse the target text format. To save a record, the re-defined implementation in the derived class must call SUPER::save_record (or super if you're using Moose) with exactly one argument as a record. If no arguments are passed, undef is stored as a record.

For a developer re-defining save_record, in addition to this_line, six additional methods become available if the auto_split attribute is set. These methods are described in greater detail in Text::Parser::AutoSplit, and they are accessible only within save_record.

Note: Developers may store records in any form - string, array reference, hash reference, complex data structure, or an object of some class. The program that reads these records using get_records has to interpret them. So developers should document the records created by their own implementation of save_record.
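
As a minimal sketch of storing structured records, the hypothetical subclass below saves each "key = value" line as a hash reference and skips lines without an equals sign. The package name KeyValueParser, the record layout, and the temporary input file are assumptions made for illustration.

    package KeyValueParser;    # hypothetical name, for illustration only
    use strict;
    use warnings;
    use parent 'Text::Parser';

    sub save_record {
        my ( $self, $line ) = @_;
        my ( $key, $value ) = split /\s*=\s*/, $line, 2;
        return if not defined $value;    # no '=' on this line: skip it
        # Record layout chosen for this sketch: { key => ..., value => ... }
        $self->SUPER::save_record( { key => $key, value => $value } );
    }

    package main;
    use File::Temp qw(tempfile);

    # Write a small sample file so that the sketch is self-contained.
    my ( $fh, $fname ) = tempfile( UNLINK => 1 );
    print {$fh} "name = Text::Parser\n# a comment\nversion = 0.918\n";
    close $fh;

    my $parser = KeyValueParser->new( auto_chomp => 1 );
    $parser->read($fname);
    my @records = $parser->get_records;
    print "$_->{key} -> $_->{value}\n" for @records;

A caller of such a parser needs to know that each record is a hash reference with key and value entries, which is why the record format should be documented.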

FOR MULTI-LINE TEXT PARSING

These methods need to be re-defined only by multiline derived classes, i.e., if the target text format allows wrapping the content of one line into multiple lines. In most cases, you should re-define both methods. As usual, the this_line method may be used while re-defining them.

is_line_continued

This takes a string argument and returns a boolean indicating if the line is continued or not. See Text::Parser::Multiline for more on this.

The return values of the default method provided with this class are:

    multiline_type    |    Return value
    ------------------+---------------------------------
    undef             |         0
    join_last         |    0 for first line, 1 otherwise
    join_next         |         1

join_last_line

This method takes two strings, joins them while removing any continuation characters, and returns the result. The default implementation just concatenates two strings and returns the result without removing anything (not even chomp). See Text::Parser::Multiline for more on this.

EXAMPLES

Example 1 : A simple CSV Parser

We will write a parser for a simple CSV file that reads each line and stores the records as array references. This example is oversimplified, and does not handle embedded newlines.

    package Text::Parser::CSV;
    use parent 'Text::Parser';
    use Text::CSV;

    my $csv;
    sub save_record {
        my ($self, $line) = @_;
        $csv //= Text::CSV->new({ binary => 1, auto_diag => 1});
        $csv->parse($line);
        $self->SUPER::save_record([$csv->fields]);
    }

That's it! Now in main:: you can write something like this:

    use Text::Parser::CSV;
    
    my $csvp = Text::Parser::CSV->new();
    $csvp->read(shift @ARGV);
    foreach my $aref ($csvp->get_records) {
        my (@arr) = @{$aref};
        print "@arr\n";
    }

The above program reads the content of a given CSV file and prints the content out in space-separated form.

Example 2 : Error checking

Note: Read the documentation for Exceptions to learn about creating, throwing, and catching exceptions in Perl 5. All of the methods of creating, throwing, and catching exceptions described in Exceptions are supported.

You can throw exceptions from save_record in your subclass, for example, when you detect a syntax error. The read method will close all filehandles automatically as soon as an exception is thrown. The exception will pass through to the main program unless you catch and handle it in your derived class.

Here is an example showing the use of an exception to detect a syntax error in a file:

    package My::Text::Parser;
    use Exception::Class (
        'My::Text::Parser::SyntaxError' => {
            description => 'syntax error',
            alias => 'throw_syntax_error', 
        },
    );
    
    use parent 'Text::Parser';

    sub save_record {
        my ($self, $line) = @_;
        # _syntax_error() stands for your own check; it is not defined here.
        throw_syntax_error(error => 'syntax error') if _syntax_error($line);
        $self->SUPER::save_record($line);
    }
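
A caller might catch such an exception along the following lines. This is a sketch under assumptions: the package name StrictParser, the made-up rule that a line starting with ! is an error, and the temporary input file are all invented for illustration. Exception::Class->caught is the class method that Exception::Class provides for inspecting $@.

    package StrictParser;    # hypothetical name, for illustration only
    use strict;
    use warnings;
    use Exception::Class (
        'StrictParser::SyntaxError' => { description => 'syntax error' },
    );
    use parent 'Text::Parser';

    sub save_record {
        my ( $self, $line ) = @_;
        # Made-up rule for this sketch: a line starting with '!' is an error.
        StrictParser::SyntaxError->throw( error => "bad line: $line" )
            if $line =~ /^!/;
        $self->SUPER::save_record($line);
    }

    package main;
    use File::Temp qw(tempfile);

    # Write a small sample file so that the sketch is self-contained.
    my ( $fh, $fname ) = tempfile( UNLINK => 1 );
    print {$fh} "good line\n!bad line\nnever reached\n";
    close $fh;

    my $parser = StrictParser->new( auto_chomp => 1 );
    my $caught;
    eval { $parser->read($fname); 1 } or do {
        $caught = Exception::Class->caught('StrictParser::SyntaxError');
        die $@ if not $caught;    # re-throw anything unexpected
    };
    print "Caught: ", $caught->error, "\n" if $caught;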

Example 3 : Aborting without errors

We can also abort parsing a text file without throwing an exception. This is useful when we already have the information we need. For example:

    package SomeParser;
    use Moose;
    extends 'Text::Parser';

    sub BUILDARGS {
        my $pkg = shift;
        return {auto_split => 1};
    }

    sub save_record {
        my ($self, $line) = @_;
        return $self->abort_reading() if $self->field(0) eq '**ABORT';
        return $self->SUPER::save_record($line);
    }

The SomeParser class above saves each line as a record, but aborts reading the rest of the file as soon as it reaches a line with **ABORT as the first word. Suppose this parser is given the following file as input:

    somefile.txt:

    Some text is here.
    More text here.
    **ABORT reading
    This text is not read
    This text is not read
    This text is not read
    This text is not read

You can now write a program as follows:

    use SomeParser;

    my $par = SomeParser->new();
    $par->read('somefile.txt');
    print $par->get_records(), "\n";

The output will be:

    Some text is here.
    More text here.

Example 4 : Multi-line parsing

Some text formats allow users to split a line into several lines with a line continuation character (usually at the end or the beginning of a line).

Trivial line-joiner

Below is a trivial example where all lines are joined into one:

    use strict;
    use warnings;
    use Text::Parser;

    my $join_all = Text::Parser->new(auto_chomp => 1, multiline_type => 'join_last');
    $join_all->read('input.txt');
    print $join_all->get_records(), "\n";

Another trivial example is here.

Continue with character

(Pun intended! ;-))

In the above example, all lines are joined (indiscriminately). But most often text formats have a continuation character that specifies that the line continues to the next line, or that the line is a continuation of the previous line. Here's an example parser that treats the back-slash (\) character as a line-continuation character:

    package MyMultilineParser;
    use parent 'Text::Parser';
    use strict;
    use warnings;

    sub new {
        my $pkg = shift;
        $pkg->SUPER::new(multiline_type => 'join_next');
    }

    sub is_line_continued {
        my $self = shift;
        my $line = shift;
        chomp $line;
        return $line =~ /\\\s*$/;
    }

    sub join_last_line {
        my $self = shift;
        my ($last, $line) = (shift, shift);
        chomp $last;
        $last =~ s/\\\s*$/ /g;
        return $last . $line;
    }

    1;

In your main::

    use MyMultilineParser;
    use strict;
    use warnings;

    my $parser = MyMultilineParser->new();
    $parser->read('multiline.txt');
    print "Read:\n";
    print $parser->get_records(), "\n";

Try with the following input multiline.txt:

    Garbage In.\
    Garbage Out!

When you run the above code with this file, you should get:

    Read:
    Garbage In. Garbage Out!

Simple SPICE line joiner

Some text formats allow a line to indicate that it is continuing from a previous line. For example SPICE has a continuation character (+) on the next line, indicating that the text on that line should be joined with the previous line. Let's show how to build a simple SPICE line-joiner. To build a full-fledged parser you will have to specify the rich and complex grammar for SPICE circuit description.

    package TrivialSpiceJoin;
    use parent 'Text::Parser';

    use constant {
        SPICE_LINE_CONTD => qr/^[+]\s*/,
        SPICE_END_FILE   => qr/^\.end/i,
    };

    sub new {
        my $pkg = shift;
        $pkg->SUPER::new(auto_chomp => 1, multiline_type => 'join_last');
    }

    sub is_line_continued {
        my ( $self, $line ) = @_;
        return 0 if not defined $line;
        return $line =~ SPICE_LINE_CONTD;
    }
    
    sub join_last_line {
        my ( $self, $last, $line ) = ( shift, shift, shift );
        return $last if not defined $line;
        $line =~ s/^[+]\s*/ /;
        return $line if not defined $last;
        return $last . $line;
    }

    sub save_record {
        my ( $self, $line ) = @_;
        return $self->abort_reading() if $line =~ SPICE_END_FILE;
        $self->SUPER::save_record($line);
    }

Try this parser with a SPICE deck with continuation characters and see what you get. Try having errors in the file. You may now write a more elaborate method for save_record above and that could be used to parse a full SPICE file.

SEE ALSO

BUGS

Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

Balaji Ramasubramanian <balajiram@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

CONTRIBUTORS

  • H.Merijn Brand - Tux <h.m.brand@xs4all.nl>

  • Mohammad S Anwar <mohammad.anwar@yahoo.com>