Text::Parser - Simplifies text parsing. Easily extensible to parse any text format.


version 1.000


The following prints the content of the file (named in the first argument) to STDOUT.

    use Text::Parser;

    my $parser = Text::Parser->new();
    print $parser->get_records, "\n";

The above code prints after reading the whole file, which can be slow if you have large fules. This following prints contents immediately.

    my $parser = Text::Parser->new();
    $parser->add_rule(do => 'print', dont_record => 1);
    ($#ARGV > 0) ? $parser->filename(shift) : $parser->filehandle(\*STDIN);
    $parser->read();       # Runs the rule for each line of input file

Also, the third line there allows this program to read from a file name specified on command-line, or STDIN. In effect, this makes this Perl code a good replica of the UNIX cat.

Here is an example with a simple rule that extracts the first error in the logfile and aborts reading further:

    my $parser = Text::Parser->new();
        if => '$1 eq "ERROR:"',
            # $1 is a positional identifier for first 'field' on the line
        do => '$this->abort_reading; return $_;'
            # $this is copy of $parser accessible from within the rule
            # abort_reading() tells parser to stop reading further
            # Returned values are saved as records. Any data structure can be saved.
            # $_ contains the full line as string, including any whitespaces
    # Returns the first line starting with "ERROR:"

    print "Some errors were found:\n" if $parser->get_records();

See this important note about using single quotes instead of double quotes.

Here is an example that parses a table with field separators indicated by | character:

    use Data::Dumper 'Dumper';
    my $table_parser = Text::Parser->new( FS => qr/\s*[|]\s*/ );
        if          => '$this->NF == 0',
        dont_record => 1
        if => '$this->lines_parsed == 1',
        do => '~columns = [$this->fields()];'
        if => '$this->lines_parsed > 1',
        do =>  'my %rec = ();
                foreach my $i (0..$#{~columns}) {
                    my $k = ~columns->[$i];
                    $rec{$k} = $this->field($i);
                return \%rec;',
    print Dumper($table_parser->get_records()), "\n";

In the above example you see the use of a stashed variable named ~columns. Note that the sigil used here is not a Perl sigil, but is converted to native Perl code. In the above case, each record is a hash with fixed number of fields.

More complex file-formats can be read and contents stored in a data-structure or an object. Here is an example:

    use strict;
    use warnings;

    package ComplexFormatParser;
    use Text::Parser::RuleSpec;  ## provides applies_rule + other sugar, imports Moose
    extends 'Text::Parser';

    # This rule ignores all comments
    applies_rule ignore_comments => (
        if          => 'substr($1, 0, 1) eq "#"', 
        dont_record => 1, 

    # An attribute of the parser class. 
    has current_section => (
        is         => 'rw', 
        isa        => 'Str', 
        default    => undef, 

    applies_rule get_header => (
        if          => '$1 eq "SECTION"', 
        do          => '$this->current_section($2);',  # $this : this parser object
        dont_record => 1, 

    # ... More can be done

    package main;
    use ComplexFormatParser;

    my $p = ComplexFormatParser->new();


The need for this class stems from the fact that text parsing is the most common thing that programmers do, and yet there is no lean, simple way to do it in Perl. Most programmers still write boilerplate code with a while loop.

Instead Text::Parser allows programmers to parse text with simple, self-explanatory rules, whose structure is very similar to AWK, but extends beyond the capability of AWK.

Sidenote: Incidentally, AWK is one of the ancestors of Perl! One would have expected Perl to do way better than AWK. But while you can use Perl to do what AWK already does, that is usually limited to one-liners like perl -lane. Even perl -lan is not meant for serious projects. And it seems that some people still prefer AWK to Perl. This is not looking good.


With Text::Parser, a developer can focus on specifying a grammar and then simply read the file. The read method automatically runs each rule collecting records from the text input into an internal array. Finally, get_records can retrieve the records.

Since Text::Parser is a class, a programmer can subclass it to parse very complex file formats. Text::Parser::RuleSpec provides intuitive rule sugar. Use of Moose is encouraged. And data from parsed files can be turned into very complex data-structures or even objects. In this case, you wouldn't need to use get_records.

With Text::Parser programmers have the elegance and simplicity of AWK combined with the power of Perl at their disposal.



Takes optional attributes as in example below. See section ATTRIBUTES for a list of the attributes and their description.

    my $parser = Text::Parser->new();

    my $parser2 = Text::Parser->new( line_wrap_style => 'trailing_backslash' );


The attributes below can be used as options to the new constructor. Each attribute has an accessor with the same name.


Read-write attribute. Takes a boolean value as parameter. Defaults to 0.

    print "Parser will chomp lines automatically\n" if $parser->auto_chomp;


Read-write boolean attribute. Defaults to 0 (false). Indicates if the parser will automatically split every line into fields.

If it is set to a true value, each line will be split into fields, and a set of methods become accessible to save_record or the rules.


Read-write attribute. The values this can take are shown under the new constructor also. Defaults to 'n' (neither side spaces will be trimmed).

    $parser->auto_trim('l');       # 'l' (left), 'r' (right), 'b' (both), 'n' (neither) (Default)


Read-write attribute which can be set to a custom subroutine that trims each line before applying any rules or saving any records. The function is expected to take a single argument containing the complete un-trimmed line, and is expected to return a manipulated line.

    sub _cust_trimmer {
        my $line = shift;
        chomp $line;
        return $line;


Note: If you set this attribute, you are entirely responsible for the trimming. Poorly written routines could causing the auto_split operation to misbehave.

By default it is undefined.


Read-write attribute that can be used to specify the field separator to be used by the auto_split feature. It must be a regular expression reference enclosed in the qr function, like qr/\s+|[,]/ which will split across either spaces or commas. The default value for this attribute is qr/\s+/.

The name for this attribute comes from the built-in FS variable in the popular GNU Awk program. The ability to use a regular expression is an upgrade from AWK.

    $parser->FS( qr/\s+\(*|\s*\)/ );

FS can be changed from within a rule. Changes made even within a rule would take effect on the immediately next line read.


This can be used to set the indentation character or string. By default it is a single space . But you may want to set it to be a tab (\t) or perhaps some other character like a hyphen (-) or even a string ( ->). This attribute is used only if track_indentation is set.


Read-write attribute used as a quick way to select from commonly known line-wrapping styles. If the target text format allows line-wrapping this attribute allows the programmer to write rules as if they were on a single line.


Allowed values are:

    trailing_backslash - very common style ending lines with \
                         and continuing on the next line

    spice              - used for SPICE syntax, where the (+)
                         + symbol continues content of last line

    just_next_line     - used in simple text files written to be
                         humanly-readable. New paragraphs start
                         on a new line after a blank line.

    slurp              - used to "slurp" the whole file into
                         a single line.

    custom             - user-defined style. User must specify
                         value of multiline_type and define
                         two custom unwrap routines using the
                         custom_line_unwrap_routines method
                         when custom is chosen.

When line_wrap_style is set to one of these values, the value of multiline_type is automatically set to an appropriate value. Read more about handling the common line-wrapping styles.


Read-write attribute used mainly if the programmer wishes to specify custom line-unwrapping methods. By default, this attribute is undef, i.e., the target text format will not have wrapped lines.


    my $mult = $parser->multiline_type;
    print "Parser is a multi-line parser of type: $mult" if defined $mult;

Allowed values for multiline_type are described below, but it can also be set back to undef.

  • If the target format allows line-wrapping to the next line, set multiline_type to join_next.

  • If the target format allows line-wrapping from the last line, set multiline_type to join_last.

To know more about how to use this, read about specifying custom line-unwrap routines.


This boolean attribute enables tracking of the number of indentation characters are there at the beginning of each line. In some text formats, this is a very important information that can indicate the depth of some data. By default, this is false. When set to a true value, you can get the number of indentation characters on a given line with the this_indent method.


Now you can use this_indent method in the rules:

    $parser->add_rule(if => '$this->this_indent > 0', do => '~num_indented ++;')


These are meant to be called from the ::main program or within subclasses.


Takes a hash as input. The keys of this hash must be the attributes of the Text::Parser::Rule class constructor and the values should also meet the requirements of that constructor.

    $parser->add_rule(do => '', dont_record => 1);                 # Empty rule: does nothing
    $parser->add_rule(if => 'm/li/, do => 'print', dont_record);   # Prints lines with 'li'
    $parser->add_rule( do => 'uc($3)' );                           # Saves records of upper-cased third elements

Calling this method without any arguments will throw an exception. The method internally sets the auto_split attribute.


Takes no arguments, returns nothing. Clears the rules that were added to the object.


This is useful to be able to re-use the parser after a read call, to parse another text with another set of rules. The clear_rules method does clear even the rules set up by BEGIN_rule and END_rule.


Takes a hash input like add_rule, but if and continue_to_next keys will be ignored.

    $parser->BEGIN_rule(do => '~count = 0;');
  • Since any if key is ignored, the do key is required. Multiple calls to BEGIN_rule will append to the previous calls; meaning, the actions of previous calls will be included.

  • The BEGIN_rule is mainly used to initialize some variables.

  • By default dont_record is set true. User can change this and set dont_record as false, thus forcing a record to be saved even before reading the first line of text.


Takes a hash input like add_rule, but if and continue_to_next keys will be ignored. Similar to BEGIN_rule, but the actions in the END_rule will be executed at the end of the read method.

    $parser->END_rule(do => 'print ~count, "\n";');
  • Since any if key is ignored, the do key is required. Multiple calls to END_rule will append to the previous calls; meaning, the actions of previous calls will be included.

  • The END_rule is mainly used to do final processing of collected records.

  • By default dont_record is set true. User can change this and set dont_record as false, thus forcing a record to be saved after the end rule is processed.


These methods can be used only inside rules, or methods of a subclass. Some of these methods are available only when auto_split is on. They are listed as follows:

  • NF - number of fields on this line

  • fields - all the fields as an array of strings ; trailing \n removed

  • field - access individual elements of the array above ; negative arguments count from back

  • field_range - array of fields in the given range of indices ; negative arguments allowed

  • join_range - join the fields in the range of indices ; negative arguments allowed

  • find_field - returns field for which a given subroutine is true ; each field is passed to the subroutine in $_

  • find_field_index - similar to above, except it returns the index of the field instead of the field itself

  • splice_fields - like the native Perl splice

Other methods described below are also to be used only inside a rule, or inside methods called by the rules.


Takes no arguments. Returns 1. Aborts reading any more lines, and read method exits gracefully as if nothing unusual happened.

        do          => '$this->abort_reading;',
        if          => '$1 eq "EOF"', 
        dont_record => 1, 


Takes no arguments, and returns the number of indentation characters found at the front of the current line. This can be called from within a rule:

    $parser->add_rule( if => '$this->this_indent > 0', );


Takes no arguments, and returns the current line being parsed. For example:

        if => 'length($this->this_line) > 256', 
    ## Saves all lines longer than 256 characters

Inside rules, instead of using this method, one may also use $_:

        if => 'length($_) > 256', 



Takes an optional string argument containing the name of a file. Returns the name of the file that was last opened if any. Returns undef if no file has been opened.

    print "Last read ", $parser->filename, "\n";

The value stored is "persistent" - meaning that the method remembers the last file that was read.

    $parser->read(shift @ARGV);
    print $parser->filename(), ":\n",
          "=" x (length($parser->filename())+1),

A read call with a filehandle, will clear the last file name.

    print "Last file name is lost\n" if not defined $parser->filename();


Takes an optional argument, that is a filehandle GLOB (such as \*STDIN) or an object of the FileHandle class. Returns the filehandle last saved, or undef if none was saved.

    my $fh = $parser->filehandle();

Like filename, filehandle is also "persistent". Its old value is lost when either filename is set, or read is called with a filename.

    my $lastfh = $parser->filehandle();          # Will return glob of STDOUT


Takes a single optional argument that can be either a string containing the name of the file, or a filehandle reference (a GLOB) like \*STDIN or an object of the FileHandle class.

    $parser->read($filename);         # Read the file
    $parser->read(\*STDIN);           # Read the filehandle

The above could also be done in two steps if the developer so chooses.

    $parser->read();                  # equiv: $parser->read($filename)

    $parser->read();                  # equiv: $parser->read(\*STDIN)

The method returns once all records have been read, or if an exception is thrown, or if reading has been aborted with the abort_reading method.

Any close operation will be handled (even if any exception is thrown), as long as read is called with a file name parameter - not if you call with a file handle or GLOB parameter.

    $parser->read('myfile.txt');      # Will close file automatically

    open MYFH, "<myfile.txt" or die "Can't open file myfile.txt at ";
    $parser->read(\*MYFH);            # Will not close MYFH
    close MYFH;



Takes no arguments. Returns an array containing all the records saved by the parser.

    foreach my $record ( $parser->get_records ) {
        print "Record: $i: ", $record, "\n";


Takes no arguments and returns the last saved record. Leaves the saved records untouched.

    my $last_rec = $parser->last_record;


Takes no arguments and pops the last saved record.

    my $last_rec = $parser->pop_record;


Takes an array as input, and stores each element as a separate record. Returns the number of elements in the new array.

    $parser->push_records(qw(insert these as separate records));


Stashed variables can be data structures or simple scalar variables stored as elements in the parser object. Hence they are accessible across different rules. Stashed variables start with a tilde (~). So you could set up rules like these:

    $parser->BEGIN_rule( do => '~count=0;' );
    $parser->add_rule( if => '$1 eq "SECTION"', do => '~count++;' );

In the above rule ~count is a stashed variable. Internally this is just a hash element with key named count. After the read call is over, this variable can be accessed.

    print "Found ", $parser->stashed('count'), " sections in file.\n";

Stashed variables that are created entirely within the rules are forgotten at the beginning of the next read call. This means, you can read another text file and don't have to bother to clear out the stashed variable ~count.

    print "Found ", $parser->stashed('count'), " sections in file.\n";

In contrast, stashed variables created by calling prestash continue to persist for subsequent calls of read, unless an explicit call to forget names these pre-stashed variables.

    $parser->prestash( max_err => 100 );
    $parser->BEGIN_rule( do => '~err_count = 0;' );
        if               => '$1 eq "ERROR:" && ~err_count < ~max_err',
        do               => '~err_count++;', 
        continue_to_next => 1, 
        if => '$1 eq "ERROR:" && ~err_count == ~max_err',
        do => '$this->abort_reading;', 
    print "Top 100 errors:\n", $parser->get_records, "\n";
    $parser->read('another.log');         # max_err is still set to 100, but err_count is forgotten and reset to 0 by the BEGIN_rule
    print "Top 100 errors:\n", $parser->get_records, "\n";


Takes an optional list of string arguments which must be the names of stashed variables. This method forgets those stashed variables for ever. So be sure you really intend to do this. In list context, this method returns the values of the variables whose names were passed to the method. In scalar context, it returns the last value of the last stashed variable passed.

    my $pop_and_forget_me = $parser->forget('forget_me_totally', 'pop_and_forget_me');

Inside rules, you could simply delete the stashed variable like this:

    $parser->add_rule( do => 'delete ~forget_me;' );

The above delete statement works because the stashed variable ~forget_me is just a hash key named forget_me internally. Using this on pre-stashed variables, will only temporarily delete the variable. It will be present in subsequent calls to read. If you want to delete it completely call forget with the pre-stashed variable name as an argument.

When no arguments are passed, it clears all stashed variables (not pre-stashed).


Note that when forget is called with no arguments, pre-stashed variables are not deleted and are still accessible in subsequent calls to read. To forget a pre-stashed variable, it needs to be explicitly named in a call to forget. Then it is forgotten.

A call to forget method is done without any arguments, right before read starts reading a new text input. That is how we can reset the values of stashed variables, but still retain pre-stashed variables.


Takes no arguments and returns a true value if the stash of variables is empty (i.e., no stashed variables are present). If not, it returns a boolean false.

    if ( not $parser->has_empty_stash ) {
        my $myvar = $parser->stashed('myvar');
        print "myvar = $myvar\n";


Takes a single string argument and returns a boolean indicating if there is a stashed variable with that name or not:

    if ( $parser->has_stashed('stashed_var') ) {
        print "Here is what stashed_var contains: ", $parser->stashed('stashed_var');

Inside rules you could check this with the exists keyword:

    $parser->add_rule( if => 'exists ~stashed_var' );


Takes an even number of arguments, or a hash, with variable name and value as pairs. This is useful to preset some stash variables before read is called so that the rules have some variables accessible inside them. The main difference between pre-stashed variables created via prestash and those created in the rules or using stashed is that the pre-stashed ones are static.

    $parser->prestash(pattern => 'string');
    $parser->add_rule( if => 'my $patt = ~pattern; m/$patt/;' );

You may change the value of a prestashed variable inside any of the rules.


Takes an optional list of string arguments each with the name of a stashed variable you want to query, i.e., get the value of. In list context, it returns their values in the same order as the queried variables, and in scalar context it returns the value of the last variable queried.

    my (%var_vals) = $parser->stashed;
    my (@vars)     = $parser->stashed( qw(first second third) );
    my $third      = $parser->stashed( qw(first second third) ); # returns value of last variable listed
    my $myvar      = $parser->stashed('myvar');

Or you could do this:

    use Data::Dumper 'Dumper';

    if ( $parser->has_empty_stash ) {
        print "Nothing on my stash\n";
    } else {
        my %stash = $parser->stashed;
        print Dumper(\%stash), "\n";



Takes no arguments. Returns the number of lines last parsed. Every call to read, causes the value to be auto-reset.

    print $parser->lines_parsed, " lines were parsed\n";


Takes no arguments, returns a boolean to indicate if text reading was aborted in the middle.

    print "Aborted\n" if $parser->has_aborted();


This method should be used only when the line-wrapping supported by the text format is not already among the known line-wrapping styles supported.

Takes a hash argument with required keys is_wrapped and unwrap_routine. Used in setting up custom line-unwrapping routines.

Here is an example of setting custom line-unwrapping routines:

        is_wrapped => sub {     # A method that detects if this line is wrapped or not
            my ($self, $this_line) = @_;
            $this_line =~ /^[~]/;
        unwrap_routine => sub { # A method to unwrap the line by joining it with the last line
            my ($self, $last_line, $this_line) = @_;
            chomp $last_line;
            $last_line =~ s/\s*$//g;
            $this_line =~ s/^[~]\s*//g;
            "$last_line $this_line";

Now you can parse a file with the following content:

    This is a long line that is wrapped around with a custom
    ~ character - the tilde. It is unusual, but hey, we're
    ~ showing an example.

When $parser gets to read this, these three lines get unwrapped and processed by the rules, as if it were a single line.

Text::Parser::Multiline shows another example with join_next type.


The following methods should never be called in the ::main program. They may be overridden (or re-defined) in a subclass.

Starting version 0.925, users should never need to override any of these methods to make their own parser.


The default implementation takes a single argument, runs any rules, and saves the returned value as a record in an internal array. If nothing is returned from the rule, undef is stored as a record.

Note: Starting 0.925 version of Text::Parser it is not required to override this method in your derived class. In most cases, you should use the rules.

Importnant Note: Starting version 1.0 of Text::Parser this method will be deprecated to improve performance. So avoid inheriting this method.


The default implementation of this routine:

    multiline_type    |    Return value
    undef             |         0
    join_last         |    0 for first line, 1 otherwise
    join_next         |         1

In earlier versions of Text::Parser you had no way but to subclass Text::Parser to change the routine that detects if a line is wrapped. Now you can instead select from a list of known line_wrap_styles, or even set custom methods for this.


The default implementation of this routine takes two string arguments, joins them without any chomp or any other operation, and returns that result.

In earlier versions of Text::Parser you had no way but to subclass Text::Parser to select a line-unwrapping routine. Now you can instead select from a list of known line_wrap_styles, or even set custom methods for this.


Future versions are expected to include features to:

  • read and parse from a buffer

  • automatically uncompress input

  • suggestions welcome ...

Contributions and suggestions are welcome and properly acknowledged.


Different text formats sometimes allow line-wrapping to make their content more human-readable. Handling this can be rather complicated if you use native Perl, but extremely easy with Text::Parser.

Common line-wrapping styles

Text::Parser supports a range of commonly-used line-unwrapping routines which can be selected using the line_wrap_style attribute. The attribute automatically sets up the parser to handle line-unwrapping for that specific text format.

    # Now when read runs the rules, all the back-slash
    # line-wrapped lines are auto-unwrapped to a single
    # line, and rules are applied on that single line

When read reads each line of text, it looks for any trailing backslash and unwraps the line. The next line may have a trailing back-slash too, and that too is unwrapped. Once the fully-unwrapped line has been identified, the rules are run on that unwrapped line, as if the file had no line-wrapping at all. So say the content of a line is like this:

    This is a long line wrapped into multiple lines \
    with a back-slash character. This is a very common \
    way to wrap long lines. In general, line-wrapping \
    can be much easier on the reader's eyes.

When read runs any rules in $parser, the text above appears as a single line to the rules.

Specifying custom line-unwrap routines

I have included the common types of line-wrapping styles known to me. But obviously there can be more. To specify a custom line-unwrapping style follow these steps:

  • Set the multiline_type attribute appropriately. If you do not set this, your custom unwrapping routines won't have any effect.

  • Call custom_line_unwrap_routines method. If you forget to call this method, or if you don't provide appropriate arguments, then an exception is thrown.

Here is an example with join_last value for multiline_type. And here is an example using join_next. You'll notice that in both examples, you need to specify both routines. In fact, if you don't

Line-unwrapping in a subclass

You may subclass Text::Paser to parse your specific text format. And that format may support some line-wrapping. To handle the known common line-wrapping styles, set a default value for line_wrap_style. For example:

  • Set a default value for line_wrap_style. For example, the following uses one of the supported common line-unwrap methods. has '+line_wrap_style' => ( default => 'spice', );

* Setup custom line-unwrap routines with unwraps_lines from Text::Parser::RuleSpec.

    use Text::Parser::RuleSpec;
    extends 'Text::Parser';

    has '+line_wrap_style' => ( default => 'slurp', is => 'ro');
    has '+multiline_type'  => ( is => 'ro' );

Of course, you don't have to make them read-only.

To setup custom line-unwrapping routines in a subclass, you can use the unwraps_lines_using syntax sugar from Text::Parser::RuleSpec. For example:

    package MyParser;

    use Text::Parser::RuleSpec;
    extends 'Text::Parser';

    has '+multiline_type' => (
        default => 'join_next',
        is => 'ro', 

        is_wrapped     => \&_my_is_wrapped_routine, 
        unwrap_routine => \&_my_unwrap_routine, 



Please report any bugs or feature requests on the bugtracker website

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.


Balaji Ramasubramanian <>


This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.


  • H.Merijn Brand - Tux <>

  • Mohammad S Anwar <>