Ben Bullock

NAME

C::Tokenize - reduce a C file to a series of tokens

SYNOPSIS

    # Remove all C preprocessor instructions from a C program:
    my $c = <<EOF;
    #define X Y
    #ifdef X
    int X;
    #endif
    EOF
    use C::Tokenize '$cpp_re';
    $c =~ s/$cpp_re//g;
    print "$c\n";
    

produces output

    int X;
    

(This example is included as synopsis-cpp.pl in the distribution.)

    # Print all the comments in a C program:
    my $c = <<EOF;
    /* This is the main program. */
    int main ()
    {
        int i;
        /* Increment i by 1. */
        i++;
        // Now exit with zero status.
        return 0;
    }
    EOF
    use C::Tokenize '$comment_re';
    while ($c =~ /($comment_re)/g) {
        print "$1\n";
    }
    

produces output

    /* This is the main program. */
    /* Increment i by 1. */
    // Now exit with zero status.
    

(This example is included as synopsis-comment.pl in the distribution.)

VERSION

This documents version 0.11 of C::Tokenize corresponding to git commit 98692825df8419440be4fa40fd09df0690472299 released on Tue Sep 6 10:43:09 2016 +0900.

DESCRIPTION

This module provides a tokenizer, "tokenize", which breaks C source code into its smallest meaningful components, and the regular expressions which match each of these components. For example, the module supplies a regular expression "$comment_re" which matches a C comment.

It also supplies some extra regular expressions, for example "$include_local" for local include statements and "$cvar_re" for C variables, as well as an extra function, "decomment", for removing the comment marks from traditional C comments.

REGULAR EXPRESSIONS

The following regular expressions can be imported from this module using, for example,

    use C::Tokenize '$cpp_re'

to import $cpp_re.

Most of the following regular expressions do not do any capturing, except where noted. If you want to capture, add your own parentheses around the regular expression.

$trad_comment_re

Match /* */ comments.

$cxx_comment_re

Match // comments.

$comment_re

Match both /* */ and // comments.

$cpp_re

Match a C preprocessor instruction.

$char_const_re

Match a character constant, such as 'a' or '\-'.

$operator_re

Match an operator such as + or --.

$number_re

Match a number, either integer, floating point, or hexadecimal. Octal numbers are not matched yet.

$word_re

Match a word, such as a function or variable name or a keyword of the language.

$grammar_re

Match other syntactic characters such as { or [.

$single_string_re

Match a single C string constant such as "this".

$string_re

Match a full-blown C string constant, including compound strings "like" "this".

$reserved_re

Match a C reserved word like auto or goto.
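
As an illustration, "$word_re" and "$reserved_re" can be combined to separate reserved words from other identifiers. This is a minimal sketch, not an example from the distribution; the C fragment and the variable names are made up for the purpose.

```perl
use strict;
use warnings;
use C::Tokenize qw/$word_re $reserved_re/;

# A made-up C fragment for illustration.
my $c = 'auto int count; goto done;';
my (@reserved, @other);
# $word_re does not capture, so add parentheses around it.
while ($c =~ /($word_re)/g) {
    my $word = $1;
    # Anchor $reserved_re so that the whole word has to be reserved.
    if ($word =~ /^$reserved_re$/) {
        push @reserved, $word;
    }
    else {
        push @other, $word;
    }
}
print "Reserved: @reserved\n";
print "Other: @other\n";
```

The anchors ^ and $ are needed because an identifier such as "automatic" begins with a reserved word without being one.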

$include_local

Match an include statement which uses double quotes, like #include "some.c".

This captures the entire statement in $1 and the file name in $2.
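
The captures can be used to list the local header files which a C file includes. The following is a minimal sketch with made-up file names. It matches one line at a time, which is a choice made here so that the example does not depend on how the regular expression treats line boundaries within a longer string.

```perl
use strict;
use warnings;
use C::Tokenize '$include_local';

# Made-up C source for illustration.
my $c = <<'EOF';
#include <stdio.h>
#include "mylib.h"
#include "util/strings.h"
EOF
my @files;
for my $line (split /\n/, $c) {
    if ($line =~ /$include_local/) {
        # $1 is the whole statement, $2 is the file name.
        push @files, $2;
    }
}
print "$_\n" for @files;
```

The system header in angle brackets is skipped, since "$include_local" only matches the double-quoted form.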

$cvar_re

This matches a C variable, for example anything which may be an lvalue or a function argument.

    use C::Tokenize '$cvar_re';
    my $c = 'func (x->y, & z, ** a, & q);';
    while ($c =~ /[,\(]\s*($cvar_re)/g) {
        print "$1 is a C variable.\n";
    }

produces output

    x->y is a C variable.
    & z is a C variable.
    ** a is a C variable.
    & q is a C variable.

(This example is included as cvar.pl in the distribution.)

VARIABLES

@fields

@fields contains a list of all the fields which are extracted by "tokenize".

FUNCTIONS

decomment

    my $out = decomment ('/* comment */');
    # $out = " comment ";

Remove the traditional C comment marks /* and */ from the beginning and end of a string, leaving only the comment contents. The string has to begin and end with comment marks.
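
Since "decomment" needs a string which begins and ends with comment marks, it pairs naturally with "$trad_comment_re". The following is a minimal sketch which collects the text of each traditional comment; the C fragment is made up, and trimming the surrounding whitespace is an extra step added here, not something "decomment" does.

```perl
use strict;
use warnings;
use C::Tokenize qw/$trad_comment_re decomment/;

# A made-up C fragment for illustration.
my $c = '/* first */ int x; /* second */';
my @texts;
while ($c =~ /($trad_comment_re)/g) {
    # $1 begins with /* and ends with */, as decomment requires.
    my $text = decomment ($1);
    # Trim the whitespace which decomment leaves in place.
    $text =~ s/^\s+|\s+$//g;
    push @texts, $text;
}
print "$_\n" for @texts;
```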

tokenize

    my $tokens = tokenize ($file);

Convert $file into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C file. Each token contains the following keys:

leading

Any whitespace which comes before the token (called "leading whitespace").

type

The type of the token, which may be

comment

A comment, like

    /* This */

or

    // this.

cpp

A C preprocessor instruction like

    #define THIS 1

or

    #include "That.h".

char_const

A character constant, like '\0' or 'a'.

grammar

A piece of C "grammar", like { or ] or ->.

number

A number, such as 42.

word

A word, which may be a variable name or a function name.

string

A string, like "this", or even "like" "this".

reserved

A C reserved word, like auto or goto.

All of the fields which may be captured are available in the variable "@fields" which can be exported from the module:

    use C::Tokenize '@fields';

$name

The value of the token, stored under a key named after the token's type. For example, if $token->{type} equals 'comment', then the text of the comment is in $token->{comment}.

    if ($token->{name} eq 'string') {
        my $c_string = $token->{string};
    }

line

The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number refers to the final line.
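
The keys described above can be used as in the following minimal sketch, which counts the tokens of each type and prints the comments with their line numbers. It rests on two assumptions which the text above does not spell out: that "tokenize" is given the C source text itself rather than a file name, and that the type of each token is stored under the key "type". The C fragment is made up for illustration.

```perl
use strict;
use warnings;
use C::Tokenize qw/tokenize @fields/;

# Made-up C source for illustration.
my $c = <<'EOF';
/* Add one. */
int add_one (int x)
{
    return x + 1;
}
EOF

# Assumption: tokenize takes the source text, not a file name.
my $tokens = tokenize ($c);
my %count;
for my $token (@$tokens) {
    # Assumption: the type of the token is under the key "type".
    my $type = $token->{type};
    $count{$type}++;
    if ($type eq 'comment') {
        # The value of the token is under the key named after its type.
        print "Line $token->{line}: $token->{comment}\n";
    }
}
# Report the number of tokens of each possible type.
for my $field (@fields) {
    print "$field: ", $count{$field} || 0, "\n";
}
```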

EXPORTS

    use C::Tokenize ':all';

exports all the regular expressions and functions from the module.

SEE ALSO

BUGS

Octal not parsed

It does not parse octal expressions.

No trigraphs

No handling of trigraphs.

Requires Perl 5.10

This module uses named captures in regular expressions, so it requires Perl 5.10 or later.

No line directives

The line numbers provided by "tokenize" do not respect C line directives.

Insufficient tests

The module has been used somewhat, but the included tests do not exercise many of the features of C.

AUTHOR

Ben Bullock, <bkb@cpan.org>

Request

If you'd like to see this module continued, let me know that you're using it. For example, send an email, write a bug report, star the project's github repository, add a patch, add a ++ on Metacpan.org, or write a rating at CPAN ratings. It really does make a difference. Thanks.

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2012-2016 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.