Ben Bullock


C::Tokenize - reduce a C file to a series of tokens


    # Remove all C preprocessor instructions from a C program:
    use C::Tokenize '$cpp_re';
    $c =~ s/$cpp_re//g;

    # Print all the comments in a C program:
    use C::Tokenize '$comment_re';
    while ($c =~ /($comment_re)/g) {
        print "$1\n";
    }


This module provides a tokenizer which breaks C source code into its smallest meaningful components, and the regular expressions which match each of these components. For example, the module supplies a regular expression "$comment_re" which matches a C comment.


The following regular expressions can be imported from this module using, for example,

    use C::Tokenize '$cpp_re';

to import $cpp_re.

None of the following regular expressions does any capturing. If you want to capture, add your own parentheses around the regular expression.
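As a minimal sketch of adding your own capturing parentheses, the following collects every comment in a string using "$comment_re" in list context. The input text is a made-up example, and the trimming step is a precaution added here, not something the module requires.

```perl
use strict;
use warnings;
use C::Tokenize '$comment_re';

# Hypothetical input; any C source text would do.
my $c = "int x; /* one */\nint y; // two\n";

# The parentheses around $comment_re are ours: the regex itself
# does not capture anything.
my @comments = $c =~ /($comment_re)/g;

for my $comment (@comments) {
    # Trim trailing whitespace ourselves, in case a match includes
    # the end of its line.
    $comment =~ s/\s+\z//;
    print "Found: $comment\n";
}
```

In list context with the /g flag, the match returns all the captured substrings at once, so no explicit loop is needed to gather them.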


Match /* */ comments.


Match // comments.


Match both /* */ and // comments.


Match a C preprocessor instruction.


Match a character constant, such as 'a' or '\-'.


Match an operator such as + or --.


Match a number, either integer, floating point, or hexadecimal. Does not do octal yet.


Match a word, such as a function or variable name or a keyword of the language.


Match other syntactic characters such as { or [.


Match a single C string constant such as "this".


Match a full-blown C string constant, including compound strings "like" "this".


Match a C reserved word like auto or goto.


Match an include statement which uses double quotes, like #include "some.c".



@fields contains a list of all the fields which are extracted by "tokenize".



    my $out = decomment ('/* comment */');
    # $out = " comment ";

Remove the traditional C comment marks /* and */ from the beginning and end of a string, leaving only the comment contents. The string has to begin and end with comment marks.
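One possible way to combine "decomment" with "$comment_re" is sketched below: it finds each comment, skips anything that does not begin with /* and end with */, and keeps only the contents. The input string is hypothetical, and the function is called by its fully qualified name so that the sketch does not depend on how it is exported.

```perl
use strict;
use warnings;
use C::Tokenize '$comment_re';

# Hypothetical input containing two traditional comments.
my $c = "/* first */ int x; /* second */\n";

my @contents;
while ($c =~ /($comment_re)/g) {
    my $comment = $1;
    # decomment expects the string to begin with /* and end with */,
    # so skip anything else (for example, // comments).
    next unless $comment =~ m{\A/\*.*\*/\z}s;
    push @contents, C::Tokenize::decomment ($comment);
}

print "[$_]\n" for @contents;
```

The guard before the call reflects the requirement that the string has to begin and end with comment marks.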


    my $tokens = tokenize ($file);

Convert $file into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C file. Each token contains the following keys:


Any whitespace which comes before the token (called "leading whitespace").


The type of the token, which may be


A comment, like

    /* This */

or

    // this.

A C preprocessor instruction like

    #define THIS 1

or

    #include "That.h".

A character constant, like '\0' or 'a'.


A piece of C "grammar", like { or ] or ->.


A number such as 42.


A word, which may be a variable name or a function.


A string, like "this", or even "like" "this".


A C reserved word, like auto or goto.

All of the fields which may be captured are available in the variable "@fields" which can be exported from the module:

    use C::Tokenize '@fields';

The value of the type. For example, if $token->{name} equals 'comment', then the value of the type is in $token->{comment}.

    if ($token->{name} eq 'string') {
        my $c_string = $token->{string};
    }

The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number refers to the final line.
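The keys described above can be put together in a short walk over the tokens. This is a sketch under two assumptions: that "tokenize" takes the contents of the file as a string rather than a file name, and that each token carries exactly one key drawn from "@fields". Rather than reading the type key directly, it looks up which field is defined. The input string is hypothetical.

```perl
use strict;
use warnings;
use C::Tokenize '@fields';

# Hypothetical input, passed as text (an assumption about tokenize).
my $c = "int x = 42; /* note */\n";

# Called by its fully qualified name so that the sketch does not
# depend on how the function is exported.
my $tokens = C::Tokenize::tokenize ($c);

for my $token (@$tokens) {
    # Find which of the extractable fields this token holds.
    my ($kind) = grep { defined $token->{$_} } @fields;
    next unless defined $kind;
    printf "line %d: %-12s %s\n", $token->{line}, $kind, $token->{$kind};
}
```

Looking the field up through "@fields" keeps the loop independent of the exact set of token types, at the cost of a small search for each token.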


    use C::Tokenize ':all';

exports all the regular expressions from the module.



Octal not parsed

It does not parse octal expressions.

No trigraphs

No handling of trigraphs.

Requires Perl 5.10

This module uses named captures in regular expressions, so it requires Perl 5.10 or later.

No line directives

The line numbers provided by "tokenize" do not respect C line directives.

Insufficient tests

The module has been used somewhat, but the included tests do not exercise many of the features of C.


Ben Bullock


If you'd like to see this module continued, let me know that you're using it. For example, send an email, write a bug report, star the project's github repository, add a patch, add a ++ on, or write a rating at CPAN ratings. It really does make a difference. Thanks.


This package and associated files are copyright (C) -2015 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.


This defines the terminology used in this document.

Convenience function

In this document, a "convenience function" indicates a function which solves some of the problems, some of the time, for some of the people, but which may not be good enough for all envisaged uses. A convenience function is an 80/20 solution: something which solves (about) 80% of the problems with 20% of the effort. It does the obvious things, but may not do everything you might want; it is a time-saver for the most basic use cases.


In this document, the section BUGS describes possible deficiencies, problems, and workarounds with the module. It's not a guide to bug reporting, or even a list of actual bugs. The name "BUGS" is the traditional name for this sort of section in a Unix manual page.