NAME

C::Tokenize - reduce a C file to a series of tokens

SYNOPSIS

# Remove all C preprocessor instructions from a C program:
my $c = <<EOF;
#define X Y
#ifdef X
int X;
#endif
EOF
use C::Tokenize '$cpp_re';
$c =~ s/$cpp_re//g;
print "$c\n";

produces output

int X;

(This example is included as synopsis-cpp.pl in the distribution.)

# Print all the comments in a C program:
my $c = <<EOF;
/* This is the main program. */
int main ()
{
    int i;
    /* Increment i by 1. */
    i++;
    // Now exit with zero status.
    return 0;
}
EOF
use C::Tokenize '$comment_re';
while ($c =~ /($comment_re)/g) {
    print "$1\n";
}

produces output

/* This is the main program. */
/* Increment i by 1. */
// Now exit with zero status.

(This example is included as synopsis-comment.pl in the distribution.)

VERSION

This documents version 0.19 of C::Tokenize corresponding to git commit 98c04e11199496e4e99e4c6e3eb97b4efaf8636b released on Tue Aug 5 18:14:49 2025 +0900.

DESCRIPTION

This module provides a tokenizer, "tokenize", which breaks C source code into its smallest meaningful components, and the regexes which match each of these components. For example, "$comment_re" matches a C comment.

In addition to the components of C, it supplies regexes for local include statements, "$include_local", and C variables, "$cvar_re". It also provides extra functions: "decomment", which removes the comment marks from a traditional C comment, and "strip_comments", which removes all comments from a C program.

REGULAR EXPRESSIONS

The following regular expressions can be imported from this module using, for example,

use C::Tokenize '$cpp_re'

to import $cpp_re.

The following regular expressions do not capture, except where noted. To capture, add your own parentheses around the regular expression.

Comments

$trad_comment_re: Match /* */ comments.
$cxx_comment_re: Match // comments.
$comment_re: Match both /* */ and // comments.

See also "decomment" for converting a comment to a string, and "strip_comments" for removing all comments from a program.

Preprocessor instructions

$cpp_re

Match all C preprocessor instructions, such as #define, #include, #endif, and so on. A multiline preprocessor instruction is matched as one piece.

$include_local

Match an include statement which uses double quotes, like #include "some.c".

This captures the entire statement in $1 and the file name in $2.

This was added in version 0.10 of C::Tokenize.
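As a minimal sketch of the behaviour described above, the following lists only the double-quoted includes in a piece of C text. The sample C text and the array name @local are illustrative assumptions, not part of the distribution.

```perl
use strict;
use warnings;
use C::Tokenize '$include_local';

# Sample input: one system include and one local include.
my $c = <<'EOF';
#include <stdio.h>
#include "local.h"
EOF

# Collect the file names captured in $2 by $include_local.
my @local;
while ($c =~ /$include_local/g) {
    push @local, $2;
}
print "Local include: $_\n" for @local;
```

Only the statement using double quotes should be reported, since the angle-bracket form is matched by "$include" but not by "$include_local".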

$include

Match any include statement, like #include <stdio.h>.

This captures the entire statement in $1 and the file name in $2.

use C::Tokenize '$include';
my $c = <<EOF;
#include <this.h>
#include "that.h"
EOF
while ($c =~ /$include/g) {
    print "Include statement $1 includes file $2.\n";
}

produces output

Include statement #include <this.h> includes file this.h.
Include statement #include "that.h" includes file that.h.

(This example is included as includes.pl in the distribution.)

This was added in version 0.12 of C::Tokenize.

Values

$octal_re: Match an octal number, which is a number consisting of the digits 0 to 7 only which begins with a leading zero.
$hex_re: Match a hexadecimal number, a number with digits 0 to 9 and letters A to F, case insensitive, with a leading 0x or 0X and an optional trailing L or l for long.
$decimal_re: Match a decimal number, either integer or floating point.
$number_re: Match any number, either integer, floating point, hexadecimal, or octal.
$char_const_re: Match a character constant, such as 'a' or '\-'.
$single_string_re: Match a single C string constant such as "this".
$string_re: Match a full-blown C string constant, including compound strings "like" "this".
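
Because these regexes are not anchored, a whole-string check needs ^ and $ added around them. The following is a sketch of classifying number strings that way, trying the most specific forms first. The helper name classify is hypothetical, and the order of the checks is an assumption made so that a leading-zero number is reported as octal rather than decimal.

```perl
use strict;
use warnings;
use C::Tokenize qw/$hex_re $octal_re $decimal_re/;

# Hypothetical helper: report which kind of C number a string is.
# Hexadecimal and octal are tried before decimal on purpose.
sub classify
{
    my ($n) = @_;
    return 'hexadecimal' if $n =~ /^$hex_re$/;
    return 'octal' if $n =~ /^$octal_re$/;
    return 'decimal' if $n =~ /^$decimal_re$/;
    return 'not a number';
}

for my $n (qw/0x1F 017 42 3.14/) {
    print "$n: ", classify ($n), "\n";
}
```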

Operators, variables, and reserved words

$operator_re

Match an operator such as + or --.

$word_re

Match a word, such as a function or variable name or a keyword of the language. Use "$reserved_re" to match only reserved words.

$grammar_re

Match non-operator syntactic characters such as { or [.

$reserved_re

Match a C reserved word like auto or goto. Use "$word_re" to match non-reserved words.

$cvar_re

Match a C variable, for example anything which may be an lvalue or a function argument. It does not capture the result.

use C::Tokenize '$cvar_re';
my $c = 'func (x->y, & z, ** a, & q);';
while ($c =~ /[,\(]\s*($cvar_re)/g) {
    print "$1 is a C variable.\n";
}

produces output

x->y is a C variable.
& z is a C variable.
** a is a C variable.
& q is a C variable.

(This example is included as cvar.pl in the distribution.)

Because in theory this can contain very complex things, this regex is somewhat heuristic and there are edge cases where it is known to fail. See t/cvar_re.t in the distribution for examples.

This was added in version 0.11 of C::Tokenize.

VARIABLES

@fields

The exported variable @fields contains a list of all the fields which are extracted by "tokenize".

FUNCTIONS

decomment

my $out = decomment ('/* comment */');
# $out = " comment ";

Remove the traditional C comment marks /* and */ from the beginning and end of a string, leaving only the comment contents. The string has to begin and end with comment marks.
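
Since "decomment" expects a string which begins and ends with comment marks, it combines naturally with "$trad_comment_re". The following is a sketch of extracting the text of each traditional comment; the sample C text and the trimming of surrounding whitespace are illustrative choices rather than something the module does itself.

```perl
use strict;
use warnings;
use C::Tokenize qw/$trad_comment_re decomment/;

my $c = <<'EOF';
int x; /* The X coordinate. */
int y; /* The Y coordinate. */
EOF

my @texts;
while ($c =~ /($trad_comment_re)/g) {
    # Remove the comment marks, leaving the contents.
    my $text = decomment ($1);
    # Trim the whitespace left inside the comment marks.
    $text =~ s/^\s+|\s+$//g;
    push @texts, $text;
}
print "$_\n" for @texts;
```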

tokenize

my $tokens = tokenize ($text);

Convert the C program code $text into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C text. Each token contains the following keys:

leading

Any whitespace which comes before the token (called "leading whitespace").

type

The type of the token, which may be

comment

A comment, like

/* This */

// this.

cpp

A C preprocessor instruction like

#define THIS 1

#include "That.h".

char_const

A character constant, like '\0' or 'a'.

grammar

A piece of C "grammar", like { or ] or ->.

number

A number such as 42.

word

A word, which may be a variable name or a function.

string

A string, like "this", or even "like" "this".

reserved

A C reserved word, like auto or goto.

All of the fields which may be captured are available in the variable "@fields" which can be exported from the module:

use C::Tokenize '@fields';

$name

The value of the token, stored under a key named after its type. For example, if $token->{type} equals 'comment', then the text of the comment is in $token->{comment}.

if ($token->{type} eq 'string') {
    my $c_string = $token->{string};
}

line

The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number refers to the final line.
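
The following sketch walks over the return value of "tokenize" and prints each token with its line number. It assumes, following the list of keys above, that the type key holds the name of the field and that the text of the token is stored under a key of that name. The sample C text is illustrative only.

```perl
use strict;
use warnings;
use C::Tokenize 'tokenize';

my $c = <<'EOF';
/* The answer. */
int x = 42;
EOF

my $tokens = tokenize ($c);
for my $token (@$tokens) {
    # Assumption: "type" names the field which holds the token's text.
    my $type = $token->{type};
    my $value = $token->{$type};
    printf "line %d: %-10s %s\n", $token->{line}, $type, $value;
}
```

Looking the value up through the type key avoids a chain of tests against each possible field name.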

strip_comments

my $no_comment = strip_comments ($c);

This removes all comments from its input while preserving line breaks.

use C::Tokenize 'strip_comments';
my $c = <<EOF;
char * not_comment = "/* This is not a comment */";
int/* The X coordinate. */x;
/* The Y coordinate.
   See https://en.wikipedia.org/wiki/Cartesian_coordinates. */
int y;
// The Z coordinate.
int z;
EOF
print strip_comments ($c);

produces output

char * not_comment = "/* This is not a comment */";
int x;
 

int y;

int z;

(This example is included as strip-comments.pl in the distribution.)

This function was moved to this module from XS::Check in version 0.14.

This function can also be used to strip C-style comments from JSON without removing string contents:

use C::Tokenize 'strip_comments';
my $json =<<EOF;
{
/* Comment comment comment */
"/* not comment */":"/* not comment */",
"value":["//not comment"] // Comment
}
EOF
print strip_comments ($json);

produces output

{
 
"/* not comment */":"/* not comment */",
"value":["//not comment"]  
}

(This example is included as strip-json.pl in the distribution.)

EXPORTS

Nothing is exported by default.

use C::Tokenize ':all';

exports all the regular expressions and functions from the module, and also "@fields".

BUGS

No trigraphs

No handling of trigraphs.

This is issue 4.

Requires Perl 5.10

This module uses named captures in regular expressions, so it requires Perl 5.10 or later.

No line directives

The line numbers provided by "tokenize" do not respect C line directives.

This is issue 11.

Insufficient tests

The module has been used somewhat, but the included tests do not exercise many of the features of C.

$include and $include_local assume the included file ends with .h

The C language does not impose a requirement that included file names end in .h. You can include a file with any name. However, the regexes "$include" and "$include_local" insist on a final .h.

$cvar_re misses some cases

See the discussion under "$cvar_re".

AUTHOR

Ben Bullock, <benkasminbullock@gmail.com>

COPYRIGHT & LICENCE

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.

