The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

README.txt - Readme file for PIRC compiler.

DESCRIPTION

PIRC is a fresh implementation of the PIR language using Bison and Flex. Its main features are:

  • thread-safety, so it is reentrant.

  • instruction selection, implemented in the parser.

  • constant folding, implemented in the parser.

Compiling and Running

Windows using Microsoft Visual Studio

 cl /c /I..\..\..\include *.c

 link /nodefaultlib main.obj pircompunit.obj pirlexer.obj pirparser.obj \
   pirsymbol.obj pircompiler.obj kernel32.lib msvcrt.lib ..\..\..\libparrot.lib

When running PIRC, it needs the shared library libparrot; an easy way to do this is copy libparrot.dll in the Parrot root directory to compilers/pirc/new.

Running PIRC is as easy as:

 pirc test.pir

Linux using GCC

 cc -c -I../../../include *.c

 cc main.o pircompiler.o pircompunit.o pirlexer.o pirparser.o \
     pirsymbol.o ../../../blib/lib/libparrot.so.X.Y.Z -o pirc

where X, Y and Z together represent the version of Parrot (for instance, at the time of writing this, the version is 0.7.0, so that X=0, Y=7 and Z=0.)

When running PIRC, it needs the shared library libparrot; in order to let PIRC find it, set the path as follows:

 export LD_LIBRARY_PATH=<path to parrot>/parrot/blib/lib

Running is as easy as:

 ./pirc test.pir

Overview

The new Bison/Flex based implementation of the PIR compiler is designed as a three-stage compiler:

1. Heredoc preprocessor
2. Macro preprocessor
3. PIR compiler

Heredoc preprocessing

The heredoc preprocessor takes the input as written by the PIR programmer, and flattens out all heredoc strings. An example is shown below to illustrate this concept:

The following input:

 .sub main
   $S0 = <<'EOS'
 This is a heredoc string
   divided
     over
       five
         lines.
 EOS
 .end

is transformed into:

 .sub
   $S0 = "This is a heredoc string\n  divided\n    over\n      five\n        lines.\n"
 .end

Macro preprocessing

The macro layer basically implements text replacements. The following directives are handled:

.include
.macro
.macro_const

.include

The .include directive takes a string argument, which is the name of a file. The contents of this file are inserted at the point where the .include directive is written. To illustrate this, consider the following example:

 main.pir:
 ========================
 .sub main
   print "hi\n"
   foo()
 .end

 .include "lib.pir"
 ========================

 lib.pir:
 ========================
 .sub foo
   print "foo\n"
 .end
 ========================

This will result in the following output:

 .sub main
   print "hi\n"
   foo()
 .end

 .sub foo
   print "foo\n"
 .end

.macro

The macro directive starts a macro definition. The macro preprocessor implements the expansion of macros. For instance, given the following input:

 .macro say(msg)
   print .msg
   print "\n"
 .endm

 .sub main
   .say("hi there!")
 .end

will result in this output:

 .sub main
   print "hi there!"
   print "\n"
 .end

.macro_const

The .macro_const directive is similar to the .macro directive, except that a .macro_const is just a simplified .macro; it merely gives a name to some constant:

 .macro_const PI 3.14

 .sub main
   print "PI is approximately: "
   print .PI
   print "\n"
 .end

This will result in the output:

 .sub main
   print "PI is approximately: "
   print 3.14
   print "\n"
 .end

PIR Compiler

The output of the macro preprocessor is fed to the actual PIR compiler. The PIR compiler builds a data structure during the parsing phase (often referred to as an Abstract Syntax Tree - AST). The rest of this section describes the features of the PIR compiler.

Instruction selection

As Parrot instructions are polymorphic, the PIR compiler is responsible for selecting the right variant of the instruction. The selection is based on the types of the operands. For instance:

 set $I0, 42

will select the set_i_ic instruction: this is the set instruction, taking an integer (i) result operand and an integer constant (ic) operand. Other examples are:

 $P0[1] = 42           --> set_p_kic_ic # kic = key integer constant
 $I0 = $P0["hi"]       --> set_i_p_kc   # kc = key constant from constant table
 $P1 = new "Hash"      --> new_p_sc     # sc = string constant

Constant folding

Expressions that can be evaluated at compile-time are pre-evaluated, saving calculations during runtime. Some constant-folding is required, as Parrot depends on this. For instance:

 add $I0, 1, 2

is not a valid Parrot instruction; there is no add_i_ic_ic instruction. Instead, this will be translated to:

 set $I0, 3

which, as was explained earlier, will select the set_i_ic instruction.

The conditional branch instructions are also pre-evaluated, if possible. For instance, consider the following statement:

 if 1 < 2 goto L1

It is clear during compile time, that 1 is smaller than 2; so instead of evaluating this during runtime, we know for sure that the branch to label L1 will be made, effectively replacing the above statement by:

 goto L1

Likewise, if it's clear that certain instructions don't have any effect, they can be removed altogether:

 if 1 > 2 goto L1        --> nop  # nop is no opcode.
 $I0 = $I0 + 0           --> nop

Another type of optimization is the selection of (slightly) more efficient variants of instructions. For instance, consider the following instruction:

 $I0 = $I0 + $I1

which is actually syntactic sugar for:

 add $I0, $I0, $I1

In C one would write (ignoring the fact that $I0 and $I0 are not a valid C identifiers):

 $I0 += $I1

which is in fact valid PIR as well. When the PIR parser sees an instruction of this form, it will automatically select the variant with 2 operands instead of the 3-operand variant. So:

 add $I0, $I0, $1    # $I0 is an out operand

will be optimized, as if you had written:

 add $I0, $I1        # $I0 is an in/out operand

The PIR parser can do even more improvements, if it sees opportunity to do so. Consider the following statement:

 $I0 = $I0 + 1

or, in Parrot assembly syntax:

 add $I0, $I0, 1

Again, in C one would write (again ignoring the valid identifier issue): $I0++, or in other words, incrementing the given identifier. Parrot has inc and dec instructions built-in as well, so that the above statement $I0 = $I0 + 1 can be optimized to:

 inc $I0

Vanilla Register Allocator

The PIR compiler implements a vanilla register allocator. This means that each declared .local or .param symbol, and each PIR register ($Px, $Sx, $Ix, $Nx) is assigned a unique PASM register, that is associated with the original symbol or PIR register throughout the subroutine.

Any further optimizations on register usage can be implemented by writing a register allocator that takes this initial register allocation as input, and generating a more optimized register usage. Research and benchmarking is needed to decide whether this yields more efficient bytecode. In the end it is a choice between compile-time overhead (register allocation) or runtime memory overhead (more register space needed per sub).

The implementation of the vanilla register allocator is done in the PIR symbol management module (pirsymbol.c).

Status

The PIR parser is complete, but should be tested intensively. The back-end creates a data structure representing the input. Currently, only (almost working) PASM output is generated, but eventually a Parrot Byte Code (PBC) file should be generated. In order to do this, we need a proper API to generate the appropriate data structures (such as Parrot PackFile and friends).

IMPLEMENTATION

The directory compilers/pirc has a number of subdirectories:

doc - contains documentation.
heredoc - contains the implementation of the heredoc preprocessor
macro - contains the implementation of the macro layer
new - contains the Bison/Flex implementation of PIRC
src - contains the hand-written, recursive-descent implementation of PIRC. Note that this is no longer maintained at the moment.
t - for tests.

NOTES

Usage

Currently the different compilers/pre-processors are located in different directories. The different pre-processors are invoked from the main driver in pirc.c. The latter assumes all three processors are compiled, as the following executables:

 heredoc pre-processor: hdocprep
 macro pre-processor:   macroparser

Running a file through the whole PIR compiler is then done as follows:

 $ ./pirc test.pir

When you want to run the heredoc pre-processor only, do this:

 $ ./pirc -H test.pir

When you want to pre-process the file only (heredoc + macro parsing), do this:

 $ ./pirc -E test.pir

Cygwin processable lexer spec.

The file pir.l from which the lexer is generated is not processable by Cygwin's default version of Flex. In order to make a reentrant lexer, a newer version is needed, which can be downloaded from the link below.

http://sourceforge.net/project/downloading.php?groupname=flex&filename=flex-2.5.33.tar.gz&use_mirror=belnet

Just do:

 $ ./configure
 $ make

Then make sure to overwrite the supplied flex binary.

BUGS

Having a look at this implementation would be greatly appreciated, and any resulting feedback even more :-)

  • All, except the first heredoc argument, contains 1 newline character too many. Heredoc parsing is a bit complex, and there might be many other issues.

  • Memory management needs to be improved.

  • Braced macro argument handling needs a lot of testing.

  • The current design does not allow any heredocs in .included files, because the .include directive is implemented in the second phase. This issue should be solved at some point. A possible solution is to combine the heredoc and macro preprocessors, however this might result in unmaintainable code. Another solution would be to implement the heredoc preprocessor in plain C, as opposed to the Flex implementation.

FUTURE WORK: HELP WANTED

Eventually, either IMCC needs to be fixed rigorously, or, rewritten altogether. PIRC is an attempt to do the latter. The following things need to be considered when replacing IMCC with PIRC:

  • PIR sub storage

    PIR subs are stored as PMC constants in the constant table, but it is not clear how exactly this is to be done.

  • bytecode generation

    There must be a proper bytecode API for PIRC to use.

  • :immediate and related flags

    Flags such as :immediate must be implemented; a sub that is marked with the :immediate flag must be run immediately after compilation.

At this moment, the following things are unclear to me; if anybody can answer these, that'd be helpful:

  • how are labels represented, as addresses/offsets (just plain integers)?

  • related: how is a jump/goto/branch instruction implemented? Will this take a integer operand?

  • how are named args/params handled/represented in bytecode?

The following are some ideas for the near future:

  • simplify the instruction struct by removing the union, so that invocation struct objects are no longer stored there. Once completely parsed, a invocation struct should be converted into a stream of normal instructions. This way, they will also participate in the vanilla register allocator.

  • when parsing is done, we know the total number of instructions (or we can easily count this, by incrementing a counter in the instruction constructor). Then we can create an array, and convert the linked list of instructions into an array of instructions. This would allow to easily index this array and calculate offsets (labels etc.) and such.

SEE ALSO

See also:

  • languages/PIR for a PGE based implementation.

  • compilers/pirc, a hand-written, recursive-descent PIR parser.

  • compilers/imcc, the current standard PIR implementation.

  • docs/imcc/syntax.pod for a description of PIR syntax.

  • docs/imcc/ for more documentation about the PIR language.

  • docs/pdds/pdd19_pir.pod for the PIR design document.