This is the first-stage compiler. It is implemented in compound Onramp Assembly and compiles Onramp Minimal C (omC).
This document describes the implementation. For a specification of the language it compiles, see Onramp Minimal C.
The lexer consumes characters from the input file and produces tokens. Each token is one of the following types:
- "a" -- alphanumeric (a keyword, a type name or a variable name)
- "n" -- number
- "c" -- character literal
- "s" -- string literal (not merged with adjacent string literals)
- "p" -- punctuation (a valid operator or other punctuation token)
- "e" -- end-of-file
The token types are represented by letters. This lets us compare token types using mix-type arguments in assembly.
The current token type and value are stored in the global variables lexer_type and lexer_value. The parser loads these directly where needed. It calls the lexer functions lexer_consume(), lexer_accept(), lexer_is(), lexer_take() and lexer_expect() to parse tokens.
The lexer does not use the typical C lexer hack. Instead the lexer just returns tokens of type "alphanumeric". It is up to the parser to determine whether such tokens are keywords, type names or variable names. The lexer is entirely context-free.
The lexer also handles the few preprocessor directives supported (#line and #pragma). The lexer does not handle comments; they must be stripped by an Onramp preprocessor.
Onramp Minimal C only supports the base types void, char and int, along with pointers to them: void*, int**, char*** and so on.
Since a complete type always prefixes a name, types and names are always parsed separately. This makes for much simpler parsing.
We also require that most qualifiers (typedef, extern, static) come first, and they can only be used at file scope. This means we don't have to worry about qualifiers during type parsing either. The only qualifier that can appear anywhere is const and it is ignored.
Types are stored in the low seven bits of a char. This makes it possible to work with types entirely in mix-type instruction arguments in assembly. Types have the following layout:
- bits 0-4: The pointer indirection count. This is 0 for non-pointers and 1 or more for pointers.
- bits 5-6: The basic type. 1 is void, 2 is char and 3 is int.
- bit 7: The l-value flag.
For example 0x10 is void, 0x21 is char*, 0x73 is an l-value of type int***, and so on.
All expression parsing functions emit code on-the-fly. The code they emit places the result of the expression in r0, and they return the type of the value the code places in r0. The stack is used for storage of all intermediate values. We essentially treat the bytecode as a stack machine, ignoring almost all numbered registers.
For example, parse_binary_expression() first parses the left-hand side into r0 and pushes it; then parses the right-hand side into r0; then pops the left-hand side to r1 and calculates the result into r0. Constants and variables are always loaded into r0; mix-type arguments are not used. An expression like (5 * 2) + (9 / 4) therefore emits code like this:
imw r0 5 ; 5
push r0
imw r0 2 ; 2
pop r1
mul r0 r1 r0 ; *
push r0
imw r0 9 ; 9
push r0
imw r0 4 ; 4
pop r1
div r0 r1 r0 ; /
pop r1
add r0 r1 r0 ; +The resulting code is essentially in reverse Polish notation.
This is 23 bytecode instructions. It is of course extremely inefficient, but it is also extremely simple, and it works. A major advantage of spilling all variables and expressions to the stack is that function calls don't need to do anything special to preserve registers.
When an expression returns an l-value, the emitted code places the address of the value in r0 instead of the value itself. It is effectively an additional indirection, except that this type can also be the target of an assignment expression. The function compile_dereference_if_lvalue() is used to convert an l-value into an r-value; you'll see this called wherever the real value is needed.
Local variable definitions are stored in a stack in locals.os. When a variable is defined, it is pushed onto this stack. When a block closes, all variable definitions in that block are popped.
The values of variables are stored in the current function's stack frame. Since we don't support arrays or structs, all variables take up exactly one word of space. A variable's index (plus one) is its position in words under the frame pointer.
void foo(int a) { // ---------------
char b; // | (prev. rfp) | <- rfp
{ // ---------------
void* c; // | a | <- rfp -4
} // ---------------
{ // | b | <- rfp -8
int* d; // ---------------
char** e; // | c/d/f | <- rfp -12
} // ---------------
int*** f; // | e | <- rfp -16
} // ---------------The definition of local variables and the opening and closing of blocks does not emit any instructions. The compiler instead tracks the maximum number of variables that were defined at any time in the function and reserves space for the full stack frame at the entry point of the function (see Function Generation below.)
The size of a function's stack frame can only be determined after the entire function has been parsed. Since we are a single-pass compiler, we need to emit code before we know the frame size.
We treat our output file as a stream so we can't seek back and patch the frame size in after the fact. We also don't have support in our linker to fill in values for us. We need to output the value after the function is compiled.
To keep things simple, we place the frame size in its own symbol after the function. The function prologue loads the frame size symbol on entry to create its stack frame. We emit the size afterwards once we've computed the stack usage of the full function.
For example, a function foo() is compiled like this:
=foo
enter
imw r9 ^_F_foo
ldw r9 rpp r9
sub rsp rsp r9
; ...
; ...
; ...
leave
ret
@_F_foo
<framesize>Every function compiled by this stage of compiler is structured this way so every function gets two symbols. This is a bit inefficient and adds a fair bit of linker overhead but it is very easy to implement for bootstrapping. It is also simple for our first stage code generator to optimize in cases where stack space is not needed.
When a literal string appears in a function, we assign the string a unique id, copy the string from the lexer and store it in an array. Once the function has been emitted, all strings are output as symbols and freed. For example, a function bar() that uses several strings is compiled like this:
=bar
enter
imw r9 ^_F_bar
ldw r9 rpp r9
sub rsp rsp r9
; ...
; ...
imw r0 ^_Sx0
add r0 rpp r0
; ...
; ...
imw r0 ^_Sx1
add r0 rpp r0
; ...
; ...
imw r0 ^_Sx2
add r0 rpp r0
; ...
; ...
leave
ret
@_F_bar
<framesize>
@_Sx0
"Alice"'00
@_Sx1
"Bob"'00
@_Sx2
"Carol"'00We don't support concatenation of adjacent string literals in this stage.