Skip to content

geluk/cocopy

Repository files navigation

A compiler for ChocoPy.

My main goal is to (nearly) fully implement ChocoPy as defined in the language reference. In addition, I'd like to experiment with some additional features, which are described below.

Running the compiler

Build the compiler first:

cargo build --release

Run target/release/cocopy.exe:

cocopy
A compiler for ChocoPy

USAGE:
    cocopy.exe [OPTIONS] <SUBCOMMAND>

OPTIONS:
    -h, --help                 Print help information
    -v, --verbose <VERBOSE>    [default: 1]

SUBCOMMANDS:
    check      Check a program for errors
    compile    Compile a program
    help       Print this message or the help of the given subcommand(s)
    run        Compile and run

To check, compile, or run a file, specify it as the first argument:

cocopy run ./examples/simple_add_program.py

The compilation step generates an out/out.exe file on Windows, assuming you have a Visual Studio 2022 installation with the C++ workload installed. Otherwise, you may need tweak and run build_win64.ps1 instead, which will manually link out.obj into an executable.

On Linux, the compiled executable is written to out/out. Note that you must have nasm and gcc installed. As with Windows, you can also use the build_linux64.sh script to assemble and link the executable.

Overview

Currently, the compiler is capable of compiling basic ChocoPy programs to assembly. The following steps are taken during compilation.

ChocoPy code

Start here.

a:int = 0
b:int = 9
a = b + 10 * 3
b = a * 100
if b > 10:
    print(b)

After lexing

The lexer emits a token stream indicating the type of each token (keyword, identifier, symbol, literal, etc..) and its value, if present. It also detects indentation changes, emitting special indent and dedent tokens whenever the indentation level changes.

"a" ":" "int" "=" "0" "\n"
"b" ":" "int" "=" "9" "\n"
"a" "=" "b" "+" "10" "*" "3" "\n"
"b" "=" "a" "*" "100" "\n"
"if" "b" ">" "10" ":" "\n"
<indent>
"print" "(" "b" ")" "\n"
<dedent>

After parsing and type checking

The parser combines tokens into expressions and statements, which together form an abstract syntax tree. The type checker verifies that all operations evaluate to the correct types.

The syntax tree is represented like so:

program:
  var_defs:
  - name: a
    type_spec: int
    value: integer(0)
  - name: b
    type_spec: int
    value: integer(9)
  statements:
  - assign:
      target: identifier(a)
      value:
        binary:
          lhs: identifier(b)
          op: add
          rhs:
            binary:
              lhs:
                literal: integer(10)
              op: multiply
              rhs:
                literal: integer(3)
  - assign:
      target: identifier(b)
      value:
        binary:
          lhs: identifier(a)
          op: multiply
          rhs:
            literal: integer(100)
  - if:
      condition:
        binary:
          lhs: identifier(b)
          op: greater_than
          rhs:
            literal: integer(10)
      body:
        statements:
        - evaluate:
            expression:
              function_call:
                name: print
                args:
                - identifier(b)

Intermediate code generation

The IL generator converts the syntax tree into three-address code. Complex expressions are flattened, generating lots of intermediates (%). To prevent variable reuse, variables receive a new subscript (^) every time they're reassigned. This will make it easier for the optimiser to reason about program flow.

function main
        a^1 = 0
        b^1 = 9
        %1 = 10 * 3
        %2 = b^1 + %1
        a^2 = %2
        %3 = a^2 * 100
        b^2 = %3
        if b^2 <= 10 goto if_end_1
        %4 = call print (b^2)
    if_end_1:

Intermediate code optimisation

The optimiser finds statements that can be removed or simplified, and rewrites the intermediate code.

function main
        %1 = 10 * 3
        %2 = 9 + %1
        %3 = %2 * 100
        if %3 <= 10 goto if_end_1
        call print (%3)
    if_end_1:

Assembly generation

Each instruction is converted to assembly. Operations may be broken up further into a store and apply instruction when they cannot be converted to a single assembly instruction. The register allocator assigns registers to variables, keeping track of where registers are used, and deallocating them as soon as their value is no longer needed.

main:
    push    rbp                 ; store base pointer

                                ; %1 = 10 * 3
    mov     rax, 10
    imul    rax, 3
                                ; %2 = 9 + %1
    mov     rbx, 9
    add     rbx, rax
                                ; %3 = %2 * 100
    mov     rcx, rbx
    imul    rcx, 100
                                ; if %3 <= 10 goto if_end_1
    mov     rax, 10
    cmp     rcx, rax
    jle     .if_end_1
                                ; call print (%3)
    push    rcx
    sub     rsp, 8
    call    print
    add     rsp, 8
    pop     rcx
  .if_end_1:
    xor     rax, rax            ; return 0
    pop     rbp                 ; restore previous base pointer
    ret                         ; return to caller

TODO

Lexical structure

  • Token lexing
    • Keywords
    • Symbols
    • Structure
      • Newline
      • Tab-based indentation
      • Space-based indentation
    • Identifiers
    • Integer literals
    • String literals
    • Comments
  • Structural normalisation (emitting final newlines and dedents)
  • Source code positions

Syntax

  • Literals
  • Variables
    • Type annotations
  • Functions
    • Function definition
    • Function body
    • global
    • nonlocal
  • Classes
    • Class body
  • Expressions
    • Literal
    • Identifier
    • Unary
    • Binary
    • Ternary
    • Function/method call
    • Operator precedence and associativity
    • Handle assignment targets differently from binary expressions
  • Statements
    • Expression evaluation
    • Assignment
    • if
      • Body
      • else
      • elif
    • while
    • for
    • pass
    • return
  • Blocks
  • Contextual error reporting
  • Split assignment target and expression
    • Create AST nodes
    • Create assignment target parsing functions

Type checking

  • Literals
  • Variables
    • Initialisation
  • Statements
    • Expression evaluation
    • Assignment
    • if
      • Body
      • else
      • elif
    • while
    • for
    • pass
    • return
  • Expressions
    • Unary
      • Negate
      • Not
    • Binary
      • Arithmetic
      • Integer comparison
      • Boolean comparison
      • Boolean combination
      • String operations
      • is
    • Ternary
    • Function application
    • Method call
    • Object construction
    • List display
    • List operators
    • Attribute access
    • Multiple assignment
    • Function application
  • Function definitions
  • Class definitions
  • Global environment
  • Overload resolution

Intermediate code generation

  • Literals
  • Statements
    • Expression evaluation
    • Assignment
    • if
      • Special-case comparison statements
      • Body
      • else
      • elif
    • while
    • for
    • pass
    • return
  • Expressions
    • Literal
    • Identifier
    • Binary
    • Procedure call
    • Object construction
    • List display
    • List operators
    • Attribute access
    • Multiple assignment
  • Function definitions
  • Class definitions

Optimisation passes

  • Remove unused assignments
  • Inline assigned variables when posssible
  • Remove φ-functions
  • When unifying two names, prefer variables over temporaries

Target code generation

  • Proof-of-concept
    • Generate assembly code
    • Assemble object files with nasm
    • Linking with gcc or link.exe
  • Convert intermediate code to assembly
    • Integer arithmetic
    • Procedure calls
    • Procedure return
    • Boolean operations
    • Goto
    • Conditional goto
    • Array lookup
    • Pointer dereference
    • Everything else
  • Register allocation
    • Naive register allocation
    • Spill to stack when registers are exhausted
    • Reuse registers after final reference
    • Free registers when required by a different operation
    • Save and restore registers between function calls

Operational semantics:

  • Everything

Additional goals

  • Multiple backends (native, LLVM)
  • Using Hindley-Milner type inference to make all type annotations optional
  • Good error reporting
  • More features?

Additional reading

ChocoPy

Type checking

Code generation

About

A compiler for ChocoPy

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages