mcc

"clang done worse :P"

mcc is a toy compiler that compiles a (relatively) large subset of the C language to LLVM IR.

mcc is written entirely in Haskell. It started as a hobby project by @me and @its-sami.

There are a few differences between the corresponding subset of the C specification and the language compiled by mcc. We call the compiled subset mini C.

mini C is an extension of the programming language used in the Compiling Techniques course at The University of Edinburgh.

Features of mini C are:

  • the syntax is exactly the same as C's
  • mini C's primitive types are only void, int, double, char
  • structs and arrays of arbitrary dimension are supported
  • passing structs by reference
  • passing structs by value

Key differences between mini C and C are:

  • mini C has a built-in includes preprocessor which either resolves the compilation order of the given source files or reports the presence of a cycle
  • mini C does not support function pointers
  • mini C does not support many implicit casts
  • mini C does not support variable declaration and assignment in the same statement
  • mini C does not have formal and operational semantics :(

The lexical structure of mini C is equivalent to the lexical structure of the relevant subset of C. The syntax of mini C is formally specified by the grammar in the appendix below.

Architecture

mcc follows the classical guidelines of compiler construction, adapted to the functional programming paradigm.

Therefore, mcc is split into five stages:

  • lexical analysis
  • preprocessing includes
  • syntax analysis (parsing)
  • semantic analysis (typechecking and abstract syntax tree rewriting)
  • LLVM code generation

In the design of mcc's frontend, we employ various functional programming concepts.

We make use of monads and monad transformers to isolate effectful computation. This means that each stage is encapsulated in its own monad stack and exposed as a pure function.

Examples:

The Pratt parser is a monad stack exposed as the pure function parse.

data ParserState = ParserState
  { input :: [T.Token],
    offset :: Int,
    errors :: [ParserError]
  }

type Parser a = ExceptT ParserError (State ParserState) a

parse :: String -> [T.Token] -> Either (ParseErrorBundle TokenStream Void) Ast.Program
parse file tokens = ...
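
To make the purity concrete, here is a sketch (assumed, not mcc's exact code) of how such a stack is peeled off: runExceptT unwraps the error layer and runState threads the parser state, with no IO anywhere.

import Control.Monad.Except (runExceptT)
import Control.Monad.State (runState)

-- Runs a Parser action purely: the caller gets both the result (or a
-- fatal error) and the final state, including any recovered errors.
runParserStack :: Parser a -> [T.Token] -> (Either ParserError a, ParserState)
runParserStack p tokens =
  runState (runExceptT p) (ParserState {input = tokens, offset = 0, errors = []})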

The codegen monad is exposed as the pure function compile.

type LLVM = ModuleBuilderT (State (Env Operand))
type Codegen = IRBuilderT LLVM

compile :: String -> SProgram -> LLVM.AST.Module
compile name program = evalState (buildModuleT (cs name) (generateProgram program)) emptyEnv

We use parser combinators to implement both lexical and syntax analysis.

Lexical analysis is implemented in two ways: an ad hoc version and a parser combinator version. From a theoretical standpoint, the ad hoc lexer is a classic hand-coded DFA, while the parser combinator lexer is a top-down parser for the regular grammar describing the lexical structure of mini C.
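
As an illustration of the combinator view (a standalone sketch, not mcc's actual lexer module), a lexer is just a top-down parser whose productions are the regular token patterns:

import Control.Applicative ((<|>))
import Data.Char (isAlpha, isAlphaNum, isDigit, isSpace)

-- Hypothetical token type; mcc's real lexemes live in its Lexer module.
data Tok = TokIdent String | TokLitInt Int | TokAsterisk | TokSemi
  deriving (Show, Eq)

-- A lexing step consumes a prefix of the input or fails.
type Lex a = String -> Maybe (a, String)

ident, litInt, symbol :: Lex Tok
ident (c : cs)
  | isAlpha c || c == '_' =
      let (rest, more) = span (\x -> isAlphaNum x || x == '_') cs
       in Just (TokIdent (c : rest), more)
ident _ = Nothing

litInt s = case span isDigit s of
  ("", _)    -> Nothing
  (ds, more) -> Just (TokLitInt (read ds), more)

symbol ('*' : cs) = Just (TokAsterisk, cs)
symbol (';' : cs) = Just (TokSemi, cs)
symbol _          = Nothing

-- Alternation over token productions, exactly as in a combinator library.
token :: Lex Tok
token s = ident s <|> litInt s <|> symbol s

tokenize :: String -> Maybe [Tok]
tokenize s = case dropWhile isSpace s of
  ""   -> Just []
  rest -> do
    (t, more) <- token rest
    (t :) <$> tokenize more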

We use an alex-generated lexer as ground truth in unit testing, but it is not part of the compiler itself.

Prior to the parsing stage, mcc constructs the dependency graph induced by includes and attempts to construct a topological ordering, reporting when no such ordering exists (i.e., the dependencies are cyclic).

An example program

  • main.c
#include "infrastructure.c"

int main () {
  printf("I am cyclic :(");
}
  • data.c
#include "infrastructure.c"
  • infrastructure.c
#include "data.c"

mcc will output the dependency graph as build/compilation-order.dot in the build folder.

The algorithm used is a classic DFS topological ordering, described in CLRS.
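
A sketch of that algorithm (assuming an include graph mapping each file to the files it includes; this is not mcc's actual module):

import qualified Data.Map as Map

-- Edges point from a file to the files it includes.
type IncludeGraph = Map.Map FilePath [FilePath]

data Mark = Visiting | Done deriving (Eq)

-- Returns Left a file on an include cycle, or Right the files in
-- compilation order (dependencies first).
topoSort :: IncludeGraph -> Either FilePath [FilePath]
topoSort g = reverse . snd <$> visitAll (Map.keys g) (Map.empty, [])
  where
    visitAll []       st = Right st
    visitAll (n : ns) st = visit n st >>= visitAll ns

    visit n st@(marks, order) = case Map.lookup n marks of
      Just Done     -> Right st                -- already emitted
      Just Visiting -> Left n                  -- back edge: include cycle
      Nothing -> do
        let deps = Map.findWithDefault [] n g
        (marks', order') <- visitAll deps (Map.insert n Visiting marks, order)
        Right (Map.insert n Done marks', n : order')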

Prior to the syntax analysis, we manually rewrote the grammar to remove ambiguity. This is done by stratifying expression productions into precedence levels. We also factored the grammar where it improved readability. You can find the grammars in the grammars folder.

From a theoretical standpoint, every parser we implemented is a predictive LL(k) recursive descent parser.

The default parser (the method of choice) is a variant of a Pratt top-down operator precedence parser with custom error recovery. Error recovery is classical panic mode based on the FOLLOW sets of expression-, statement- and construct-level productions: on an error, the parser records it, skips lexemes until it reaches one that may legally follow the production being recovered, and resumes from there.
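
A sketch of that recovery loop (illustrative: registerError, advance and L.EOF are assumed names; lookahead is the same helper used by lookchainl1 later in this section):

-- Panic mode: record the error, then skip lexemes until one appears in
-- the FOLLOW set of the production being recovered, e.g. Semi or RBrace
-- for statements. (registerError, advance and L.EOF are assumed names.)
recover :: ParserError -> [L.Lexeme] -> Parser ()
recover err followSet = do
  registerError err
  skip
  where
    skip = do
      lexeme <- lookahead
      if lexeme == L.EOF || lexeme `elem` followSet
        then return ()
        else advance >> skip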

There are also two parsers written using Megaparsec parser combinators. The Pratt parser itself is implemented from scratch and delivers (probably) the best error messages and overall the best error recovery.

An example of a program full of errors from which the parser recovers:

struct problematic
{
  int double +
  double x;
};

double f() {
  int x;
  int y;
  z = x + ;
}

int main() {
  if (i <)
  {
    /* code */
  
}

The parser combinator approach proved to be very readable. Most subparsers are simple descriptions of the parsing actions to be performed, closely resembling the grammar itself.

while :: Parser Ast.Statement
while = do
  off <- getOffset
  expect L.While
  cond <- between (expect L.LParen) (expect L.RParen) expr
  body <- statement
  return (Ast.While cond body off)

statement :: Parser Ast.Statement
statement =
  exprStmt <|> varDeclStmt
    <|> blockStmt
    <|> returnStmt
    <|> while
    <|> for
    <|> if'
  where
    exprStmt = do exp <- expr; expect L.Semi; return (Ast.Expr exp (getExprOff exp))
    blockStmt = do block' <- block; Ast.BlockStatement block' <$> getOffset
    returnStmt = do ret <- return'; expect L.Semi; return ret
    varDeclStmt = do
      off <- getOffset
      typ <- type'
      nameId <- ident
      decl <- varDecl off typ (varName nameId)
      return (Ast.VarDeclStatement decl off)
      where
        varName (Ast.Ident name _) = name
        varName _ = undefined

The main reason we made the Pratt parser the default is its significantly better error recovery, which is still far from being 'good' :)

The Pratt parser was very interesting to implement, and it required developing various custom combinators, such as lookchainl1, which is based on the chainl1 combinator described in the monadic parsing paper by E. Meijer and G. Hutton. This combinator conveniently captures the left recursion occurring in expressions.

lookchainl1 :: Parser a -> (L.Lexeme -> Bool) -> Parser (a -> a) -> Parser a
lookchainl1 p pred op = do left <- p; rest left
  where
    rest left = do
      lexeme <- lookahead
      if pred lexeme
        then do
          f <- op
          rest (f left) -- fold the chain left-associatively
        else return left
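
A hypothetical use (constructor and helper names such as term, Ast.Binop, Ast.Add and Ast.Sub are illustrative, not mcc's actual definitions): parsing a left-associative additive chain term ((Plus | Minus) term)*.

additive :: Parser Ast.Expr
additive = lookchainl1 term isAddOp addOp
  where
    isAddOp l = l == L.Plus || l == L.Minus
    addOp = do
      off   <- getOffset
      op    <- (expect L.Plus >> return Ast.Add) <|> (expect L.Minus >> return Ast.Sub)
      right <- term
      -- the returned function closes over op and right, so lookchainl1
      -- can fold the chain left-associatively: ((a + b) - c) ...
      return (\left -> Ast.Binop op left right off)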

We use a happy-generated LR(k) parser as ground truth in unit testing, but it is not part of the compiler itself.

Semantic Analysis

Both semantic analysis and code generation rely heavily on a generic monadic symbol table, used to capture lexical scoping and to provide a convenient interface for scope management.

An interesting note is that the symbol table does not exist as a data structure itself; rather, it is a collection of monadic actions operating on state monads that store scoping environments.

data Scope a = Scope
  { id :: ScopeId,
    parentId :: Maybe ScopeId,
    symbolTable :: Map String a
  }
  deriving (Show, Eq)

class ScopingEnv e where 
    scopes :: e a -> Map ScopeId (Scope a)
    currentScopeId :: e a -> ScopeId
    modifyScopes :: e a -> Map ScopeId (Scope a) -> e a
    modifyCurrentScopeId :: e a -> ScopeId -> e a

enterScope is defined on any scoping state monad:

enterScope :: forall e a m .
  (
    ScopingEnv e,
    MonadState (e a) m,
    Show a,
    Eq a
  ) =>  m (Scope a)
enterScope = do
  scopes <- gets scopes
  scopeId <- gets currentScopeId
  newScope <- createScope (Just scopeId)
  let newScopeId = id newScope
  let newScopes = Map.insert (id newScope) newScope scopes
  modify (`modifyScopes` newScopes)
  modify (`modifyCurrentScopeId` newScopeId)
  return newScope
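
In the same style, a lookup action can walk the scope chain through parentId until it finds the symbol or runs out of scopes. A sketch under the ScopingEnv interface above (not mcc's exact code, using the same Map import as enterScope):

lookupSymbol :: (ScopingEnv e, MonadState (e a) m) => String -> m (Maybe a)
lookupSymbol name = do
  scopeMap <- gets scopes
  scopeId  <- gets currentScopeId
  return (walk scopeMap (Just scopeId))
  where
    walk _ Nothing = Nothing
    walk scopeMap (Just sid) = do          -- Maybe monad
      scope <- Map.lookup sid scopeMap
      case Map.lookup name (symbolTable scope) of
        Just sym -> Just sym
        Nothing  -> walk scopeMap (parentId scope)  -- fall back to the parent scope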

The semantic analyser is divided into several smaller analysers.

Each analyser is a module exporting monadic semantic analysis actions, which usually destructure nodes of the abstract syntax tree.

Semantic analysis implements classic tree-walk typechecking, adapted to the functional programming paradigm. We do not employ attribute grammars. Instead, we directly implement the typing rules, relying on Haskell's powerful pattern matching capabilities.

An example of pattern matching on types and expression structure:

analyseExpr expr@(Assign left right _) = do
  left'@(leftTyp, leftExpr) <- analyseExpr left
  right'@(rightTyp, _) <- analyseExpr right

  if (not . isLValue) leftExpr
    then
      registerError (AssignmentError left right)
        >> return (Any, SAssign left' right')
    else case (leftTyp, rightTyp) of
      (Any, _) -> return (Any, SAssign left' right')
      (_, Any) -> return (Any, SAssign left' right')
      (_, Scalar (PrimitiveType Void 1)) -> do
        if isPointer leftTyp
          then return (leftTyp, SAssign left' right')
          else do
            registerError (AssignmentError left right)
            return (Any, SAssign left' right')
      (_, _) -> do
        if leftTyp == rightTyp && (not . isArray) leftTyp -- arrays are not assignable
          then return (leftTyp, SAssign left' right')
          else
            registerError (AssignmentError left right)
              >> return (Any, SAssign left' right')
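
Note the role of Any here: it acts as an error sentinel type. Once an error has been registered, Any unifies with everything, so a single mistake does not cascade into a flood of follow-up diagnostics.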

Apart from typechecking, we perform:

  • rewriting of pointer indirection (->) into a field access on a dereferenced expression
  • rewriting of all loops into do-while loops, to simplify code generation

Rewriting loops into do-whiles. For example, for (init; cond; incr) body becomes { init; if (cond) do { body; incr; } while (cond); }, and a loop with an empty condition becomes { init; do { body; incr; } while (1); }.

rewriteAsDoWhile :: SExpr -> SExpr -> SStatement -> SExpr -> SStatement
rewriteAsDoWhile init' (_, SEmptyExpr) body' incr' =
  SBlockStatement
    ( blockify
        [ SExpr init',
          SDoWhile
            (Scalar (PrimitiveType Int 0), SLitInt 1)
            (SBlockStatement $ blockify [body', SExpr incr'])
        ]
    )
rewriteAsDoWhile init' cond' body' incr' =
  SBlockStatement
    ( blockify
        [ SExpr init',
          SIf
            cond'
            ( SDoWhile
                cond'
                (SBlockStatement $ blockify [body', SExpr incr'])
            )
            Nothing
        ]
    )

blockify :: [SStatement] -> SBlock
blockify stmts = SBlock [stmt | stmt <- stmts, stmt /= emptyStmt]
  where
    emptyStmt = SExpr (voidTyp, SEmptyExpr)

Rewriting struct pointer indirection into field accesses.

analyseExpr expr@(Ast.Indirect targetExpr field off) = do
  sexpr'@(typ, sexpr) <- analyseExpr targetExpr
  case typ of
    (Scalar (StructType name 1)) -> do
      maybeStruct <- lookupStruct name
      case maybeStruct of
        (Just struct) -> case getFields field struct of
          [SVar typ' _] -> return (typ', rewriteAsDeref typ sexpr' field)
      ...
    rewriteAsDeref typ accessExpr@(accessTyp, _) field =
      LVal
        ( SFieldAccess
            (Semant.Type.decreasePointerLevel accessTyp 1, LVal (SDeref accessExpr))
            field
        )
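
This mirrors C's own desugaring, where p->f is by definition (*p).f.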

As in the semantic analyser, we divide code generation into smaller code generation modules.

We use the llvm-hs-pure library, which provides monadic module and IR builder monad transformers. We stack those on top of our codegen environment state monad, which provides the symbol table in which we store LLVM operands.

type LLVM = ModuleBuilderT (State (Env Operand))
type Codegen = IRBuilderT LLVM

Code generation is essentially the emission of LLVM basic blocks. Usually, code generation destructures the semantically attributed syntax tree and emits the relevant LLVM instructions, possibly creating new blocks.

generateStatement (SIf cond conseq alt) = do
  conseqName <- freshName (cs "if.conseq")
  altName <- freshName (cs "if.alt")
  mergeName <- freshName (cs "if.merge")

  condExpr <- generateExpression cond
  thruty <- L.icmp L.IntegerPredicate.NE condExpr (L.int32 0)
  L.condBr thruty conseqName altName

  L.emitBlockStart conseqName
  generateStatement conseq
  generateTerm (L.br mergeName)

  L.emitBlockStart altName
  generateMaybeStatement alt 
  generateTerm (L.br mergeName)

  L.emitBlockStart mergeName

generateBinop :: SExpr -> Operand -> Operand -> Codegen Operand
generateBinop expr@(_, SBinop left@(leftTyp, _) Add right@(rightTyp, _)) leftOp rightOp
  | isPointer leftTyp && isInt rightTyp = L.gep leftOp [rightOp]
  | isInt leftTyp && isPointer rightTyp = L.gep rightOp [leftOp]
  | isInt leftTyp = L.add leftOp rightOp
  | isChar leftTyp = L.add leftOp rightOp
  | isDouble leftTyp = L.fadd leftOp rightOp
  | otherwise = error ("semantic analysis failed on:" ++ show expr)

We emit packed structs (LLVM type syntax <{ ... }>), so struct layouts contain no implicit padding.

Build & Run

To build & run this project, you will need Haskell Stack. If you want to compile the generated LLVM IR into executables, you will also need an LLVM toolchain installation.

The following steps should set up the project:

  • clone the project
  • run stack build in the root folder

To start the compiler, run stack run in the root folder on a program you want to compile; such programs are usually in the tests-cases folder. Using stack install, you can install mcc into your path.

mcc supports some command-line options. By default, mcc creates a build folder and outputs the dependency graph, tokens, abstract syntax trees, semantic analysis trees, and compiled LLVM IR.

To run tests, execute stack test in the root folder.

Appendix - Syntax of mini C (see grammars folder)

program ::= (include)* (construct)* EOF

construct ::= structdecl
            | vardecl
            | funcdecl
            | funcdefn

include ::= Include LitString

structdecl ::= Struct Ident LBrace (vardecl)+ RBrace Semi

vardecl ::= type Ident [LBrack LitInt RBrack]* Semi

funcdecl ::= type Ident LParen formals RParen Semi
           | type Ident LParen RParen Semi

funcdefn ::= type Ident LParen formals RParen block

type ::= primitive_type
       | struct_type

primitive_type ::= (Type BuiltinType) stars
struct_type    ::= Struct Ident stars
sizeof_type    ::= type [Int]*

BuiltinType ::= Int
              | Double
              | Char
              | Void

stars ::= (Asterisk)*

formals ::= [ type Ident (Comma type Ident)* ]

stmt ::= expr Semi
       | block
       | While LParen expr RParen stmt
       | For LParen [expr] Semi [expr] Semi [expr] RParen stmt
       | If LParen expr RParen stmt
       | If LParen expr RParen stmt Else stmt
       | Return [expr] Semi
       | vardecl

block ::= LBrace [stmt]* RBrace

expr ::= LitInt
       | LitString
       | LitChar
       | LitDouble
       | LitNull
       | Ident
       | LParen expr RParen
       | expr op expr
       | Minus expr
       | Not expr
       | arrayaccess
       | fieldaccess
       | indirect
       | deref
       | addressof
       | funccall
       | assign
       | sizeof
       | typecast
       | expr Increment
       | expr Decrement

op ::= Less
     | Leq
     | Greater
     | Geq
     | Neq
     | Equal
     | Plus
     | Minus
     | Div
     | Asterisk
     | Mod
     | Or
     | And
     | Caret
     | Ampers
     | Bar

funccall    ::= Ident LParen actuals RParen
              | Ident LParen RParen
arrayaccess ::= expr LBrack expr RBrack
fieldaccess ::= expr Dot Ident
indirect    ::= expr Arrow Ident
deref       ::= Asterisk expr
addressof   ::= Ampers expr
sizeof      ::= Sizeof LParen type RParen
              | Sizeof LParen expr RParen
typecast    ::= LParen type RParen expr
assign      ::= expr Assign expr

actuals ::= expr
          | actuals Comma expr
