
CompilerGenerator

The purpose of this repo is to store all the code for my current Medium project.

This project, currently entitled Compiler Generator, is a series of projects/articles covering my attempt at building a metacompiler from scratch in Go, following along with the book Engineering a Compiler.

Table of Contents

  • Compiler Structure
  • Progress so Far
    • The Scanner
    • Spec File Format
    • IMPORTANT: Controlling spec syntax tokens clashing with target language
  • TODO
  • A short demo
  • How to run it
    • Changes to Execution on the way

Compiler Structure

This project is divided into three parts:

  • The Frontend - The part of the compiler responsible for ingesting raw source code, extracting valid tokens, and verifying the semantics of the language.
  • The Middleware - Intermediate code generation that converts source code into an intermediate, pseudocode-like form. Optimization happens here.
  • The Backend - The part of the compiler responsible for converting the intermediate representation into the target system's assembly architecture.

The aim is to create a metacompiler that, given a language's specification alongside valid source file(s) written in that language, compiles the source into the target architecture of a specified system.

Progress so Far

The Scanner

Currently, an automated scanner generator is implemented. Similar to the lex/flex utilities, it runs on a spec file built mainly around regex pattern matching.

A C language spec file (see the full file in specfiles/c.spec):


# Classifier

alphabet    [_a-zA-Z]
digit       [0-9]
number      {digit}+
newline     %NEWLINE
word        {alphabet}({alphabet}|{digit})*
symbols     [-+/\*&!\|\{\}=><:^;,]
equ         ([+-/=*!&\|]|((>)?>)|((<)?<))?=
left        (<)?<
right       (>)?>
brackets    [\[\]\(\)]
comment     //.*{newline}
mcomment    /\*.*\*/
float       [0-9]+((\.[0-9]*)|e((\+|-)?[0-9]+))
hex         0[xX][a-fA-F0-9]+
string      ".*"
char        '[(\')(\t)(\n)]|(.*)'
arrow       ->

%%

# Delim
'     {char}
"     {string}
//    {comment}
/\*   {mcomment}

%%

# TokenType

{string}    STRING
{number}    INTEGER
{word}      ID
{char}      CH
char        CHAR
int         INT
long        LONG
void        VOID
unsigned    UNSIGNED
*           STAR
...

Spec File Format

The spec file is akin to a modified version of the .l format used to build lex parsers. Unlike lex, which generates .c source files that compile into the parser itself, the spec is the first part of the metacompiler and directly outputs a Go list of structs, one per token. With a spec file set, the scanner reads input files and tokenizes them according to the rules in the spec file.

Structure

Similar to the lex file format, the spec format has 3 sections, separated by %% lines. The spec design is based on table-driven scanner design.

  • Note: a table-driven scanner makes use of 3 tables: one holding the DFAs accepted by the language, a second for classifying input characters, and a third holding the valid tokens produced by accepting DFA states. The spec file format is designed to mimic this behaviour with a few modifications to simplify the programming process.

Each section of the spec file's data is laid out as follows:

  • Classifier list - This has a list of all the language's regexes. These, alongside helper functions inside the scanner, simulate the concept of DFA state traversal, the saving of accepting states and rollbacks in case of failed states.
  • Delim list - This table is for all language constructs that rely on delimiting tokens to end their parsing. Tokens that go here are those for strings and comments.
  • TokenType list - These are all the valid tokens that are accepted by the specified language.

A slight drawback of this format's design is its heavy reliance on regex knowledge to simulate the DFAs needed for the target language. An optional add-on will be an init file that auto-generates common regexes for common language constructs to speed up spec file development.

IMPORTANT: Controlling spec syntax tokens clashing with target language

A glaring issue I ran into was the differentiation of tokens like { and } between the ones to be parsed and the ones that are part of the spec file's syntax. This is handled by variables written in all-caps and prefixed with %. Those currently set by default are:

%NEWLINE in place of \n
%TAB in place of \t
%PC in place of %
%LBC in place of {
%RBC in place of }
%HSH in place of #

Should you need to define \n or \t as part of your language's spec, use these variables in place of their literals to avoid your tokens being treated as spec syntax.

TODO

  • Add functionality for error logging and reporting. This will be quite monolithic, as it must span the entire compiler.
  • Complete unit testing
  • More reading for the Automated Parsing phase
  • Resolve clashing tokens -- IN TESTING

A short demo

Note: this demo is slow. I didn't realize until after extracting and saving it that my mouse scrolling wasn't tracked (I forgot how session recording works :'| ), so all the still/empty space is me scrolling. I'd recommend you give it a go if it looks interesting, but this gives you an idea of how it works anyway :).

How to run it

Start by compiling the project

go build .

Then, set your spec file in the config file. Current spec (for demo purposes) can be found in ./specfiles.

Inside config

config:[specfile path]

And then run the scanner against a target source file

./compiler [source file]

Changes to Execution on the way

  1. The option of passing in a config spec file will be added as a command line parameter.
    Example:
./compiler --config [spec file]
  2. The option to output a token file with all your tokenized values via a command line parameter.
    Example:
./compiler --outfile [filename] 
  3. A default init option for spec file creation that will initialize an empty spec file with all common regex patterns already set.
    Example:
$>./compiler --init-spec
# Classifier 

alphabet                [_a-zA-Z]
digit                   [0-9]
number                  {digit}+
newline                 %NEWLINE
word                    {alphabet}({alphabet}|{digit})*
symbols                 [-+/\*&!\|=><:^;,]
lbrace                  %LBC
rbrace                  %RBC
equ                     ([+-/=*!&\|]|((>)?>)|((<)?<))?=
left                    (<)?<
right                   (>)?>
brackets                [\[\]\(\)]
float                   [0-9]+((\.[0-9]*)|e((\+|-)?[0-9]+))
hex                     0[xX][a-fA-F0-9]+
string                  ".*"
char                    '[(\')(\t)(\n)]|(.*)'

%%

# Delims
'       {char}
"       {string}
0[xX]   {hex}

%%
# TokenType

# tokens go here
$>

Coded extensively by AlysonBee (Alyson Ngonyama)
