
CompilerGenerator

The purpose of this repo is to store all the code for my current Medium project.

This project, currently entitled Compiler Generator, is a series of projects/articles covering my attempt at building a metacompiler from scratch in Go, following along with the book Engineering a Compiler.

Table of Contents

  • Compiler Structure
  • Progress so Far
    • The Scanner
    • Spec File Format
    • IMPORTANT: Controlling spec syntax tokens clashing with target language
  • TODO
  • A short demo
  • How to run it
    • Changes to Execution on the way

Compiler Structure

This project is divided into three parts:

  • The Frontend - The part of the compiler responsible for ingesting raw source code, extracting valid tokens, and verifying the semantics of the language.
  • The Middleware - Intermediate code generation that converts source code into an intermediate, pseudocode-like form. Optimization happens here.
  • The Backend - The part of the compiler responsible for converting the intermediate representation into the target system's assembly architecture.

The aim is to create a metacompiler that, given a language's specification alongside valid source file(s) written in that language, compiles the source into the target architecture of a specified system.

Progress so Far

The Scanner

Currently, an automated scanner generator is implemented. Similar to the lex/flex utilities, it runs on a spec file built mainly around regex pattern matching.

A C language spec file (see the full file in specfiles/c.spec):


# Classifier

alphabet    [_a-zA-Z]
digit       [0-9]
number      {digit}+
newline     %NEWLINE
word        {alphabet}({alphabet}|{digit})*
symbols     [-+/\*&!\|\{\}=><:^;,]
equ         ([+-/=*!&\|]|((>)?>)|((<)?<))?=
left        (<)?<
right       (>)?>
brackets    [\[\]\(\)]
comment     //.*{newline}
mcomment    /\*.*\*/
float       [0-9]+((\.[0-9]*)|e((\+|-)?[0-9]+))
hex         0[xX][a-fA-F0-9]+
string      ".*"
char        '[(\')(\t)(\n)]|(.*)'
arrow       ->

%%

# Delim
'     {char}
"     {string}
//    {comment}
/\*   {mcomment}

%%

# TokenType

{string}    STRING
{number}    INTEGER
{word}      ID
{char}      CH
char        CHAR
int         INT
long        LONG
void        VOID
unsigned    UNSIGNED
*           STAR
...

Spec File Format

The spec file is akin to a modified version of the .l format used to build lex parsers. Unlike lex, which generates .c source files that compile into the parser itself, the spec is the first part of the metacompiler and directly outputs a Go list of structs, one per token. With a spec file set, the scanner reads input files and tokenizes them according to the rules in the spec file.

Structure

Similar to the lex file format, the spec format has 3 sections, separated by %% lines. The spec design is based on table-driven scanner design.

  • Note: a table-driven scanner makes use of 3 tables: one holding the DFAs accepted by the language, a second for classifying input characters, and a third holding the valid tokens produced by accepting DFA states. The spec file format is designed to mimic this behaviour with a few modifications to simplify the programming process.

Each section of the spec file's data is laid out as follows:

  • Classifier list - This has a list of all the language's regexes. These, alongside helper functions inside the scanner, simulate the concept of DFA state traversal, the saving of accepting states and rollbacks in case of failed states.
  • Delim list - This table is for all language constructs that rely on delimiting tokens to end their parsing. Tokens that go here are those for strings and comments.
  • TokenType list - These are all the valid tokens that are accepted by the specified language.

A slight drawback of this format's design is its heavy reliance on regex knowledge to simulate the DFAs needed for the target language. An optional add-on will be an init file that auto-generates common regexes for common language constructs to speed up spec file development.

IMPORTANT: Controlling spec syntax tokens clashing with target language

A glaring issue I ran into was the differentiation of tokens like { and } between the ones to be parsed and the ones that are part of the spec file's syntax. This is handled by variables written in all-caps and prefixed with %. Those currently set by default are:

%NEWLINE in place of \n
%TAB in place of \t
%PC in place of %
%LBC in place of {
%RBC in place of }
%HSH in place of #

Should you need to define \n or \t as part of your language's spec, use these variables in place of their literals to avoid your tokens being treated as spec syntax.

TODO

  • Add functionality for error logging and reporting. This will be quite monolithic, as it must span the entire compiler.
  • Complete unit testing
  • More reading for the Automated Parsing phase
  • Resolve clashing tokens -- IN TESTING

A short demo

Note: this demo is slow. I didn't realize until after extracting and saving it that my mouse scrolling wasn't tracked (I forgot how session recording works :'| ), so all the still/empty space is me scrolling. I'd recommend you give it a go if it looks interesting, but this gives you an idea of how it works anyway :).

How to run it

Start by compiling the project

go build .

Then, set your spec file in the config file. Current spec (for demo purposes) can be found in ./specfiles.

Inside config

config:[specfile path]

And then run the scanner against a target source file

./compiler [source file]

Changes to Execution on the way

  1. The option of passing in a config spec file will be added as a command line parameter.
    Example:
./compiler --config [spec file]
  2. The option to output a token file with all your tokenized values via a command line parameter.
    Example:
./compiler --outfile [filename] 
  3. A default init option for spec file creation that will initialize an empty spec file with all common regex patterns already set.
    Example:
$>./compiler --init-spec
# Classifier 

alphabet                [_a-zA-Z]
digit                   [0-9]
number                  {digit}+
newline                 %NEWLINE
word                    {alphabet}({alphabet}|{digit})*
symbols                 [-+/\*&!\|=><:^;,]
lbrace                  %LBC
rbrace                  %RBC
equ                     ([+-/=*!&\|]|((>)?>)|((<)?<))?=
left                    (<)?<
right                   (>)?>
brackets                [\[\]\(\)]
float                   [0-9]+((\.[0-9]*)|e((\+|-)?[0-9]+))
hex                     0[xX][a-fA-F0-9]+
string                  ".*"
char                    '[(\')(\t)(\n)]|(.*)'

%%

# Delims
'       {char}
"       {string}
0[xX]   {hex}

%%
# TokenType

# tokens go here
$>

Coded extensively by AlysonBee (Alyson Ngonyama)
