Howeng98/CompilerHW1

Lexical Analyzer (Scanner)

In this assignment, you will write a scanner for the μC language (NOT the C language) with lex. This document gives the lexical definition of the language; the syntactic definition and code generation will follow in subsequent assignments.

1. Lexical Definitions

Tokens are divided into two classes:

  • tokens that will be passed to the parser, and
  • tokens that will be discarded by the scanner (i.e., recognized but not passed to the parser).

1.1 Tokens that will be passed to the parser

The following tokens will be recognized by the scanner and eventually passed to the parser.

1.1.1 Delimiters

1.1.2 Arithmetic, Relational, and Logical Operators

1.1.3 Keywords

1.1.4 Identifiers

An identifier is a string of letters (a ~ z, A ~ Z, _) and digits (0 ~ 9) that begins with a letter or underscore. Identifiers are case-sensitive; for example, ident, Ident, and IDENT are not the same identifier. Note that keywords are not identifiers.
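The identifier rule above can be prototyped as a regular expression before writing the lex pattern. The following is a minimal Python sketch, not the scanner itself; the names IDENT_RE, KEYWORDS, and is_identifier are hypothetical, and the keyword set is a placeholder because the real list belongs to section 1.1.3.

```python
import re

# Identifier rule from the text: starts with a letter or underscore,
# followed by any number of letters, digits, or underscores.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Placeholder keyword set (an assumption for illustration only);
# the actual μC keywords are defined in section 1.1.3.
KEYWORDS = {"if", "else", "while"}


def is_identifier(text: str) -> bool:
    """Return True if text is a well-formed identifier and not a keyword."""
    return IDENT_RE.fullmatch(text) is not None and text not in KEYWORDS
```

Because matching is case-sensitive, `While` would count as an identifier here even though `while` is in the placeholder keyword set. In lex, the same effect is usually obtained by listing keyword rules before the identifier rule.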

1.1.5 Integer Literals and Floating-Point Literals

Integer literals: a sequence of one or more digits, such as 1, 23, and 666.

Floating-point literals: numbers that contain a decimal point, such as 0.2 and 3.141.
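As a rough illustration of the two literal classes, the sketch below classifies a lexeme in Python. It assumes a floating-point literal has at least one digit on each side of the decimal point, which matches the examples but is not stated explicitly; the token names INT_LIT and FLOAT_LIT and the function classify_number are hypothetical.

```python
import re

# One or more ASCII digits.
INT_RE = re.compile(r"[0-9]+")

# Assumption: digits are required on both sides of the decimal point,
# as in the examples 0.2 and 3.141.
FLOAT_RE = re.compile(r"[0-9]+\.[0-9]+")


def classify_number(text: str):
    """Return a (hypothetical) token name for a numeric lexeme, or None."""
    if FLOAT_RE.fullmatch(text):
        return "FLOAT_LIT"
    if INT_RE.fullmatch(text):
        return "INT_LIT"
    return None
```

The float pattern is checked first so that a lexeme such as 3.141 is not split into an integer followed by leftover characters; a lex scanner gets the same behavior from its longest-match rule.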

1.1.6 String Literals

A string literal is a sequence of zero or more ASCII characters appearing between double-quote ( " ) delimiters. A double-quote appearing within a string must be preceded by a backslash ( \ ), e.g., "abc", "Hello world", and "She is a \"girl\"".
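The escaped-quote rule is the part that most often goes wrong, so here is a small Python sketch of one way to express it. It assumes a string literal does not span lines, which the text does not state; STRING_RE and match_string are hypothetical names.

```python
import re

# Opening quote, then any mix of:
#   - a backslash followed by any one character (covers \"), or
#   - any character other than a quote, a backslash, or a newline
#     (excluding newline is an assumption),
# then the closing quote.
STRING_RE = re.compile(r'"(?:\\.|[^"\\\n])*"')


def match_string(source: str, pos: int = 0):
    """Return the string literal starting at pos, or None if there is none."""
    m = STRING_RE.match(source, pos)
    return m.group(0) if m else None
```

With this pattern, an escaped quote never terminates the literal, and an unterminated string yields no match rather than swallowing the rest of the input.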

1.2 Tokens that will be discarded

The following tokens will be recognized by the scanner but should be discarded rather than returned to the parser.

1.2.1 Whitespace

A sequence of blanks (spaces), tabs, and newlines.

1.2.2 Comments

Comments can be added in several ways:

  • C-style comments are text surrounded by /* and */ delimiters, which may span more than one line.
  • C++-style comments are text following a // delimiter and running up to the end of the line.

Whichever comment style is encountered first remains in effect until the appropriate comment close is encountered. For example,

// this is a comment // line */ /* with /* delimiters */ before the end

and

/* this is a comment // line with some and C delimiters */

are both valid comments.
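The "first style wins" rule can be sketched as a single left-to-right pass. The Python function below is a minimal illustration, not the lex implementation. It ignores string literals (a // inside a string would be treated as a comment) and assumes an unterminated /* comment runs to the end of the input; strip_comments is a hypothetical name.

```python
def strip_comments(source: str) -> str:
    """Remove C-style and C++-style comments; the first opener seen wins."""
    out = []
    i = 0
    n = len(source)
    while i < n:
        if source.startswith("//", i):
            # C++ style: skip up to (not including) the end of the line,
            # so any /* or */ on this line is ignored.
            end = source.find("\n", i)
            i = n if end == -1 else end
        elif source.startswith("/*", i):
            # C style: skip through the first */, ignoring // and /* inside.
            end = source.find("*/", i + 2)
            # Assumption: an unterminated comment extends to end of input.
            i = n if end == -1 else end + 2
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

In a lex scanner, the same behavior is commonly implemented with a single-line pattern for // comments and an exclusive start condition for /* ... */ comments.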

1.2.3 Other characters

Any undefined characters or strings should be discarded by your scanner during scanning.
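One common way to discard undefined input is a catch-all alternative placed after every real token pattern. The Python sketch below shows the idea with only two token classes; the names ID, INT, SKIP, OTHER, and scan are hypothetical, and a real scanner would list all delimiters, operators, keywords, and literals before the catch-all.

```python
import re

# Alternatives are tried in order; OTHER matches any single character
# that no earlier alternative accepted.
TOKEN_RE = re.compile(
    r"(?P<ID>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<INT>[0-9]+)"
    r"|(?P<SKIP>\s+)"
    r"|(?P<OTHER>.)",
    re.DOTALL,
)


def scan(source: str):
    """Return (kind, lexeme) pairs, dropping whitespace and undefined input."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind in ("SKIP", "OTHER"):
            continue  # recognized but not passed on
        tokens.append((kind, m.group()))
    return tokens
```

The lex counterpart is a final rule that matches any single character and has an empty action.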

2. What should we do in this Scanner

2.1 Assignment Requirements

We have prepared 11 μC programs to test the functionality of your scanner.

python3 judge/judge.py input/in01_arithmetic.c output/in01.out

2.2 Output

2.3 Debug

Build with the Makefile

$ make clean && make

Execute

$ ./myscanner < input/in01_arithmetic.c > output/in01.out

Check diff

$ diff -y output/in01.out answer/in01_arithmetic.out
$ od -c answer/in05_comment.out

3. Environmental Setup

  • Ubuntu 20.04 LTS
  • Install dependencies: $ sudo apt install gcc flex bison python3 git
