Porting WLA DX to target a new architecture

This is a work in progress! Right now this only duplicates information found in the backend/superfx branch! These are my notes on WLA internals as I examine the source code. Eventually, they might become a technical reference manual of sorts for porting new architectures. - cr1901

#Introduction

WLA:

Suite of assemblers.
Retargetable.
Feature-Rich.
Mature.
Written in (more-or-less) pure ANSI C, will compile with few complaints with GCC's -pedantic option.
Attempts to follow language specifications for each backend. That is, the syntax of each assembly language should be source-compatible with other assemblers. This includes opcodes, operators, and labels.
Directives are NOT source-compatible with other assemblers for a given language, though this could perhaps be mitigated by properly dividing source files into directives and actual code.

Divided into a number of core source files:

main.c
pass_[1-4].c
include_file.c
parse.c
stack.c

To add a new architecture, you must provide your own source files for opcode tables, opcode structs, and opcode decoders. Additionally, the parser and include file facilities may need to be updated depending on the features of your target architecture.

WLA has provisions for helping a user generate their own opcodes in the following directories, relative to root of source tree:

./txt/opcodes
./opcode_table_generator
./print_opcodes

#How to Port

What follows is my journey through analyzing the source code and adding my own backend (all directories relative to source tree root).

Zeroth, read the assembler manual and have some familiarity with the features of WLA before attempting a port (unsaid rule I suppose)!
First, make a git branch for your backend. I recommend a name of the form backend/${TARGET}, e.g. "git checkout -b backend/superfx".
Next, create a Makefile for your target (see ./makefiles/makefile.${HOST_OS}.${TARGET} for examples). Choose a binary name, a target-specific define, and any other switches the compiler/linker requires so that it doesn't bitc- err, complain :P. Copy it to the top level directory. There is no need to create Makefiles for all host OSes now, just the one(s) you're using for adding your backend.

Future consideration: This BEGS for an m4 script, at the cost of some host-platform portability. Perhaps not even at that cost, as long as the Makefiles are generated ahead of time. As of 71d39ac4, m4 scripts exist to auto-generate Makefiles, but they currently support only Unix or MSYS bash. Windows users can get m4 from the gnuwin32 project; it should work just fine without Cygwin, MSYS, or MinGW. Although I have the latter two installed :P.

Start with ./main.c:

You need to create a version string and a title for your assembler. Use the target-specific define you chose in your Makefile (called ${TARGET} from here on) to make sure the code conditionally compiles. Otherwise, main.c mainly does the housekeeping and initialization, and calls the pass[1-4] functions.
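As a rough illustration of the conditional-compilation pattern, here is a minimal sketch. It assumes a hypothetical target define `SUPERFX`; the names `target_title`, `target_version` and `get_target_title()` and the string contents are made up for this example and are not taken from main.c.

```c
/* Hypothetical: normally your Makefile would pass -DSUPERFX instead. */
#ifndef SUPERFX
#define SUPERFX
#endif

#ifdef SUPERFX
/* Illustrative strings only; pick your own title and version. */
static const char target_title[] = "WLA SuperFX Macro Assembler (example)";
static const char target_version[] = "0.1a";
#endif

/* Returns the title of the backend selected at compile time. */
const char *get_target_title(void)
{
  return target_title;
}

/* Returns the version string of the backend selected at compile time. */
const char *get_target_version(void)
{
  return target_version;
}
```

Building the same file with a different define would then select a different title/version pair, which is the whole point of keying everything on ${TARGET}.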

./pass_1.c:

Your opcode tables/opcode structs are used here, as well as the opcode decoder. The opcode struct (called 'optcode') tends to follow a specific format, but it IS backend-defined. The opcode struct is defined in ./defines.h. Use the ${TARGET} define to create an opcode struct in the relevant section of defines.h.

Optcode struct

An example section follows for ${TARGET} => W65816

#ifdef W65816

/* opcode types */

/* 0 - plain text  8b */
/* 1 - x              */
/* 2 - ?              */
/* 3 - &              */
/* 4 - x/? (mem/acc)  */
/* 5 - x x            */
/* 6 - REP/SEP        */
/* 7 - x/? (index)    */
/* 8 - plain text 16b */
/* 9 - relative ?     */
/* a - x (absolute)   */

#define OP_SIZE_MAX 16

#ifdef AMIGA
struct optcode {
  char *op;
  int  hex;
  short int type;
  short int skip_xbit;
};
#else
struct optcode {
  char *op;
  int  hex;
  int  type;
  int  skip_xbit;
};
#endif

In the above example, skip_xbit is a target-specific field of the opcode struct. All processors in WLA except the SPC700 have at least one target-specific field to aid with opcode/mnemonic decoding. Thus, the base opcode struct consists of 3 fields:

struct optcode {
  char *op;
  int  hex;
  int  type;
};

In WLA, a type of 0 is hardcoded to represent immediate data types, while a type of -1 indicates "end of instruction array".

After you decide on an optcode struct, you must define an array of struct optcode called opt_table[] in a source file in the root directory, preferably with a name matching the previously-existing files (opcode_${choose appropriate name here}). This file must be included in all files that handle instruction opcodes (list to be added). The opcodes MUST be in alphabetical order.
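To make the table layout concrete, here is a minimal sketch using the three-field base struct. The mnemonics, hex values, type numbers and the terminator entry's name are invented for illustration; `opt_table_is_sorted()` is a hypothetical helper, not something found in WLA. Only the -1 "end of instruction array" convention comes from the notes above.

```c
#include <string.h>

/* Base three-field opcode struct (const added here so that string
   literals can initialise the op field cleanly). */
struct optcode {
  const char *op;
  int  hex;
  int  type;
};

/* Made-up instruction set for a hypothetical target, kept in
   alphabetical order and terminated by an entry of type -1. */
struct optcode opt_table[] = {
  { "ADD", 0x10, 1 },
  { "AND", 0x20, 1 },
  { "JMP", 0x30, 2 },
  { "NOP", 0x00, 0 },
  { "",    0x00, -1 }
};

/* Hypothetical sanity check: returns 1 if the table is in
   alphabetical order up to the terminator, 0 otherwise. */
int opt_table_is_sorted(const struct optcode *t)
{
  int i;

  if (t[0].type == -1)
    return 1;
  for (i = 1; t[i].type != -1; i++) {
    if (strcmp(t[i - 1].op, t[i].op) > 0)
      return 0;
  }
  return 1;
}
```

A check like this could be run once while developing a backend to catch ordering mistakes before the lookup tables are generated.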

Future consideration: ANSI C permits typecasting an initial struct to another struct whose fields are a subset (and in the same order) of the initial struct.

Future consideration: Is the short qualifier actually needed for AMIGA, since int >= short?

opcode_p/opcode_n lookup tables

The assembler uses two lookup tables to more quickly find the proper opcode (machine code hex representation) for a given instruction mnemonic. These tables are generated using the programs in the ./opcode_table_generator subdirectory. opcode_p and opcode_n are arrays of type int, each with 256 elements. Each ASCII character (that is, every UTF-8 character with a value under 128) is meant to index into one element of each table.

Each value in opcode_p (position), when appropriate, is the index into the previously defined opt_table[] at which the mnemonics beginning with the current letter start. Each value in opcode_n (number), when appropriate, is the number of opcodes in opt_table[] beginning with the current letter. Both arrays hold 0 if no mnemonics beginning with the current ASCII letter exist.

Using the first letter of the given mnemonic this way reduces the number of string comparisons needed to find the correct instruction and operands to assemble into machine code. I'm not sure if anything resembling an associative array search is ever done in WLA.
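The first-letter lookup can be sketched roughly as follows. This is only an illustration under stated assumptions: the tables are filled in at run time here, whereas WLA generates them ahead of time; the instruction set is invented; `build_lookup_tables()` and `find_opcode()` are hypothetical names; and a plain strcmp() stands in for the real decoder, which also has to handle operands.

```c
#include <string.h>

struct optcode { const char *op; int hex; int type; };

/* Made-up, alphabetically ordered table ending with type -1. */
static const struct optcode opt_table[] = {
  { "ADD", 0x10, 1 },
  { "AND", 0x20, 1 },
  { "JMP", 0x30, 2 },
  { "NOP", 0x00, 0 },
  { "",    0x00, -1 }
};

static int opcode_p[256]; /* index of first entry for each letter */
static int opcode_n[256]; /* number of entries for each letter    */

/* Fills both tables by walking opt_table[] once. */
void build_lookup_tables(void)
{
  int i;
  unsigned char c;

  memset(opcode_p, 0, sizeof(opcode_p));
  memset(opcode_n, 0, sizeof(opcode_n));
  for (i = 0; opt_table[i].type != -1; i++) {
    c = (unsigned char)opt_table[i].op[0];
    if (opcode_n[c] == 0)
      opcode_p[c] = i;
    opcode_n[c]++;
  }
}

/* Returns the opt_table[] index of mnemonic, or -1 if not found.
   Only entries sharing the first letter are compared. */
int find_opcode(const char *mnemonic)
{
  unsigned char c = (unsigned char)mnemonic[0];
  int first = opcode_p[c];
  int count = opcode_n[c];
  int i;

  for (i = first; i < first + count; i++) {
    if (strcmp(opt_table[i].op, mnemonic) == 0)
      return i;
  }
  return -1;
}
```

With this table, looking up "NOP" compares against a single entry instead of four, and a mnemonic starting with an unused letter is rejected without any string comparison at all.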

Future consideration: bsearch in stdlib.h may be able to speed this up more?

Temporary files

WLA will generate temporary file(s) before constructing the final file. Depending on the host OS, this file will take on various names, as demonstrated by the following conditional code:

#ifdef UNIX
  sprintf(gba_tmp_name, ".wla%da", (int)getpid());
  sprintf(gba_unfolded_name, ".wla%db", (int)getpid());
#endif

#ifdef AMIGA
  sprintf(gba_tmp_name, "wla_a.tmp");
  sprintf(gba_unfolded_name, "wla_b.tmp");
#endif

#ifdef MSDOS
#if 1 /*ndef WIN32*/
  sprintf(gba_tmp_name, "wla_a.tmp");
  sprintf(gba_unfolded_name, "wla_b.tmp");
#else
  sprintf(gba_tmp_name, ".wla%lda", GetCurrentProcessId());
  sprintf(gba_unfolded_name, ".wla%ldb", GetCurrentProcessId());
#endif  
#endif

These temp files are space-delimited ASCII text representations of WLA's internal datatypes and the values found as a source file is traversed. Each entry has the form "${D}${V} ", where ${D} represents the current internal datatype and ${V} is the corresponding value. A list of WLA's supported internal datatypes can be found in the corresponding section of ./defines.h.
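A minimal sketch of how one such "${D}${V} " entry could be produced is shown below. The helper name `format_token()` is hypothetical, and the datatype letter passed in by the caller is a placeholder; consult ./defines.h for the letters WLA really uses.

```c
#include <stdio.h>

/* Writes one "${D}${V} " entry into out and returns the number of
   characters written (excluding the terminating null).
   datatype is a single-character tag; value is its payload.
   The caller must supply a buffer large enough for the entry. */
int format_token(char *out, char datatype, int value)
{
  return sprintf(out, "%c%d ", datatype, value);
}
```

For example, calling it with a placeholder tag 'd' and the value 42 yields the four characters "d42 ", trailing space included, which is what keeps consecutive entries separated in the file.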

Pass 1 overview

Opcodes are first parsed in pass_1.c, in the evaluate_token() function. By this point, the "lexer" in get_next_token() in parse.c has retrieved a string of characters between whitespace, called a token, into a buffer called tmp[]. The lexer has also determined that the string does not begin and end with '"', i.e. it is NOT a C string (if it is, the lexer adds a null terminator to tmp and other processing takes place). Beyond knowing that the token is NOT a C string, evaluate_token() has to determine what type of token the lexer retrieved, i.e. "what is the significance of these retrieved characters in creating the binary?".

The evaluate_token() function first checks whether the retrieved token is a label or a directive before attempting to match the token to a mnemonic.

#Additional Information

##Frequently used variables

Well, variables that may need additional explanation :P.

extern i => Current position in a buffer housing the open file.
include_file.c => include_file(char *) => id... include file In Directory flag (NOT identification)

##Guidelines for those working on WLA

No compiler-specific extensions!

Submitting a bug report

At least use the "Issues" system here on GitHub, but if you have more time, please create a small project under the "bug_exhibition" source code directory and use it to display the bug. That will speed up the fixing process by a lot! :)
