
STM8 eForth Compiler and Assembly

Richard edited this page Dec 21, 2023 · 26 revisions

STM8 eForth Compiler and Assembly Interface

STM8 eForth packs an optimized and efficient Forth into a small amount of Flash. Even a "Low density" device with 8K Flash ROM has sufficient space left for typical STM8 applications. However, there are times when a little assembly goes a long way. Some words can be coded in assembly so efficiently that a much larger task fits into the available space. At other times you may want to emulate a start-up delay, as an AVR fuse setting does on Atmel MCUs; that requires some assembly and patching of the reset vector. Forth is by nature close to the underlying machine code of the MCU, and understanding assembly lets you achieve things in STM8 eForth that would otherwise be impossible.

There are other reasons to write some assembler code, e.g.:

  • optimizing inner loops of time critical routines, e.g. for "bit-banging"
  • relocatable code, e.g. Flash block operations in RAM
  • avoiding Forth literals in interrupt service routines
  • operations on the bit or byte level, which are often closer to the STM8 µC instruction set than to Forth

Assembly code fragments can also be generated by Forth "compiler extensions" words. Examples are ]B! or ]B? for setting or testing bits.

Code Generated by the Compiler

STM8 eForth is a Subroutine Threaded (STC) Forth with 16-bit word width. Unlike DTC or ITC code, STC doesn't require an "Inner Interpreter". The native CALL instruction replaces the interpreter, and compiled Forth is simply executable machine code with subroutine calls.

Compared to DTC or ITC, STC needs one byte more for a compiled word. To reduce this overhead, the STM8 eForth compiler applies the following optimizations:

STM8 eForth (bytes) | Standard STC (bytes) | DTC (bytes) | Comment
RET (1) | CALL EXIT (3) | EXIT (2) |
JP <target> (3) | CALL BRANCH <target> (5) | BRANCH <target> (4) | also see ALIAS feature
CALLR <rel> (2) | CALL <target> (3) | <target> (2) | only calls "within reach"
TRAP <literal> (3) | CALL DOLIT <literal> (5) | DOLIT <literal> (4) | TRAP serves as a pseudo-opcode

The optimizations in STM8 eForth improve code density - it's usually on par with DTC or ITC code.

Forth Words in STM8 Assembly

The STM8 eForth library folder contains a number of words that create machine code at compile time (more about that below). Coding some application words in assembly, e.g. writing words that write constants to memory, set or reset bits, or inject special opcodes like HALT, is often useful.

When writing Forth words in assembly, the following rules apply:

  • the STM8 X register serves as the Data Stack Pointer and (X) points to TOS. It can be used within words, but it must be valid again at the end of the word.
  • A, Y and YTEMP are working registers. When calling other Forth words the content of the working registers must be assumed lost (but A, Y, and the flags N, C and Z may correspond to TOS - see below).

The STM8 16 bit instruction set isn't fully symmetrical. For TOS manipulation, it's useful to make the X and Y registers perform a little "dance" around the Opcode EXGW. An example is SRL in the library:

\ SRL: TOS shift right logical (unsigned divide by 2)
: SRL ( n -- n )
  [
    $9093 ,  \ LDW Y,X
    $FE C,   \ LDW X,(X)
    $54 C,   \ SRLW X
    $51 C,   \ EXGW X,Y
    $FF C,   \ LDW (X),Y
  ] ;
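
The same register dance carries over to other single-operand word instructions. As a sketch only, assuming the opcode $57 for SRAW X and using the word name SRA, which is hypothetical and not taken from the library, an arithmetic shift right could look like this:

```forth
\ SRA: TOS shift right arithmetic (the sign bit is kept)
\ hypothetical word, modelled on SRL above
: SRA ( n -- n )
  [
    $9093 ,  \ LDW Y,X    keep the Data Stack Pointer in Y
    $FE C,   \ LDW X,(X)  load TOS into X
    $57 C,   \ SRAW X     shift right, bit 15 unchanged
    $51 C,   \ EXGW X,Y   X is the stack pointer again, Y holds the result
    $FF C,   \ LDW (X),Y  store the result as the new TOS
  ] ;
```

Only the third opcode differs from SRL. X is a valid Data Stack Pointer again when the word returns, as the rules above require.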

The STM8 instruction set encodes addressing modes that use the X register more compactly than those that use the Y register. The helper word DOXCODE can be used for writing compact assembly code by providing an "execution capsule" for TOS in X. The "CRC-16-ANSI code" from the STM8 eForth library is a good example:

\ STM8: CRC-16-ANSI,  polynomial x^16 + x^15 + x^2 + 1
#require DOXCODE

\ CRC-16-ANSI (seed with -1 for Modbus CRC)
: CRC16 ( n c -- n )
   XOR DOXCODE [
      $A608 ,  \         LD      A,#8
      $54 C,   \ 1$:     SRLW    X
      $2407 ,  \         JRNC    2$
      $01 C,   \         RRWA    X   ; XOR X,#0xA001
      $A801 ,  \         XOR     A,#0x01  
      $01 C,   \         RRWA    X
      $A8A0 ,  \         XOR     A,#0xA0
      $01 C,   \         RRWA    X
      $4A C,   \ 2$:     DEC     A
      $26F3 ,  \         JRNE    1$
   ] ;
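
CRC16 consumes one byte per call, so a small wrapper is needed to run it over a buffer. The following is a minimal sketch, not library code: the name CRC16BUF is hypothetical, the length u is assumed to be greater than zero, and the core words 1+ and 1- are assumed to be available in the build.

```forth
\ CRC16BUF: Modbus CRC over u bytes starting at address a
\ hypothetical helper, u > 0 assumed
: CRC16BUF ( a u -- crc )
  -1 SWAP 1-        \ seed with -1; FOR ... NEXT runs count+1 times
  FOR               \ loop body sees ( a crc )
    OVER C@ CRC16   \ feed the next byte into the CRC
    SWAP 1+ SWAP    \ advance the address
  NEXT
  SWAP DROP ;       \ drop the address, leave the CRC
```

A convenient test vector is the ASCII string 123456789: the published check value for CRC-16/MODBUS is $4B37, so the word should return that value for those nine bytes.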

Mixing Forth Code and Assembly

Compiled Forth code and assembly code can be mixed as long as X, the Data Stack Pointer, remains unchanged (except for stack pointer manipulation).

In compiled Forth code A, Y, and YTEMP are working registers that, after the execution of most core Forth words, may contain the TOS value. Assembly can be mixed with core Forth words as follows:

The STM8EF glossary docs/words.md (a list of Forth words in forth.asm) describes the effect on flags and scratchpad variables in an extended data-flow notation:

;       @       ( a -- n )      ( TOS STM8: -- Y,Z,N )
;       Push memory location to stack

For programming in assembly this stack comment can be read as: "After execution of @ the register Y contains the TOS (top of stack) value, and the flags Z (zero) and N (negative) correspond to n". The notation extends the usual Forth stack comment conventions.

There are Forth words for passing data to or from the stack and A or Y, e.g. for using these registers as index to the memory, or for coding inner loops of low-level peripherals access or interrupt routines:

Word Description
>A pull data stack to A and Flags
A@ push contents of A:shortAddr to data stack
A> push A to data stack
>Y pull data stack to Y and Flags
Y@ push contents of Y:Addr to data stack
Y> push Y to data stack

These words are available as ALIAS definitions in out/<board>/target.
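
As an illustration of how these words bracket a piece of inline machine code, here is a sketch that swaps the nibbles of a byte. The word name NIBSWAP is hypothetical, and it is assumed that >A and A> have been made visible in the dictionary beforehand.

```forth
\ NIBSWAP: exchange the high and low nibble of a byte
\ hypothetical word; assumes >A and A> are available
: NIBSWAP ( c -- c )
  >A          \ pull TOS into register A
  [ $4E C, ]  \ SWAP A   exchange the two nibbles of A
  A> ;        \ push A back onto the data stack
```

No other Forth word runs between the inline opcode and A>, so the content of A is not lost along the way.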

Extending the compiler to create assembly code

The Forth compiler is just a special mode of the Forth interpreter. Hence, it's easy to write Forth words that assemble machine code. For example the following code compiles either a BSET <addr>,#<bit> or a BRES <addr>,#<bit>:

\ STM8EF : ]B!                                                         MM-170927
  \ Enable the compile mode and compile the code to set|reset the bit at addr.
  : ]B! ( 1|0 addr bit -- )
     ROT 0= 1 AND SWAP 2* $10 + + $72 C, C, , ]
; IMMEDIATE
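
A usage sketch may help to see what ]B! puts into the dictionary. The constant PB_ODR and the word names are assumptions made for this example; $5005 is the Port B output data register on STM8S devices.

```forth
#require ]B!

$5005 CONSTANT PB_ODR   \ assumption: Port B output data register (STM8S)

\ hypothetical words: drive pin PB5 high or low
: PB5.HI ( -- ) [ 1 PB_ODR 5 ]B! ;  \ compiles 72 1A 50 05 = BSET $5005,#5
: PB5.LO ( -- ) [ 0 PB_ODR 5 ]B! ;  \ compiles 72 1B 50 05 = BRES $5005,#5
```

The second opcode byte is $10 + 2 * bit for BSET, plus 1 for BRES, which is exactly what ROT 0= 1 AND SWAP 2* $10 + + computes from the flag and the bit number.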

There are more words in the library that directly compile STM8 machine code:

Word Description
[ a ]@ push contents of word at literal a to TOS
[ f a #b ]B! set bit #b in byte at literal a to literal f
[ a #b ]B? push flag with value of bit #b in byte at literal a to TOS
[ a #b ]BC copy bit #b in byte at literal a to carry flag
[ a #b ]BCPL complement bit #b in byte at literal a
[ c a ]C! set byte at literal a to literal c
[ a #b ]CB copy carry flag to bit #b in byte at literal a
[ f a #b ]SPIN spin until bit #b at literal a is equal to f

Forth control structures and assembly code

Some use cases require STM8 machine code branch instructions instead of ?branch or branch that use absolute addressing (e.g. relocatable code in RAM, fast ISRs). Calculating the branch offset is better left to the compiler.

#require >REL

NVM
: test ( f -- )  IF ."  True" ELSE ."  False" THEN ;
RAM

IF compiles to >Y JREQ rel (CD 8A C 27 B) and ELSE to JRA rel (20 A):

' test 20 dump
955B  CD 8A  C 27  B CD 89 91  5 20 54 72 75 65 20  A  ___'_____ True _
956B  CD 89 91  6 20 46 61 6C 73 65 81  0  0  0  0  0  ____ False______ ok

This approach makes using Forth control structures in relocatable code possible. Since the range of branch targets is limited to +/- 127 it's best to load >REL to RAM and to WIPE it after compiling the code that needs it.

The most important use case is replacements for IF like [ a #b ]B@IF or [ c ]A<IF, which make good use of the Forth - Assembler interface to speed up code in device drivers or interrupts in Forth.

Here is a list of ]..IF words in the library:

Word Description
[ a ]@IF test contents of word at literal a and perform IF with relative branch
[ ... c ]A<IF test if A is < literal c and perform IF with relative branch
[ a #b ]B@IF test bit #b at literal a and perform IF with relative branch
[ a ]C@IF test contents of byte at literal a and perform IF with relative branch
[ ... n ]Y<IF test if Y is < literal n and perform IF with relative branch

Writing new [ .. ]..IF words is easy. The following example from the IR Remote Control Demo implements a JRNC branch:

#require >REL

\ ( carry-flag ) IF with relative addressing
: ]CFIF ( -- ) $24 C, ] >REL ;  \ JRNC rel
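
The inverse test follows the same pattern. This is a sketch with a hypothetical word name, assuming the opcode $25 for JRC: the compiled branch skips the IF body when the carry flag is set, so the body runs when the carry flag is clear.

```forth
#require >REL

\ ( carry-flag clear ) IF with relative addressing
\ hypothetical word, modelled on ]CFIF
: ]NCIF ( -- ) $25 C, ] >REL ;  \ JRC rel
```

In both words the opcode is the condition for skipping the body, which is why ]CFIF compiles JRNC and this variant compiles JRC.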

Please refer to the IR RC Demo above or to the STM8 I2C driver for more usage details.

Forth Cross-Assembler in e4thcom

There is an experimental Forth cross-assembler in e4thcom. Some STM8 op-codes cannot yet be rendered, and it's not fully tested. If you're interested in testing it, please open an issue.

Forth Boot

This section uses the MINDEV board as an example. Enough information is provided to allow you to apply these concepts to the other supported boards.

During the start-up process the STM8 chip waits until the power supply voltage rises to an acceptable level, sets default register values and then starts program execution at the reset vector address, $8000. Looking at the file MINDEV.ihx with the ST visual programmer one can see that the first four bytes of machine code at $8000 are $82 $00 $80 $6B. This is the reset vector, and it points to $806B.

On reset, program execution begins at $806B, which hooks into COLD. COLD calls 'BOOT and then jumps to QUIT to start interpreting.

The mechanics of this are not vital. But look at COLD in the listing FORTH.RST and you will see that at line 519 COLD initialises the UART. That is fine until you discover that a device attached to the UART has not yet initialised, and a latch-up of that device occurs.

The fix? With an AVR we would set a fuse to delay start up until more clock cycles have passed. With the STM8 we can do this from the terminal prompt:

NVM HERE \ pushes address we are about to start writing op-codes to onto the stack
      $90AE , 12500 ,  \     LDW   Y, #(t*8)
      $A604 ,          \ 1$: LD    A, #4
      $4A C,           \ 2$: DEC   A
      $26FD ,          \     JRNE  2$ 
      $905A ,          \     DECW  Y
      $26F7 ,          \     JRNE  1$
      $CC C, $8002 @ , \     JP    #(RESET)  remember, the word at $8002 is where the reset vector was pointing to.
$8002 ! RAM            \ save the address we started writing the op-codes at into the reset vector.

Now, every time the device is turned on it runs this piece of code, which does nothing more than insert a delay before COLD is executed. With a value of 125 the routine above results in a 1ms delay (125 * 8µs) before COLD is called. A value of 12500 delays start-up by 100ms. Depending on the required delay some bytes can be saved by removing the inner loop.
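
The loop count can be derived from the delay in milliseconds. This is a small sketch with a hypothetical helper name, based on the figure of 8µs per count given above:

```forth
DECIMAL
\ MS>COUNT: loop count for a start-up delay given in ms
\ hypothetical helper; 125 counts per ms, i.e. 8µs per count
: MS>COUNT ( ms -- n ) 125 * ;

100 MS>COUNT U.   \ prints 12500
```

Since Y is a 16-bit register, the count cannot exceed 65535, which limits this routine to roughly 524ms (65535 * 8µs).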

If at a later date you wish to RESET then you must restore the reset vector first with:

NVM 
$8002 @ $E +  \ the old reset target was compiled $E bytes after the start of the delay code
@ $8002 !     \ fetch the old reset target and write it back into the reset vector
RAM

'BOOT will, of course, work normally.

Software Reset

Some of the concepts above are demonstrated by this example. Once your project is running un-tethered the need to restart the micro can arise. Perhaps a serial link has stopped working. Writing code to continually test the link can be done, except the potential exists that the error lies within the micro somewhere. Expediency can justify simply restarting the micro. Doing this with software is straightforward with a small piece of assembly. Here is an example where the micro is put to sleep for 60 seconds and upon waking up is reset:

: RST60S ( -- )    \ halt for 60SECS AND RESET 
  [ 20 AWU_APR ]C! [ $F AWU_TBR ]C!   \ load the AWU registers
  [ 16 AWU_CSR1 ]C! \ enable AWU
  [ $8E C, ] \ HALT for AWU period
  [ $80  WWDG_CR ]C! \ on wake-up force a reset
 ;

]C! is your friend. Start using it wherever you might have written code such as 20 AWU_APR C! and your code will use much less memory. There is no high level alternative to [ $8E C, ].
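
If the same opcode is needed in several places, it can be wrapped in a word of its own. The names below are hypothetical; the opcodes assumed are $8E for HALT and $8F for WFI.

```forth
\ hypothetical wrapper words for single-byte opcodes
: HALT ( -- ) [ $8E C, ] ;  \ HALT  stop until an interrupt or AWU event
: WFI  ( -- ) [ $8F C, ] ;  \ WFI   wait for interrupt
```

The trade-off is a subroutine call and a dictionary entry per word, against a single byte when the opcode is compiled inline as in RST60S.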

Low power consumption

Once your project is powered by batteries the ~5mA the STM8 draws is significant. There are two approaches to reducing power consumption: reduce the power used while running and reduce the power when asleep.

The approach while code is running is straight from the reference manuals:

While in Run mode, still keeping the CPU running and executing code, the application has several ways to reduce power consumption, such as:
• Slowing down the system clocks
• Gating the clocks to individual peripherals when they are unused
• Switching off any unused analog functions

However, when the CPU does not need to be kept running, three dedicated low power modes can be used:
• Wait
• Active-halt (configurable for slow or fast wakeup)
• Halt (configurable for slow or fast wakeup)

Here is an example where the UART is disabled, and the flash and master voltage regulator are set for the lowest possible power consumption while the device is halted:

: SET_LP \ setup for low power ( run fast sleep longer strategy used here ) 
   [ $00 UART_CR2 ]C! \ DISABLE UART, could be extended to other peripherals 
   [ 0 CLK_PCKENR1 3 ]B!  \ turn off clock to uart ( for STM8S003. Can be different for other STM8's )  
   [ 1 FLASH_CR1 2 ]B!  \ set flash to power down when in active halt state  
   [ 1 CLK_ICKR 5 ]B!   \ turn off master voltage regulator when in active halt  
;

There is no need to run the UART if your micro is not using a serial comms port; disabling it saves some 120uA on the STM8S003. Another 100uA can be found by turning off the SPI and I2C peripherals if they are unused.

If your application requires the STM8 to do nothing for periods of time then a large reduction in power consumption is achieved by putting the micro to sleep. Just pausing can save around 4mA for the period concerned. To reduce that still further requires the flash and MVR be configured as per SET_LP. Here is an example where sleep of differing lengths is implemented:

: (PAUSE)           \ factored part of "pause" 
  [ 16 AWU_CSR1 ]C! \ enable AWU
  [ $8E C, ]        \ HALT for AWU period
  [ $0 AWU_TBR ]C!  \ to minimise power consumption once awake
  ;
: PAUSE20 ( -- )                    \ halt for 20ms 
  [ 40 AWU_APR ]C! [ 7 AWU_TBR ]C!  \ load AWU registers
  (PAUSE)                           \ now pause 
 ;
: PAUSE500 ( -- )                   \ halt for 500ms
  [ 62 AWU_APR ]C! [ $B AWU_TBR ]C! \ load AWU regs
  (PAUSE)                           \ now pause 
 ;
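
Other pause lengths can be added in the same way. The sketch below rests on an assumption: the values above appear to follow interval ≈ 2^(TBR-1) * APR / 128kHz (20ms: 2^6 * 40 / 128000, 500ms: 2^10 * 62 / 128000 ≈ 496ms). The word name is hypothetical, and the exact divider should be checked against the AWU chapter of the reference manual.

```forth
\ PAUSE100: halt for roughly 100ms (hypothetical word)
\ assumption: interval ≈ 2^(TBR-1) * APR / 128kHz
: PAUSE100 ( -- )
  [ 50 AWU_APR ]C! [ 9 AWU_TBR ]C!  \ 2^8 * 50 / 128000 Hz ≈ 100ms
  (PAUSE)                           \ factored part defined above
 ;
```

The low speed oscillator that clocks the AWU is not very accurate, so such intervals are only approximate in any case.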

While paused, the indicated consumption was 0.01 mA (10uA), compared with the expected maximum of 20uA. That included the quiescent current for the STM8S003, a voltage detector IC and a low drop out regulator.

If your application does not go to sleep often then slowing down the CPU clock would prove beneficial.

Faster and Smaller

When you really need fast then smaller is the key. Using the "compiler extension" words is well worth the learning curve. I used TIMER4 to test code as follows:

VARIABLE 'TEST
10 CONSTANT TEST1
: TIMOH TIMON TIMOFF TIMER@ 'TEST ! ; \ overhead in timer use stored in 'TEST
TIMOH
: T@. TIMER@ 'TEST @ - . CR ;
: A@ [ 200 ]@ ;
: B 
   TIMSETUP
   TIMON DUP SWAP TIMOFF T@.   \ ToS is duplicated and later on brought back to top of stack  **50**
   TIMON DUP TIMOFF T@.        \ Same as above but no SWAP                                    **23**
   TIMON 'TEST @ TIMOFF T@.    \ Don't dup and swap, simply fetch again                       **49**
   TIMON TEST1 TIMOFF T@.      \ Use a constant instead where possible                        **50** 
   TIMON [ 200 ]@ TIMOFF T@.  \ Use the Word ]@ instead of @ where your variable is in the 256 bytes shortmem **8**
   TIMON A@ TIMOFF T@.        \ note penalty for calling A@ over compiling inline as per line above **18**
   TIMOH TIMON $4000 @ TIMOFF T@. \ Eeprom is slower than Flash                               **71**
   TIMOH TIMON [ $4000 ]@ TIMOFF T@.   \ But using ]@ reduces the penalty significantly       **20**
;

Executing B returns the typical values shown as comments above.

Using Timer4 I tested a few things and this is what I found:

  1. Fetching a variable again is 1 clock cycle faster than manipulating the stack with a DUP and a later SWAP, and it is far less error prone than stack manipulation.
  2. Where you consume the ToS, i.e. no SWAP is needed, DUP is faster than fetching the variable again.
  3. Using a constant is 1 clock cycle slower than fetching a variable.
  4. Get your variables into shortmem (see below) by defining them before anything else, and ]@ is 6 times faster than @.
  5. The overhead of calling a word which has compiled ]@ is significant, but it is still faster than the other alternatives tested. It might make your code easier to write and debug in the first instance.
  6. If you're storing variables in Eeprom so you can update them while developing it will be faster to use a shadow copy if you fetch them often. However, using ]@ makes fetching from EEPROM significantly quicker.

And if you are working with registers the assembly version is smaller and faster e.g.

\ Timsetup takes 48 bytes of storage, but TimSetupA below uses just 29 bytes.

: TimSetup ( -- ) \ setup timer 4 to count at clock Fmaster
   0 TIM4_CR1 C!
   0 TIM4_EGR C!
   0 TIM4_SR C!
   0 TIM4_PSCR C! \ prescaler value of 2^0 or 1
;
: TimSetupA ( -- ) \ setup timer 4 to count at clock Fmaster
   [ 0 TIM4_CR1 ]C!
   [ 0 TIM4_EGR ]C!
   [ 0 TIM4_SR ]C!
   [ 0 TIM4_PSCR ]C! \ prescaler value of 2^0 or 1
;

A note on memory addressing

Throughout the STM8 instruction set manual you will see references to shortmem and longmem. Opcodes using shortmem address the first 256 bytes of RAM. The advantage of shortmem is that the instructions are shorter, because the address takes one byte instead of two. There is no speed penalty for using longmem addresses in RAM. However, if you use shortmem opcodes then you need to check that the address you are trying to access is in fact in shortmem. It is best to define such variables first, before any other RAM is used by your application, and to watch for the 256-byte boundary.
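
A quick check at the terminal can confirm that a variable ended up below the boundary. This is a sketch with hypothetical names, assuming the core word U< is available:

```forth
\ SHORT?: true if address a lies in the first 256 bytes (shortmem)
\ hypothetical helper
: SHORT? ( a -- f ) $100 U< ;

VARIABLE COUNTS      \ hypothetical variable
COUNTS SHORT? .      \ -1 if COUNTS lies below $100, else 0
```

Running such a check once after defining the variables is cheaper than debugging a shortmem opcode that silently reads the wrong address.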
