Skip to content
Stefano edited this page Oct 14, 2019 · 6 revisions

PI.C

{{libnew:examples:pi.png?318|Pi to 800 decimal places}}

Here is a fun program that computes pi to 800 decimal places.

The purpose of this example is to show how a program's performance can be improved through simple source code modifications and appropriate library configuration.

Original Source Code

The original source code was written for a 32-bit machine. On 32-bit machines, ints are 32-bit so a straightforward adaptation to z88dk simply replaces ints with longs.

:::c
#include `<stdio.h>`

#pragma output CRT_ORG_CODE = 0
#pragma output REGISTER_SP  = 0

int main()
{
   long r[2800 + 1];
   long i, k;
   long b, d;
   long c;

#asm
di
ticks_start:
#endasm

   c = 0;
   
   for (i = 0; i < 2800; i++)
      r[i] = 2000;

   for (k = 2800; k > 0; k -= 14)
   {
      d = 0;
      i = k;
        
      while (1) 
      {
         d += r[i] * 10000;
         b = i * 2 - 1;

         r[i] = d % b;
         d /= b;
         
         i--;
         if (i == 0) break;
         
         d *= i;
      }
      
      printf("%.4d", (int)(c + d / 10000));

      c = d % 10000;
   }

#asm
ticks_end:
#endasm

   return 0;
}

We will measure the performance of the program in terms of program size and speed.

For program size we will look at the size of the code generated by the compiler. To find this out we will use the linker's "--split-bin" option to generate separate binaries for each declared section in the program. Among those sections will be "code_compiler" which will hold the compiler's assembler output.

For speed measurement we will use ticks which is a command-line z80 emulator that measures execution time in z80 cycles. Two assembler labels have been inserted into the C source so that we can grab the address range of instructions to be timed. Interrupts have been disabled so that they are not included in the result. Pragmas have been added to set the org to 0 and to move the stack to the top of memory. ticks takes as input a memory image that starts at address 0. We need to confirm the binary size does not exceed 16384 bytes as the zx target has its display file there and printf will be writing to that address.

For this first performance measurement the steps taken will be listed in detail.

1. Compile using sccz80 for the zx target. Any target that supports printf will do.

zcc +zx -vn -clib=new -startup=4 pi.c -o pi -m

The output will include the binary "pi_CODE.bin" and the map file "pi.map".

2. Measure execution time with ticks. Find the hex addresses of "ticks_start" and "ticks_end" from the map file created in the last step. A text search for "ticks" will reveal that ticks_start=17f7 and ticks_end=19c4. The counter value is the maximum count before ticks gives up; for this program it had to be raised to allow the program time to complete.

ticks pi_CODE.bin -start 17f7 -end 19c4 -counter 9999999999

The result is execution time in z80 cycles.

3. Determine size of the compiler generated code. We compile again but with the linker's --split-bin option enabled.

zcc +zx -vn -clib=new -startup=4 pi.c -o pi -Cl--split-bin

One binary file per section will be output. We are interested in the code section generated by the compiler.

dir *compiler.bin
478 pi_code_compiler.bin

Total compiler generated code is 478 bytes.

For sdcc, the compile lines used are:

zcc +zx -vn -clib=sdcc_ix -SO3 -startup=4 --max-allocs-per-node200000 --reserve-regs-iy pi.c -o pi -m
zcc +zx -vn -clib=sdcc_iy -SO3 -startup=4 --max-allocs-per-node200000 pi.c -o pi -m

Warnings will be issued that z80instructionSize() failed. This is because the peephole optimizer does not understand "#asm" and "#endasm" and these can be ignored.

Performance

BINARY SIZE *3 C CODE SIZE TIME
sccz80 7976 (19200) 478
sdcc_ix *1 8102 (19326) 682
sdcc_iy *2 8094 (19318) 652

*1 Space allocated for local variables exceeds 128 bytes which causes sdcc's compile to generate code using iy despite the --reserve-regs-iy flag. This is a known issue and we ignore it here as the next versions reduce the size of the stack frame.

*2 The --reserve-regs-iy issue means the sdcc_iy compile is bugged.

*3 The total program size is going to include space for the terminal driver which includes font, line editing and terminal emulation code. It's not indicative of the C code itself but it is included so that the impact of some library options can be seen. The number in brackets includes the space allocated on the stack for the local variables.

Source Code Optimization

1. Choose the smallest appropriate types for variables.

The z80 has 8- and 16-bit characteristics but not 32-bit. To get best performance we want to get rid of as many longs as possible.

  • k. The maximum value of k is 2800. It can be an int.

  • i. The maximum value of i is 2800. It can be an int.

  • b. The maximum value of b is 5599. It can be an int.

  • r. The maximum value of r[i] is 5598. Its elements can be int.

  • c. The maximum value of c is 9999. It can be an int.

"d" is the only variable that needs to be 32-bit.

2. Choose unsigned types whenever possible.

Going unsigned prevents sign extension code from being inadvertently inserted by the compiler and many z80 operations are faster for unsigned types.

3. Prefer static to local.

Static variables are assigned to a fixed memory address. Local variables are created temporarily on the stack.

The z80 is not very good at accessing data relative to the stack pointer. By making local variables static, the program will be both faster and smaller. Doing this does mean the affected code is no longer reentrant but in this case that doesn't make a difference.

4. Use pre-decrement rather than post-decrement where possible.

If you are familiar with C++ this is probably already burned into your brain. It applies to C as well and for the same reason. A post-increment operation like "i++" means "increment the value of i by one but return the old value as result." Returning the old value means the compiler has to first save the old value, then increment i, then restore the old value for the result. On the other hand, a pre-increment operation as in "++i" means "increment the value of i and return the new value as result." The compiler no longer has to remember what the old value was.

The impact is small but it is there and it is a good habit to acquire.

Following the above points, the new source code appears as below. The types from stdint.h are being used as they indicate bit widths in their names. Also note the care taken to ensure type conversions are explicit in the source. A known issue with sccz80 is that it can sometimes demote longs in mixed integer expressions unless explicit typing is present.

:::c
#include `<stdio.h>`
#include `<stdint.h>`

#pragma output CRT_ORG_CODE = 0
#pragma output REGISTER_SP  = 0

int main()
{
   static uint16_t r[2800 + 1];
   static uint16_t i, k;
   static uint16_t b;
   static uint32_t d;
   static uint16_t c;

#asm
di
ticks_start:
#endasm

   c = 0;
   
   for (i = 0; i < 2800; ++i)
      r[i] = 2000;

   for (k = 2800; k > 0; k -= 14)
   {
      d = 0;
      i = k;
        
      while (1) 
      {
         d += (uint32_t)(r[i]) * 10000UL;
         b = i * 2 - 1;

         r[i] = (uint16_t)(d % (uint32_t)(b));
         d /= (uint32_t)(b);
         
         if (--i == 0) break;
         
         d *= (uint32_t)(i);
      }
      
      printf("%.4d", c + (uint16_t)(d / 10000UL));

      c = (uint16_t)(d % 10000UL);
   }

#asm
ticks_end:
#endasm

   return 0;
}

Something we didn't do is further prune the crt by introducing pragmas to eliminate features, such as the heap, that are not being used. Doing this would likely shave another 2k or so from the binary but since that does not affect speed it's not done here.

Performance

BINARY SIZE *1 C CODE SIZE TIME
sccz80 13310 316
sdcc_ix 13315 339
sdcc_iy 13310 339

*1 The sudden increase in binary size may be surprising at first but the explanation is simple. The array r[] occupies 5602 bytes. In the previous version of this program, r[] was allocated on the stack at runtime and was not part of the binary. In this version, it is allocated permanent memory in the bss section and this is included in the binary. Note that there is no free ride here -- whether r[] exists on the stack or in the bss section, it occupies space. Local variables have the additional advantage that they are deleted when they leave scope.

Library Configuration

Library configuration requires editing the target's "clib_cfg.asm" file and then re-building the library.

1. Speed up integer math. defc CLIB_OPT_IMATH = 75

A value greater than 50 selects the fast integer math library. The default options for the fast math library (CLIB_OPT_IMATH_FAST) enable leading zero-bit elimination in multiplies and divides. Other options do loop unrolling and LIA-1 overflow behaviour. The latter is not yet compatible with the compilers and the former comes at large memory cost so in this test we keep the default.

2. Reduce printf footprint. defc CLIB_OPT_PRINTF = $01

The only converter used is "%d" so the rest are removed.

The library is rebuilt by running "Winmake zx" from {z88dk}/libsrc/_DEVELOPMENT and the program is recompiled to get new results.

Performance

BINARY SIZE *1 C CODE SIZE TIME
sccz80 13585 316
sdcc_ix 13614 347
sdcc_iy 13584 347
*1 Perhaps hidden in these numbers is that the integer math library has expanded by about 900 bytes but printf has shrunk by about 700.

Source Code Transformation

This last step involves improving the source code. We won't do too much other than to note that two 32-bit divisions are in fact done four times. Have a look at these two excerpts:

:::c
r[i] = (uint16_t)(d % (uint32_t)(b));
d /= (uint32_t)(b);

and

:::c
printf("%.4d", c + (uint16_t)(d / 10000UL));
c = (uint16_t)(d % 10000UL);

A division and a mod are being performed using the same arguments. In effect, the same division is being done twice. On a 32-bit machine with hardware division available this might not matter too much but on a small micro, it contributes greatly to execution time.

The C library supplies special div() functions to get the quotient and remainder from a single division. In the standard, div() and family return a struct containing the quotient and remainder as result. Neither sdcc nor sccz80 supports passing structs as parameter or return value so the new clib modifies these functions to accept a pointer to a struct where the result will be written. Because they are different, the names of the functions have been changed to include surrounding underscores.

:::c
#include `<stdio.h>`
#include `<stdint.h>`
#include `<stdlib.h>`

#pragma output CRT_ORG_CODE = 0
#pragma output REGISTER_SP  = 0

int main()
{
   static uint16_t r[2800 + 1];
   static uint16_t i, k;
   static uint16_t b;
   static uint32_t d;
   static uint16_t c;
   static ldivu_t res;

#asm
di
ticks_start:
#endasm

   c = 0;
   
   for (i = 0; i < 2800; ++i)
      r[i] = 2000;

   for (k = 2800; k > 0; k -= 14)
   {
      d = 0;
      i = k;
        
      while (1) 
      {
         d += (uint32_t)(r[i]) * 10000UL;
         b = i * 2 - 1;

         _ldivu_(&res, d, (uint32_t)(b));
         
         r[i] = res.rem;
         d = res.quot;
         
         if (--i == 0) break;
         
         d *= (uint32_t)(i);
      }
      
      _ldivu_(&res, d, 10000UL);
      
      printf("%.4d", c + (uint16_t)(res.quot));

      c = res.rem;
   }

#asm
ticks_end:
#endasm

   return 0;
}

Performance

BINARY SIZE C CODE SIZE TIME
sccz80 13625 308
sdcc_ix 13620 325
sdcc_iy 13602 325

Summary

Being conscious of the differences between a 32-bit machine and a small micro and taking advantage of library configuration has led to about a 4x improvement in speed and a third reduction in binary size.

- ORIGINAL - IMPROVED -
Binary Size Time Binary Size Time
sccz80 19200 6,671,539,001 13625 1,458,354,526
sdcc_ix 19326 5,683,348,313 13620 1,513,023,612
sdcc_iy 19318 - 13602 1,474,179,212
Clone this wiki locally