
Can't get different optimisation levels to work in c2m #339

Open

DangerMouseB opened this issue May 31, 2023 · 5 comments

Comments

@DangerMouseB
Hi,

This may well be a really daft question but I thought I'd ask. If I do:

c2m knight_mir.c -c -O0 -o knight_mir.bmir0
c2m knight_mir.c -c -O1 -o knight_mir.bmir1
c2m knight_mir.c -c -O2 -o knight_mir.bmir2
c2m knight_mir.c -c -O3 -o knight_mir.bmir3

or similarly with the -S option, the output files are identical. And if I run the program:

c2m knight_mir.c -O0 -eg
c2m knight_mir.c -O3 -eg

It runs at the same speed. Is it my program (below), or am I doing something wrong in how I'm using c2m?

Many thx,

knight_mir.c is:

extern int atoi(const char*);
extern void *calloc(unsigned long nitems, unsigned long size);
extern void *malloc(unsigned long size);
extern int abs(int x);
extern void exit(int status);

extern int printf(const char *format, ...);

#define FILE void
#define stdout __stdoutp
extern FILE * __stdoutp;
extern int fprintf(FILE *stream, const char *format, ...);


#define time_t void
#define SIZEOF_TIME_T 8
extern time_t time(time_t *t);
extern char *ctime(time_t *timer);


int N = 0;
int **b;
time_t *t;

int board() {
    int x, y;
    time(t);
    printf("t: %s\n", ctime(t));
    for (y = 0; y < 8; y++) {
        for (x = 0; x < 8; x++)
            printf(" %02d", b[x][y]);
        printf("\n");
    }
    printf("\n");
    return 0;
}

int chk(int x, int y) {
    if (x < 0 || x > 7 || y < 0 || y > 7) return 0;
    return b[x][y] == 0;
}

int go(int k, int x, int y) {
    int i, j, no, x1, y1;
    b[x][y] = k;
    if (k == 64) {
        if (x != 2 && y != 0 && (abs(x - 2) + abs(y) == 3)) {
            board();
            N++;
            if (N == 1) exit(0);
        }
    } else {
        for (i = -2; i <= 2; i++)
            for (j = -2; j <= 2; j++) {
                if (abs(i) + abs(j) == 3 && chk(x + i, y + j))
                    go(k + 1, x + i, y + j);
            }
    }
    b[x][y] = 0;
    return 0;
}

int main() {
    int i;
    t = malloc(8);
    time(t);
    printf("t: %s\n", ctime(t));
    b = calloc(8, sizeof (int *));
    for (i = 0; i < 8; i++)
        b[i] = calloc(8, sizeof (int));
    go(1, 2, 0);
    // never gets this far
}
DangerMouseB changed the title from "Can't get different optimisation levels to work" to "Can't get different optimisation levels to work in c2m" on May 31, 2023
@vnmakarov
Owner

I tried the test. A difference exists, but it is small:

-O0: 23.73user 0.00system 0:23.76elapsed 99%CPU (0avgtext+0avgdata 3680maxresident)k
-O1: 17.90user 0.00system 0:17.92elapsed 99%CPU (0avgtext+0avgdata 3520maxresident)k
-O2: 19.79user 0.00system 0:19.81elapsed 99%CPU (0avgtext+0avgdata 3680maxresident)k
-O3: 19.86user 0.00system 0:19.88elapsed 99%CPU (0avgtext+0avgdata 3680maxresident)k

on bbv branch:

-O0: 19.41user 0.00system 0:19.44elapsed 99%CPU (0avgtext+0avgdata 3520maxresident)k
-O1: 16.63user 0.00system 0:16.65elapsed 99%CPU (0avgtext+0avgdata 3520maxresident)k
-O2: 17.68user 0.00system 0:17.70elapsed 99%CPU (0avgtext+0avgdata 3680maxresident)k
-O3: 17.70user 0.00system 0:17.71elapsed 99%CPU (0avgtext+0avgdata 3840maxresident)k

IMHO the test is call-bound, and calls are not cheap in MIR because they go through thunks. On x86-64 a thunk adds an extra pair of insns: movabs r,<address>; jmp *r.

-S gives the same output. '-O' currently affects only the MIR generator; you can only see the difference when -dg is used.

The optimization options are working as intended. For me, the interesting result is that -O1 gives the best time.

@DangerMouseB
Author

Thx so much for looking at that. I had large timing differences in clang (between -O0 and -O3) and didn't check in the detail you have, as I assumed it might be similar. Good to know where the optimisation is happening; apparently clang's LLVM output is post-optimisation, so I think I just assumed the same here. I'm on macOS aarch64 (M1), if that's at all relevant, and for what it's worth the MIR version is a little faster than my QBE version.

You say calls are expensive; I'm assuming you are a Ruby enthusiast? My target is a functional-style language: think Smalltalk with multiple dispatch (mainly static, but with unions it needs a bit of dynamic dispatch), so pretty much everything is a function call (i.e. what would be a method in Smalltalk), including for loops, if-then-else, etc. I have the inference, REPL, and multiple dispatch working (in Python), plus a noddy AST interpreter. I haven't really thought through how the code gen should work, but I am anticipating dispatch to C-ABI-compliant functions as the norm and using inlining in the front end quite a bit.

This is something for me to think more about I suspect.

@vnmakarov
Owner

> Thx so much for looking at that - I had large timing differences in clang (between O0 through to O3) and didn't check in the detail you have as I assumed it might be similar - good to know where the optimisation is happening - apparently clang's llvm output is post optimisation so I think I just assumed the same. I'm on macos aarch64 (M1) if that's at all relevant and for what it's worth the MIR version is a little faster than my QBE version.

I guess that Clang and GCC can inline the abs function (although it is a library function), which is called a lot in this program. That is why there is a large time difference for them.

MIR inlining is quite rudimentary and does not depend on the optimization level. There are a lot of ways to improve it, e.g. inlining in hot regions like loops. Currently MIR inlines at the first calls until some threshold is reached. Good inlining is quite complicated and requires a big infrastructure, which is absent in MIR.

I am thinking about implementing calls w/o thunks for some cases, but it requires considerable work. For now you can call a function directly by using _MIR_get_thunk_addr, which currently exists on the bbv branch.

> You say calls are expensive - I'm assuming you are a ruby enthusiast?

I'd not say that. I use Ruby very rarely myself, although I like some aspects of the language. But I found Ruby performance improvement to be the hardest problem I have seen, so I did some research work on improving Ruby performance over the last 6 years. It seems Shopify's YJIT is progressing fine. It uses a more pragmatic approach, starting with Ruby-specific machine code generation for one target, and now the Shopify team is introducing an IR to do optimizations. I started from the opposite direction, creating a general-purpose multi-target MIR JIT engine and trying to use it for CRuby. I am too late with this approach, and the man power I can invest into a Ruby JIT is tiny in comparison with the Shopify team's resources. So recently I decided not to work on Ruby anymore.

@DangerMouseB
Author

One of the interesting things for me about using an IR is the case where more structure can be expressed in a high level language that can be exploited in code gen. Haskell I believe does this sort of thing. A trivial example would be add(x,y) which could be inlined. A more interesting example might be map(filter(xs, f1),f2) (or in pipeline style xs filter f1 map f2) which can be expressed as a while / for loop. e.g. (in Python)

answer = []
for x in xs:
    if f1(x): 
        answer.append(f2(x))

Where the types are:
xs: array
f1: T1 -> bool
f2: T1 -> T2
answer: array

Thus you can imagine a composable language at the frontend but with an efficient looping implementation.

How hard would it be to translate non-strict SSA style into strict SSA style? That would make the frontend's job easier, and I imagine it might be generally applicable. This is one of the things I like about the QBE IR: it makes the distance between the higher-level frontend language and the IR shorter.

Some other problems I (personally) need to solve are:

  • "zero-cost" exceptions so the non-happy path can be handled generically at runtime (perfect for data-science and exploratory use when you want to run happy path efficiently and are comfortable with runtime type errors)
  • generics (for control structures to handle types opaquely)
  • JIT template instantiation (as and when you infer a type signature you don't have the code for yet)
  • debug info - ideally compatible with CLion / clang
  • memory management (arenas, tracing gc, optimised copy on write)
  • profiling

Some of those probably need support from the backend, and some can be done in a library.

When the user has profiling info, they can add inline hints (my convention is to suffix function names with _ to indicate laziness or suggest inlining). For example, the user can change add to add_, and the frontend can then inline the IR or indicate to the backend that this specific call should be inlined.

> I decide not to work on Ruby anymore.

I read that.

Btw do you have a debugger for the MIR interpreter?

@vnmakarov
Owner

> One of the interesting things for me about using an IR is the case where more structure can be expressed in a high level language that can be exploited in code gen. Haskell I believe does this sort of thing. A trivial example would be add(x,y) which could be inlined. A more interesting example might be map(filter(xs, f1),f2) (or in pipeline style xs filter f1 map f2) which can be expressed as a while / for loop. e.g. (in Python)
>
>     answer = []
>     for x in xs:
>         if f1(x):
>             answer.append(f2(x))
>
> Where the types are: xs: array f1: T1 -> bool f2: T1 -> T2 answer: array
>
> Thus you can image a composable language at the frontend but with an efficient looping implementation.
>
> How hard would it be to translate non-strict SSA style into strict SSA style - this would make the frontend's job easier and I imagine might be generally applicable. This is one of the things I like about the QBE IR as it makes the distance between the higher level front end language and the IR shorter.

Besides using existing JIT compilers, there is a lot of work to do to implement a good-performance JIT. The more dynamic the language, the more work there is to do. Inlining higher-order functions could be considered such work too. Unfortunately, it is not easy to implement a general-purpose JIT supporting most of these features.

QBE is interesting, but according to my benchmarks it is slower and generates slower code (about 20% slower than MIR). You can find my measurements in one of my blog posts: https://developers.redhat.com/blog/2021/04/27/the-mir-c-interpreter-and-just-in-time-jit-compiler#how_the_mir_c_compiler_compares_with_other_c_compilers. I used cproc for this, which is based on QBE.

Still, the bigger the range of the simplicity/generated-code-performance spectrum that existing JITs cover, the better it is for JIT developers.

> Some other problems I (personally) need to solve are:
>
> * "zero-cost" exceptions so the non-happy path can be handled generically at runtime (perfect for data-science and exploratory use when you want to run happy path efficiently and are comfortable with runtime type errors)
> * generics (for control structures to handle types opaquely)
> * JIT template instantiation (as and when you infer a type signature you don't have the code for yet)
> * debug info - ideally compatible with CLion / clang
> * memory management (arenas, tracing gc, optimised copy on write)
> * profiling
>
> Some of those probably need support from the backend, and some can be done in a library.
>
> When the user has profiling info they can add inline hints (my convention is to suffix function names with _ to indicate laziness or suggest inlining) etc. For example the user can change add to add_ and the frontend can inline the IR or indicate to the backend that this specific call should be inlined.
>
> Btw do you have a debugger for the MIR interpreter?

It is possible to switch on interpreter insn tracing by using the macro MIR_INTERP_TRACE. I have used it myself several times for MIR debugging.

I am finally thinking of starting work on debugging support for c2m and MIR in standard debuggers like gdb and lldb, but it will definitely not be implemented for the next release. Without this it is hard to develop and use MIR, even for me.
