Simulation hangs for longer running functions using the vector extension #2890

Open
camel-cdr opened this issue Apr 16, 2024 · 5 comments

@camel-cdr

RVV support was recently merged into the master branch, and I tried running a few of my benchmarks on it, but ran into problems. Only very basic RVV functions worked; the others seem to silently hang the simulation.

For the following I've modified the $AM_HOME/apps/hello example code, and added asm.S to SRCS in the Makefile.
I've attached my entire reproducible docker setup at the end of the issue.

Here are two of the programs that hang the simulation indefinitely:

// asm.S
.text
.balign 8
.global ascii_to_utf16
ascii_to_utf16:
1:
	vsetvli t0, a2, e8, m1, ta, ma
	vle8.v v0, (a1)
	vsetvli x0, x0, e16, mf2, ta, ma
	vzext.vf2 v8, v0
	vse16.v v8, (a0)
	add a1, a1, t0
	sub a2, a2, t0
	slli t0, t0, 1
	add a0, a0, t0
	bnez a2, 1b
	ret
// hello.c
#include <klib.h>
size_t ascii_to_utf16(uint16_t *dst, uint8_t *src, size_t n);
int main(void) {
	static uint8_t src[100] = {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0};
	static uint16_t dst[sizeof src]={};
	printf("beg\n");
	ascii_to_utf16(dst, src, sizeof src);
	printf("end\n");
	return 0;
}
# asm.S
.text
.balign 8
.global LUT4
LUT4:
	li t0, 16
	vsetvli zero, t0, e8, m1, ta, ma
	vle8.v v0, (a0)
1:
	vsetvli a0, a2, e8, m1, ta, ma
	vle8.v v8, (a1)
	vand.vi v8, v8, 15
	vrgather.vv v16, v0, v8
	vse8.v v16, (a1)
	sub a2, a2, a0
	add a1, a1, a0
	bnez a2, 1b
	ret
// hello.c
#include <klib.h>
size_t LUT4(uint8_t lut[16], uint8_t *ptr, size_t n);
int main(void) {
	static uint8_t mem[100];
	static uint8_t lut[16] = { 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6 };
	printf("beg\n");
	LUT4((uint8_t *)lut, mem, sizeof mem);
	printf("end\n");
	return 0;
}
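
For checking results once the routines do complete, the following is a minimal scalar sketch of what the two vector loops above appear to compute, read off the assembly. The `_ref` function names are hypothetical, and the LUT4 reference assumes VLEN is at least 128 bits so that all 16 table bytes fit in `v0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for ascii_to_utf16 (hypothetical helper name):
 * zero-extend each source byte to a 16-bit code unit. */
static void ascii_to_utf16_ref(uint16_t *dst, const uint8_t *src, size_t n) {
	assert(dst != NULL && src != NULL);
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Scalar reference for LUT4 (hypothetical helper name):
 * replace each byte in place by lut[byte & 15], mirroring
 * vand.vi followed by vrgather.vv on a 16-entry table. */
static void LUT4_ref(const uint8_t lut[16], uint8_t *ptr, size_t n) {
	assert(lut != NULL && ptr != NULL);
	for (size_t i = 0; i < n; i++)
		ptr[i] = lut[ptr[i] & 15];
}
```

Running the reference on a copy of the input and comparing it with the output of the vector routine separates "hangs" from "completes with wrong data".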

The problems only seem to occur with larger iteration counts, e.g. the ascii_to_utf16 code works fine when processing 80 instead of 100 elements. This might indicate a problem with a scheduler or an internal buffer filling up.
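
To make the 80 versus 100 comparison concrete, here is a small sketch that models how many passes a strip-mined loop makes. It assumes VLEN is 128 bits (an assumption, not confirmed in this thread) and that `vsetvli` always returns `min(remaining, VLMAX)`, which the specification permits but does not require when the remaining count lies between VLMAX and 2*VLMAX. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* VLMAX = VLEN * LMUL / SEW, with LMUL given as the fraction lmul_num/lmul_den. */
static size_t vlmax(size_t vlen_bits, size_t sew_bits,
                    size_t lmul_num, size_t lmul_den) {
	assert(sew_bits > 0 && lmul_den > 0);
	return vlen_bits * lmul_num / (sew_bits * lmul_den);
}

/* Number of loop passes over n elements, assuming every vsetvli
 * returns min(remaining, VLMAX). This is a simplified model. */
static size_t stripmine_iterations(size_t n, size_t vlmax_elems) {
	assert(vlmax_elems > 0);
	size_t iters = 0;
	while (n > 0) {
		size_t vl = n < vlmax_elems ? n : vlmax_elems;
		n -= vl;
		iters++;
	}
	return iters;
}

/* Size of the final pass; equals vlmax_elems when there is no partial tail. */
static size_t last_pass_vl(size_t n, size_t vlmax_elems) {
	assert(n > 0 && vlmax_elems > 0);
	size_t rem = n % vlmax_elems;
	return rem == 0 ? vlmax_elems : rem;
}
```

Under these assumptions, e8/m1 gives VLMAX = 16, so 80 elements take 5 full passes while 100 elements take 7 passes, the last with vl = 4. The two cases therefore differ not only in length but also in whether a partial final pass occurs, which may be worth testing separately (for example with 96 or 84 elements).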

Since I also ran into problems on other implementations, I've got a quick instruction testing script that executes random instructions. However, the ~50 trials of short random instruction streams I've tested didn't run into any problems.
That's good and points towards this being a single problem that seems to occur only with longer runs.

Environment Reproduction

I've used the following Dockerfile to build the repository on top of the latest commit to master.
It was run when 0c00289 was the latest commit. Since then there has been only a single new commit, which doesn't look like it would fix the problem, since it's a tiny adjustment to the LSU.

FROM ubuntu:22.04
RUN apt-get update && apt-get install -y build-essential clang libclang-dev llvm-dev cmake libspdlog-dev vim git libmlpack-dev curl wget time default-jre default-jdk
RUN git clone --recursive https://github.com/OpenXiangShan/xs-env

WORKDIR /xs-env
RUN sed 's/apt\S* install/\0 -y/g;s/source /. /g;s/sudo //g' -i ./*.sh && echo 1
RUN . ./env.sh && sed 's/$/; cd \/xs-env/g' -i ./update-submodule.sh && ./update-submodule.sh
RUN . ./env.sh && ./setup-tools.sh
RUN . ./env.sh && . ./install-verilator.sh
RUN . ./env.sh && sed 's/^git submodule.*$//g;s/env.*$//g' -i ./setup.sh && . ./setup.sh

RUN . ./env.sh && make -C XiangShan init
RUN . ./env.sh && make -C XiangShan emu CONFIG=DefaultConfig MFC=1 -j 8
RUN . ./env.sh && sed 's/unknown-//g;s/rv64gc/rv64gcv/g' -i $AM_HOME/am/arch/isa/riscv64.mk
# Once in the docker environment, I used the following to build and simulate the programs:
# source env.sh
# cd $AM_HOME/apps/hello
# make ARCH=riscv64-xs
# $NOOP_HOME/build/emu --no-diff -i ./build/hello-riscv64-xs.bin 2>/dev/null

PS: I've also run into problems with rdcycle not working properly with vector instructions: a loop with 10x more iterations took fewer cycles than one with fewer iterations. Is rdcycle supposed to work with vector instructions in the current implementation? I'll have to investigate this further and share reproducible code.
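
A measurement of this kind could be structured roughly as in the sketch below. It is not the code used for the observation above. The helper names and the `spin` workload are hypothetical stand-ins, and the non-RISC-V fallback is only a software counter so that the sketch compiles on other hosts; it measures nothing there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the cycle counter. On RV64 this uses the rdcycle pseudo-instruction;
 * elsewhere it falls back to a plain software counter (placeholder only). */
static inline uint64_t read_cycles(void) {
#if defined(__riscv) && __riscv_xlen == 64
	uint64_t c;
	__asm__ volatile ("rdcycle %0" : "=r"(c) : : "memory");
	return c;
#else
	static uint64_t fake;
	return ++fake;
#endif
}

/* Hypothetical stand-in workload: a busy loop of *(size_t *)arg iterations.
 * In a real test this would call the vector routine under test. */
static void spin(void *arg) {
	volatile size_t sink = 0;
	for (size_t i = 0; i < *(size_t *)arg; i++)
		sink += i;
}

/* Cycles elapsed across one call of fn(arg). */
static uint64_t time_call(void (*fn)(void *), void *arg) {
	assert(fn != NULL);
	uint64_t beg = read_cycles();
	fn(arg);
	uint64_t end = read_cycles();
	return end - beg;
}
```

Calling `time_call` once with n iterations and once with 10n, then printing both values, gives a direct check of whether the larger run reports fewer cycles.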

@camel-cdr camel-cdr added the bug report Bugs to be confirmed label Apr 16, 2024
@camel-cdr
Author

camel-cdr commented Apr 17, 2024

Update: I tried running it on a few other branches:
adc944d tmp-backend-fixtiming-merge-master: same problem
824af1e vlsu-240315: same problem but worse; even a lower iteration count froze the simulation.

I also ran the current master with the MinimalConfig, instead of DefaultConfig.
This caused ascii_to_utf16 to run flawlessly, but the LUT4 program hit an assertion instead of stalling:

# LUT4 error:
Assertion failed at line 170883.
The simulation stopped. There might be some assertion failed.
Core 0: ABORT at pc = 0x80001248
instrCnt = 278, cycleCnt = 3643, IPC = 0.076311
Seed=0 Guest cycle spent: 3646 (this will be different from cycleCnt if emu loads a snapshot)
Host time spent: 8451ms
[ERROR][time=                3645] TOP.SimTop.l_soc.core_with_l2.core.frontend.ftq:
commit cfi can be non c_commited
Assertion failed
    at LogUtils.scala:54 assert(false.B)

Other functions still seem to stall though, e.g.:

# asm.S
.text
.balign 8
 # generated by clang, see: https://github.com/camel-cdr/rvv-bench/blob/main/bench/mandelbrot.S
.global mandelbrot_rvv
mandelbrot_rvv:
	beqz a0, rvv_13
	beqz a1, rvv_9
	li a7, 0
	fcvt.s.wu fa5, a0
	lui a3, 262144
	fmv.w.x fa4, a3
	fdiv.s fa5, fa4, fa5
	lui a3, 785408
	fmv.w.x fa4, a3
	lui a3, 784384
	fmv.w.x fa3, a3
	lui a3, 264192
	fmv.w.x fa2, a3
	slli a6, a0, 2
	j rvv_4
rvv_3:
	addi a7, a7, 1
	add a2, a2, a6
	beq a7, a0, rvv_13
rvv_4:
	fcvt.s.wu fa1, a7
	mv t0, a0
	j rvv_6
rvv_5:
	slli a3, t0, 2
	add a3, a3, a2
	vsetvli zero, zero, e32, m1, ta, ma
	vse32.v v8, (a3)
	beqz t0, rvv_3
rvv_6:
	vsetvli t1, t0, e32, m1, ta, ma
	sub t0, t0, t1
	vmset.m v0
	vmv.v.i v8, 0
	viota.m v10, v0
	vadd.vx v10, v10, t0
	vfcvt.f.xu.v v10, v10
	vfmv.v.f v12, fa1
	vfmul.vf v10, v10, fa5
	vfadd.vf v10, v10, fa4
	vfmul.vf v12, v12, fa5
	vfadd.vf v12, v12, fa3
	vmv.v.i v18, 0
	li a3, 1
	mv a5, a1
	vmv.v.i v14, 0
	vmv.v.i v16, 0
	vmv.v.i v20, 0
rvv_7:
	vsetvli zero, t1, e8, mf4, ta, ma
	vfirst.m a4, v0
	bltz a4, rvv_5
	vsetvli zero, zero, e32, m1, ta, ma
	vfadd.vv v22, v16, v20
	vmflt.vf v0, v22, fa2
	vfsub.vv v16, v16, v20
	vfadd.vv v18, v18, v18
	vfadd.vv v22, v16, v10
	vfmadd.vv v14, v18, v12
	vfmul.vv v16, v22, v22
	vfmul.vv v20, v14, v14
	vmerge.vxm v8, v8, a3, v0
	addi a5, a5, -1
	addi a3, a3, 1
	vmv.v.v v18, v22
	bnez a5, rvv_7
	j rvv_5
rvv_9:
	slli a3, a0, 2
rvv_10:
	mv a4, a0
rvv_11:
	vsetvli a5, a4, e32, m1, ta, ma
	sub a4, a4, a5
	vmv.v.i v8, 0
	slli a5, a4, 2
	add a5, a5, a2
	vse32.v v8, (a5)
	bnez a4, rvv_11
	addi a1, a1, 1
	add a2, a2, a3
	bne a1, a0, rvv_10
rvv_13:
	ret
// hello.c
#include <klib.h>
void mandelbrot_rvv(size_t width, size_t maxIter, uint32_t *res);
int main(void) {
	#define W 10
	static uint32_t img[W*W] = {0};
	printf("beg\n");
	mandelbrot_rvv(W, 20, img);
	printf("end\n");
	return 0;
}

Update:

Retested on newer branches:

7fd388c: all problems persist

78c76c7: all problems persist

7390003: all problems persist

@huxuan0307
Contributor

Thank you for your bug report, we are handling this.

@Tang-Haojin
Member

The vector extension is still work-in-progress. It may be more stable after Apr. 30.

@camel-cdr
Author

camel-cdr commented May 1, 2024

I just tried running it on the development branches. It behaved the same on fp-split and new-csr, but the mandelbrot and LUT4 code snippets completed successfully on the vlsu-240315 branch using the MinimalConfig, even when increasing the iteration count. ascii_to_utf16, however, still hangs on that branch. It does complete when I remove the vzext.vf2 v8, v0 instruction, so that might be the cause of this bug.

I'll now try it again on DefaultConfig, and update this comment once it's done building and I've run the tests.

Update: With DefaultConfig, the LUT4 and ascii_to_utf16 code still hangs on the vlsu-240315 branch, but mandelbrot works fine even with larger inputs.

Edit: Just tried the vlsu-merge-master-0504, which from what I can tell merges the vlsu-240315 branch with master, and the problems are back. Sounds like it was introduced between those commits.
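
One detail that may be relevant to the vzext.vf2 observation: in the ascii_to_utf16 listing, the instruction is preceded by `vsetvli x0, x0, e16, mf2, ta, ma`. The RVV 1.0 specification reserves the `x0, x0` form when the new SEW/LMUL ratio would change VLMAX, and allows implementations to set `vill` in that case. Going from e8/m1 to e16/mf2 changes the ratio from 8 to 32, whereas e16/m2 would keep it at 8. Whether this contributes to the hang is not established here, and LUT4 does not use this form at all. The sketch below only checks the ratio; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A vtype setting reduced to SEW and LMUL, with LMUL as lmul_num/lmul_den
 * (m1 = 1/1, m2 = 2/1, mf2 = 1/2). Hypothetical type for illustration. */
struct vtype {
	unsigned sew_bits;
	unsigned lmul_num;
	unsigned lmul_den;
};

/* True when two settings share the same SEW/LMUL ratio, which is the
 * condition under which `vsetvli x0, x0, ...` leaves VLMAX unchanged.
 * SEW/LMUL = sew * den / num, compared by cross-multiplication. */
static bool same_sew_lmul_ratio(struct vtype a, struct vtype b) {
	assert(a.lmul_num > 0 && b.lmul_num > 0);
	unsigned long lhs = (unsigned long)a.sew_bits * a.lmul_den * b.lmul_num;
	unsigned long rhs = (unsigned long)b.sew_bits * b.lmul_den * a.lmul_num;
	return lhs == rhs;
}
```

With `{8, 1, 1}` for e8/m1, the check returns false against `{16, 1, 2}` (e16/mf2) and true against `{16, 2, 1}` (e16/m2).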

@Anzooooo
Member

Anzooooo commented May 8, 2024

Thank you very much for your attention to the development of XiangShan, and sorry for not replying sooner.
At present, the vector extension of XiangShan is still under development, and support for segment instructions is not yet complete.
Due to limited time and manpower, some problems remain for the time being. We will run the rvv-bench tests in the future.
