[On-prem U250] Host reboots after firesim runworkload. 1.18.0 regression? #1697

Open
3 tasks done
caizixian opened this issue Mar 7, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@caizixian
Contributor

Background Work

FireSim Version and Hash

70ac61491c4531b935cb1964d09b660798ffb4d5

OS Setup

Linux alveo 5.15.0-92-generic #102~20.04.1-Ubuntu SMP Mon Jan 15 13:09:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal

Other Setup

I followed the XDMA-based U250 documentation. https://docs.fires.im/en/1.18.0/Getting-Started-Guides/On-Premises-FPGA-Getting-Started/Xilinx-Alveo-U250-FPGAs.html

Prior to following the steps in the documentation, I reverted the FPGA to golden. https://support.xilinx.com/s/article/71757?language=en_US

Current Behavior

Host reboots after the following output.

$ sudo ./FireSim-xilinx_alveo_u250 +permissive   +macaddr0=00:12:6D:00:00:02 +blkdev0=linux-uniform0-br-base.img +niclog0=niclog0 +blkdev-log0=blkdev-log0  +trace-select=1 +trace-start=0 +trace-end=-1 +trace-output-format=0 +dwarf-file-name=linux-uniform0-br-base-bin-dwarf +autocounter-readrate=0 +autocounter-filename-base=AUTOCOUNTERFILE  +print-start=0 +print-end=-1 +linklatency0=6405 +netbw0=200 +shmemportname0=default  +domain=0x0000 +bus=0x01 +device=0x00 +function=0x0 +bar=0x0 +pci-vendor=0x10ee +pci-device=0x903f +permissive-off +prog0=linux-uniform0-br-base-bin
Using: 0000:01:00.0, BAR ID: 0, PCI Vendor ID: 0x10ee, PCI Device ID: 0x903f
Opening /sys/bus/pci/devices/0000:01:00.0/vendor
Opening /sys/bus/pci/devices/0000:01:00.0/device
examining xdma/.
examining xdma/..
examining xdma/xdma0_h2c_0
Using xdma write queue: /dev/xdma0_h2c_0
Using xdma read queue: /dev/xdma0_c2h_0
UART0 is here (stdin/stdout).
TraceRV 0: Tracing disabled, since +tracefile was not provided.
command line for program 0. argc=26:
+permissive +macaddr0=00:12:6D:00:00:02 +blkdev0=linux-uniform0-br-base.img +niclog0=niclog0 +blkdev-log0=blkdev-log0 +trace-select=1 +trace-start=0 +trace-end=-1 +trace-output-format=0 +dwarf-file-name=linux-uniform0-br-base-bin-dwarf +autocounter-readrate=0 +autocounter-filename-base=AUTOCOUNTERFILE +print-start=0 +print-end=-1 +linklatency0=6405 +netbw0=200 +shmemportname0=default +domain=0x0000 +bus=0x01 +device=0x00 +function=0x0 +bar=0x0 +pci-vendor=0x10ee +pci-device=0x903f +permissive-off linux-uniform0-br-base-bin
FireSim fingerprint: 0x46697265
TracerV: Trigger enabled from 0 to 18446744073709551615 cycles
Commencing simulation.

The reboot seems to be a hard reset, and there's no useful kernel log/syslog.
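Since the driver log above shows it reading `vendor` and `device` from sysfs for `0000:01:00.0`, one low-risk check before rerunning is to confirm that the card still reports the IDs passed via `+pci-vendor` and `+pci-device`. The following is only a minimal sketch: `check_pci_id` is a hypothetical helper, not part of FireSim, and it only compares the two sysfs files against the expected values. It does not explain or prevent the reboot.

```shell
#!/usr/bin/env bash
# Sketch: compare the PCI IDs the kernel reports with the IDs given to the driver.
# check_pci_id is a hypothetical helper (not a FireSim command).
check_pci_id() {
    local dev_dir="$1" want_vendor="$2" want_device="$3"
    local vendor device
    # sysfs exposes each ID as a one-line hex string such as 0x10ee
    vendor="$(cat "$dev_dir/vendor")" || return 2
    device="$(cat "$dev_dir/device")" || return 2
    if [ "$vendor" = "$want_vendor" ] && [ "$device" = "$want_device" ]; then
        echo "match: $vendor:$device"
    else
        echo "mismatch: found $vendor:$device, expected $want_vendor:$want_device"
        return 1
    fi
}

# Example call, using the values from the log above:
# check_pci_id /sys/bus/pci/devices/0000:01:00.0 0x10ee 0x903f
```

A mismatch would suggest the FPGA is not running the image the driver expects. A match rules out only that one possibility.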

Expected Behavior

Boots Linux

Other Information

No response

@caizixian caizixian added the bug Something isn't working label Mar 7, 2024
@caizixian
Contributor Author

1.17.1 works fine.

@caizixian changed the title from "[On-prem U250] Host reboots after firesim runworkload" to "[On-prem U250] Host reboots after firesim runworkload. 1.18.0 regression?" on Mar 7, 2024
@caizixian
Contributor Author

Possibly related to #1692

@RealJustinNi

Solved like #1695

@caizixian
Contributor Author

@RealJustinNi thanks for the link.

In my case, I'm following the getting started guide and didn't elaborate any design myself. The bitstream I flashed, alveo_u250_firesim_rocket_singlecore_no_nic, was downloaded by FireSim (1.18.0). The memory configuration file is from the same tarball, so I didn't think reprogramming the memory device was necessary.

Regardless of the above, this still seems to be a regression, since the same steps work fine on a fresh 1.17.1 checkout.

@RealJustinNi

RealJustinNi commented Mar 10, 2024

@caizixian Hello, we ran into a similar issue about a week ago when running FireSim 1.18.0 and 1.17.1 workloads on the U200 platform. Our bitstream is alveo_u200_firesim_rocket_singlecore_no_nic. Interestingly, under 1.17.1, both firesim infrasetup and firesim runworkload run correctly, and Linux boots without any problems. Under 1.18.0, however, firesim infrasetup succeeds but firesim runworkload then freezes the host.

We ultimately resolved the issue by rebuilding the bitstream and then re-programming the FPGA. After looking at the other issue you mentioned, we found that it was indeed an issue with the version pointer. Thank you very much!
