
Bugs in the provided examples #362

Open
KC-Kevin opened this issue Jun 7, 2023 · 32 comments


KC-Kevin commented Jun 7, 2023

Dear developers,

I have run into some bugs/issues when running the example designs/code with SVM support (arrayupdate, arraysum and arrayinit) on the main branch. Currently, I am looking at SVM for a single FPGA on the main branch (but I do intend to try SVM for multiple FPGAs on the develop branch as well, so I am also wondering about the timeline for merging it into main). Basically, the issue is that these three programs can run the first iteration, but are not able to run the second iteration of the example code and just get stuck somewhere.

For example, for the array sum, here is the output when I run with sudo:

Using PEId 10.
Golden output for run 0: 32640
FPGA output for run 0: 32640

RUN 0 OK
Golden output for run 1: 32640

The first iteration runs, but the program gets stuck in the second iteration. Based on simple printf debugging of arraysum_example.c, it seems to be stuck here:

if (tapasco_job_release(j, &r, true) < 0) {
  handle_error();
  ret = -1;
  goto finish_device;
}

If I change the iteration-count macro of the arraysum program to 1, it works fine:

Using PEId 10.
Golden output for run 0: 32640
FPGA output for run 0: 32640

RUN 0 OK
SUCCESS

So, I am wondering what is wrong with the system or the build process that makes the second run stall.

Here is the build process:

# set up Vivado
sudo apt-get -y install unzip git zip findutils curl default-jdk
sudo apt-get -y install build-essential linux-headers-generic python3 cmake libelf-dev git rpm
git clone https://github.com/esa-tu-darmstadt/tapasco.git
mkdir workspace
cd workspace
../tapasco-init.sh
source tapasco-setup.sh
../toolflow/bin/tapasco-build-toolflow
# the following two lines use arrayupdate; also tried with arrayinit and arraysum
# run the HLS example design:
tapasco --kernelDir ../toolflow/examples/kernel-examples/arrayupdate/ hls arrayupdate -p AU250
tapasco compose [arrayupdate x 1] @ 200 MHz -p AU250 --features 'SVM {enabled: true}'
../bin/runtime/tapasco-load-bitstream compose/axi4mm/AU280/arrayinit/001/100.0/axi4mm-AU280--arrayinit_1--100.0.bit
../bin/runtime/tapasco-build-libs --enable_svm
# builds the runtime host interface; I also tried building the individual examples (arrayupdate/arraysum/arrayinit), but they failed as well.
sudo ./build/example/C/arraysum/arraysum

System setup:
Code base: latest main branch (d7768b3)
OS: Ubuntu 20.04.6
FPGA: U250
Vivado version: 2022.1

Thanks!


tsmk94 commented Jun 7, 2023

Hello @KC-Kevin,

I'm sorry to hear you are having issues setting up TaPaSCo and SVM. Unfortunately, you ran into several different issues with our main branch. It is a bit outdated, so we are now planning to merge develop into the main branch by the end of June.

Your primary issue is related to Vitis HLS, which changed the control register layout. In #345, we fixed this so that it also runs with recent Vitis/Vivado versions.

The second issue is that, currently, we do not officially support SVM on the U250. It is admittedly bad behavior that bitstream generation probably does not fail, but simply does not include the extension. We fixed that in the develop branch as well and now print a corresponding error message.
However, porting the SVM extension from the U280 to the U250 should be as easy as copying the svm.tcl file from toolflow/vivado/platform/AU280/plugins to the corresponding plugin folder of the AU250. That said, we cannot test it on the U250 ourselves.

Which Linux kernel version are you running? If you are running a kernel newer than 5.16, you must also use the develop branch until our release, since a fix was required there due to changes in the Linux kernel.

So my suggestion would be to try again with our develop branch until our release lands on the main branch. I will leave this issue open so that I can support you if you have further issues or questions.

Best regards


KC-Kevin commented Jun 8, 2023

Hi,

Thank you for the support!

I tried the develop branch (at commit 1f3e6d1). I created a plugins folder in toolflow/vivado/platform/AU250 and copied svm.tcl into the AU250/plugins folder.

The OS kernel version is GNU/Linux 5.15.0-69-generic x86_64. I also checked that CONFIG_DEVICE_PRIVATE=y, so the OS setup satisfies the requirements of SVM.

However, when I load the kernel module, the terminal hangs, and here is the output from dmesg:

[  978.586853] tapasco device #00 [pcie_device_init_subsystems]: claiming MSI-X interrupts ...
[  978.591616] tapasco device #00 [claim_msi]: got 132 MSI vectors
[  978.591622] tapasco device #00 [pcie_device_init_subsystems]: initializing SVM
[  978.592041] tapasco device #00 [request_device_pages]: request device private page resources
[  978.592823] jump_label: Fatal kernel bug, unexpected op at 0xffffffffc0bca6b2 [0000000063bc9bc5] (e9 f6 05 00 00 != 0f 1f 44 00 00)) size:5 type:1
[  978.592953] ------------[ cut here ]------------
[  978.592955] kernel BUG at arch/x86/kernel/jump_label.c:73!
[  978.593002] invalid opcode: 0000 [#1] SMP PTI
[  978.593040] CPU: 54 PID: 997 Comm: kworker/54:4 Tainted: P           OE     5.15.0-69-generic #76~20.04.1-Ubuntu

One thing I noticed is the following error in the terminal when the bitstream is programmed:
kernel:[ 273.231954] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.

The build process is attached at the end in case it is helpful.

Could this issue be related to the U250 board, or is it something else? If it is because of the U250 board, I may consider switching to U280 boards.

Some other follow-up questions:
(1) What is the order of generating the bitstream (tapasco compose), building the library (tapasco-build-libs) and building/switching a new/individual example (e.g. make inside arrayupdate)? The tutorial linked from the README and the README itself seem to conflict here. I am also confused about whether some operations need to be repeated when a new design is programmed onto the FPGAs, and I am not sure whether this can cause issues.

My understanding is that the following order should work (with a minimal build process/less repetition):

  1. use ../toolflow/bin/tapasco-build-toolflow to build the toolflow (only needs to be done once)
  2. use ../runtime/bin/tapasco-build-libs --enable_svm to build the kernel module and runtime (only needs to be done once)
  3. generate the bitstream using tapasco compose (done once per design, or whenever the hardware design changes)
  4. unload the tlkm.ko driver, load the bitstream, hot-reset the PCIe device, build the tlkm driver, and reload it (done whenever a new design is loaded)
  5. build the software (e.g. runtime/example/C/arrayupdate) that interacts with the hardware design and run it.

Is step 3 a standalone/self-contained step? Our system setup is that a machine B is dedicated to bitstream generation and can program the bitstream onto the FPGA through JTAG. The FPGAs are inserted into machine A's PCIe slots, and machine A runs the actual program. The two machines share the same folder over a network file system. So, is it possible to run everything on machine A except bitstream generation/programming, which happens on machine B?

(2) If I want to test SVM with multiple FPGAs, should I also copy the sfpplus*.tcl files into the plugins folder?
(3) Based on the SVM README, to enable SVM for multiple FPGAs it seems we only need to enable SVM or PCIe during bitstream generation. Will tapasco-load-bitstream program two FPGAs at the same time for multi-FPGA SVM?

Thanks again!

Here is the build command I use:

mkdir workspace && cd workspace
# build the toolflow
../tapasco-init.sh
source tapasco-setup.sh
../toolflow/bin/tapasco-build-toolflow
# build the design from HLS, similar for arrayupdate and array init
tapasco --kernelDir ../toolflow/examples/kernel-examples/arraysum/ hls arraysum -p AU250
tapasco compose [arrayupdate x 1] @ 200 MHz -p AU250 --features 'SVM {enabled: true}'  
# load the bitstream 
../runtime/bin/tapasco-load-bitstream  compose/axi4mm/AU250/arrayinit/001/200.0+SVM/axi4mm-AU250--arrayinit_1--200.0.bit
# compile the library with SVM enabled
sudo ../runtime/bin/tapasco-build-libs --enable_svm
sudo insmod ./build/tlkm/tlkm.ko


tsmk94 commented Jun 9, 2023

Let me first answer your follow-up questions:

(1) TaPaSCo consists of two more or less independent parts, the toolflow for creating bitstreams and the runtime to write and execute corresponding software. You can even use different workspaces on distinct machines for this.

So it is no problem to build bitstreams on machine B in your setup. On this machine you only need to do steps 1 (run tapasco-build-toolflow) and 3 (tapasco compose).

Steps 2, 4 and 5 then need to be done on machine A in your setup. There you need to run tapasco-build-libs --enable_svm once, which also builds the example software and puts it on your path. If you have your own software project, your CMakeLists.txt could look similar to this:

include($ENV{TAPASCO_HOME_RUNTIME}/cmake/Tapasco.cmake NO_POLICY_SCOPE)
project(my-project CXX)
find_package(Tapasco REQUIRED)
find_package(Threads)
add_executable(my-project main.cpp)
set_tapasco_defaults(my-project)
target_link_libraries(my-project tapasco ${CMAKE_THREAD_LIBS_INIT})

tapasco-load-bitstream will then do everything you described in step 4 at once. Use the --reload-driver option if you want to unload the driver during bitstream programming. You can also program the bitstream separately using Vivado from machine B and then run tapasco-load-bitstream placeholder.bit --mode hotplug, which does hotplugging only. If you use multi-FPGA setups this is currently the way to go as tapasco-load-bitstream is intended for single-FPGA setups. However, hotplugging works for all attached FPGAs.

(2) No, svm.tcl is sufficient as it is completely self-contained.

(3) See point 1.


Now on your issues:

I have not encountered these two errors yet, and I am also not sure whether they are somehow connected. The complete stack trace would be interesting here, to see where exactly this kernel bug is triggered and whether it is inside our kernel module at all.

What I see in your build commands is that you compile the runtime library with sudo, which is not required, and that you loaded the driver separately without using our provided script. Maybe this causes problems, since hotplugging may not be done correctly then.

As a side note: After source tapasco-setup.sh all required commands should be directly on your path.

I would suggest you try once again with the optimal toolflow I sum up in the following. If this does not solve the issues, I would ask you to provide more dmesg output so I can try to debug it further. Here is the summed-up toolflow:

On machine B in your setup build the bitstream:

mkdir workspace && cd workspace
../tapasco-init.sh
source tapasco-setup.sh
tapasco-build-toolflow
# build the design from HLS, similar for arrayupdate and array init
tapasco hls arraysum -p AU250

# use --deleteProjects false if you want to keep the Vivado project
tapasco compose [arrayupdate x 1] @ 200 MHz -p AU250 --features 'SVM {enabled: true}'
# optionally program FPGA from here using Vivado

On machine A build the toolflow, load the driver and run the software (distinct workspace possible):

mkdir workspace && cd workspace
../tapasco-init.sh
source tapasco-setup.sh
# use --mode driver_debug for extended output in dmesg
tapasco-build-libs --enable_svm
tapasco-load-bitstream axi4mm-AU250--arrayinit_1--200.0.bit --reload-driver
# or if bitstream is already loaded
tapasco-load-bitstream placeholder.bit --reload-driver --mode hotplug

# export RUST_LOG=info (or even export RUST_LOG=trace) for more debug information
svm-example

Have a nice weekend!

KC-Kevin (Author) commented:

Hi,

Thanks for the detailed reply!

I followed the procedure you suggested. I now find that I do not need sudo and can use the binary names directly (e.g. tapasco-build-libs instead of ../runtime/bin/tapasco-build-libs). I created a clean copy of the repo (develop branch at commit 1f3e6d1) and re-generated the bitstream.

However, I encountered the same issue; the full dmesg logs are attached at the end.
They were captured:
(1) before programming the FPGA
(2) after programming the FPGA
(3) after doing the hot reset and re-loading the driver

Here is the terminal output when loading the driver/doing the hot reset:

Message from syslogd@dpe-1 at Jun  9 14:01:54 ...
kernel:[ 1181.735650] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.

Here is the command I use:

On machine B:

git clone https://github.com/esa-tu-darmstadt/tapasco.git ./tapasco-retry
git checkout develop
#created plugins folder in toolflow/vivado/platform/AU250 and copy the svm.tcl into the AU250/plugins folder
mkdir workspace && cd workspace
../tapasco-init.sh
source tapasco-setup.sh
tapasco-build-toolflow
tapasco hls arraysum -p AU250
tapasco compose [arraysum x 1] @ 150 MHz -p AU250 --features 'SVM {enabled: true}'
# get the dmesg output before programming the FPGA (dmesg_before_programm_fpga.log file below)
# program the bitstream in the Vivado hardware manager
# get the dmesg output after programming the FPGA (dmesg_after_program_but_before_load_driver.log file below)

On machine A:

mkdir workspace_runtime && cd workspace_runtime
../tapasco-init.sh
source tapasco-setup.sh
tapasco-build-libs --enable_svm --mode driver_debug 
# the command below is used purely to hot-reset and reload the driver, but it gets stuck
tapasco-load-bitstream bitstream_programmed_before.bit --reload-driver --mode hotplug
# the above command hangs and cannot proceed
# get the dmesg again (dmesg_after_program_and_after_load_driver.log file below)
# the dmesg indicates the kernel error

Since you said bitstream generation and the runtime are relatively independent, I ran the commands on machines B and A concurrently, except that I first programmed the bitstream from Vivado on machine B and then ran tapasco-load-bitstream bitstream_programmed_before.bit --reload-driver --mode hotplug.

Here is the full dmesg output:
dmesg_after_program_and_after_load_driver.log
dmesg_after_program_but_before_load_driver.log
dmesg_before_programm_fpga.log

Thanks again!

KC-Kevin (Author) commented:

Hi,

I just did another experiment, following the suggested order, but with an extra step of warm-rebooting machine A after programming the FPGA through machine B's Vivado JTAG:

On machine B:

# previous steps are the same as in the previous reply, so the same generated bitstream is re-used
# program the bitstream in the Vivado hardware manager

On machine A:

# after programming the bitstream from machine B, warm-reboot machine A
source tapasco-setup.sh
# only do the hot reset here
tapasco-load-bitstream bitstream_programmed_before.bit --mode hotplug
# the hot reset gets stuck

The output of hot-plug is:

hotplugging device: c4:00.0
hotplugging finished

There are two U250s in machine A. The FPGA devices in dmesg should be c4:00.0 and c1:00.0. The one programmed in Vivado should be c4:00.0.

Here is the full dmesg log after the program gets stuck at hotplug:
dmesg_after_program_warm_reboot_hot_reset.log

KC-Kevin (Author) commented:

Hi,

I would like to provide more information/things I tried from my side to help identify the issues.

Here is a screenshot of Vivado after we program the arraysum bitstream; it shows the memory controller is properly calibrated:
[screenshot: vivado status]

I also tried tapasco-build-libs --rebuild --mode debug as suggested in the debugging documentation. However, this does not help, because the error happens when the driver is loaded/the device is hot-reset.

Another question: can I use JTAG to read/write CSRs (control/status registers) to check the status of the hardware?

Thanks again!


tsmk94 commented Jun 12, 2023

Hi,

thanks for the additional information. I suspect an issue in the Linux kernel itself. As you can see in the stack trace in the dmesg log, the error (or BUG, as printed in the error message) occurs in the Linux kernel. The function called in our driver is devm_memremap_pages(). After that, many more subfunctions are called inside the kernel until the error occurs. I dove into the Linux kernel code but could not find out what is causing this bug yet. I will continue looking into it in the next days, but the problem may be below the TaPaSCo driver, which would be quite unpleasant.

Some additional questions to find out what is different in your setup compared to ours: Does your Ubuntu run in a virtual machine? Do you have PCIe devices other than the U250s plugged in? What is loaded on the other U250? What are the chances that you can install a newer Linux kernel? I do not know whether this would solve the issue; however, we are currently running newer kernel versions on our machines.

KC-Kevin (Author) commented:

Hi,

Thanks for the reply!

Here are the answers to your questions:
(1) There is no VM on the machine/Ubuntu; I use the card bare-metal.
(2) On machine A (the execution machine), I have two U250s installed. These are the only PCIe devices on the machine.
(3) Previously, the other U250 was loaded with Xilinx XRT. Regarding your suggestion of different personalities on multiple FPGAs, we also tried programming both U250s with the arrayinit bitstream (SVM enabled) and doing a warm reboot. However, after I rebuilt the libs with tapasco-build-libs, the same dmesg error happens when I load the driver.
(4) Which OS version and kernel version are you using? You mentioned that you have a newer version installed. Currently, I have Ubuntu 18.04 and 5.15.0-69-generic x86_64.

Another thing I tried is enabling the IOMMU. I get the same kernel-bug error message. Beyond the issue I already had, dmesg now gives more error information at the end. Attached is the full dmesg.

iommu_enabled.log

Thanks again!


tsmk94 commented Jun 13, 2023

Hi,

you could also try to run a bitstream with SVM disabled and see if at least this works.

(4) We are using RedHat Enterprise Linux 9 with kernel version 6.3.2 currently.

KC-Kevin (Author) commented:

Hi,

I am considering setting up a new OS on our machine now. Will the latest Rocky Linux 9.2 with kernel version 5.14.0 work as an alternative to the OS you have now? The main concern is that your kernel version on RedHat is 6.3.2.

Also, did you test the SVM feature on Ubuntu with kernel version 5.15.x before?

Currently, if I use the main branch without SVM, I am able to get the driver loaded and start the program, but there are still issues (detailed at the very beginning of this issue page). The other reason is that I would like to try out the SVM features (single-FPGA and multi-FPGA) in the system.

Thanks again!

KC-Kevin (Author) commented:

Hi,

I would like to share more information on the debugging process. We fixed the timing issue of the generated bitstream.

With a bitstream without SVM, I am able to get the arraysum example to work and the example software runs through.
However, with a bitstream with SVM, the same issue still exists when I load the driver (the dmesg error with the kernel bug).

Thanks again!

KC-Kevin (Author) commented:

Hi,

I would like to update with more information on the debugging process. I installed Rocky Linux 9 with OS kernel version 5.14.0-284.11.1.el9_2.x86_64.

Now, loading the driver with the SVM feature enabled works. I can program the bitstream (with correct timing), do a warm reboot and load the driver with SVM without any issue. However, when I run the program, it just gets stuck. I attached the dmesg log here for your reference. Also, the 84:00.2 device is a NIC, and its error messages may not matter much.

The bug happens both with and without the IOMMU enabled. The attached dmesg log is with the IOMMU enabled.

rocky_linux.log

Another question: how do I run the counter example? I am able to synthesize it successfully, but I do not see good instructions/host code for interacting with it. I think getting it running may help the debugging process.

Thanks again!


tsmk94 commented Jun 14, 2023

Hi,

thank you very much for your additional debugging effort!

The new kernel at least solved the issue that was unrelated to the actual TaPaSCo code. As background information: what I can see from your log is that there are CPU page faults on the same page again and again, so the migration from device memory to host memory does not seem to succeed. I am currently trying to figure out what is different in this particular kernel version compared to other versions. In this part of the kernel there is a lot of development activity and many changes between versions. If I remember correctly, I started developing two years ago with version 5.13 and have used various other versions since, but of course I could not test every version.

I could not figure out the exact issue yet. However, I remember having a similar issue with even newer kernel versions, which I could fix by introducing this version check:

#if LINUX_VERSION_CODE < KERNEL_VERSION(5,17,0)

You could check whether this applies to your version as well by adjusting the check to #if LINUX_VERSION_CODE < KERNEL_VERSION(5,14,0), but I cannot be sure whether this solves your issue or introduces other ones. Otherwise I will probably have to try setting up a system with the exact same kernel version in our lab.

On RockyLinux you can also install newer kernels with kernel-ml (https://wiki.crowncloud.net/?How_to_Install_Kernel_6_x_on_RockyLinux_9). But of course this might not be possible if other users at your lab are using the same machine.

Regarding timing issues in your bitstream: on the U280 we create a pblock for the PCIe core and constrain it to the bottom SLR (see the create_pblock pblock_axi_pcie constraint in the AU280 platform sources). Maybe this is also an option for the U250. I'm sorry that our U250 support is not as good as for the U280, since we currently do not have a U250 for testing.

I hope we can fix your issue soon and get everything running on your system!

Edit:
You can run the counter with libtapasco_tests run_counter (needs at least 4 counter instances) or libtapasco_tests run_benchmark. The Counter-PE must be imported using ID 14.

KC-Kevin (Author) commented:

Hi,

Thank you so much for the continuous help. I have now upgraded the OS kernel to 6.3.7-1.el9.elrepo.x86_64 on Rocky Linux 9. With this kernel version I am able to successfully load the driver and run the single-FPGA SVM example with arraysum.

One issue with this kernel version: when I program the bitstream from machine B using JTAG, machine A (where the actual runtime environment is) auto-reboots upon programming. I speculate that the surprise link-down on PCIe causes the OS to do this automatically. Are you aware of an option in Rocky Linux or this kernel version to not auto-reboot? Maybe there is a Linux parameter to set?

Currently, I am trying out the multi-FPGA SVM feature and want to get some performance numbers/benchmarks for both single-FPGA and multi-FPGA SVM. I have some questions from trying out the examples:

(1) How do I run the bandwidth example in runtime/example/C++/bandwidth properly? Does it require a bitstream on the FPGA or not? I programmed the system with arraysum as the IP core. However, I got the following output when I execute it:

�E
terminate called after throwing an instance of 'tapasco::tapasco_error'
  what():  �E
Aborted

Do you have any input on this bug?

(2) I see that runtime/example/C++/arraysum (and arrayinit and arrayupdate) has an SVM example, and runtime/example/C++/svm has a combination of arraysum, arrayinit and arrayupdate. Currently, the SVM example works for both on-demand and user-managed migrations. However, my understanding is that this code only works for single-FPGA SVM, and runtime/example/C/arraysum does not have SVM support. So, is there more multi-FPGA software source code, and performance benchmarking code for multi-FPGA SVM? Is there any chance of getting access to the benchmarking code?

(3) Is there any documentation on how to properly program the instances for multi-FPGA SVM? Naively, I guess programming two FPGAs with the same bitstream should work. A more complicated case: can I have different IP cores on the two FPGAs and have both programs talk through SVM (with PCIe endpoint-to-endpoint or Ethernet)?

(4) The last question: for Ethernet SVM, do the MAC address and port in the compose command refer to the source or the destination? If I have two FPGA boards, how should I compose the two designs to generate the two bitstreams? Are they the same bitstream or not?

Thanks again!


tsmk94 commented Jun 15, 2023

Hi,

I'm happy to hear that you finally got it running! I will consider updating the kernel requirement in the documentation.

Regarding the auto-reboot, I will have to ask my colleagues. Do you have this issue with non-TaPaSCo bitstreams as well? And did it also occur with Ubuntu, or is it related to Rocky Linux?

(1) The bandwidth example requires a bitstream with any PE. However, it is intended for non-SVM bitstreams and does not work with SVM bitstreams currently. We might check and extend this in the future, as it would be a helpful tool for simple benchmarking.

(2) The svm example in our develop branch automatically detects how many devices are attached and which of the arrayinit, arrayupdate and arraysum PEs are available. It then runs the respective tests. If possible, it also runs a pipeline of these three PEs distributed over two FPGAs to enforce a direct device-to-device migration. The runtime also detects automatically which options are available to perform the migration (Ethernet, PCIe).
The usual arraysum example works both without and with SVM, but then always uses user-managed migrations. These examples are not really suitable for benchmarking as they are way too small. Unfortunately, we do not provide further examples.

(3) Both are possible. You can have either the same PEs or different PEs on the two FPGAs; they can talk through SVM in both cases. The only exception is if you want to use migrations over Ethernet (see 4). Otherwise, there is no issue with using the same bitstream on both FPGAs.

(4) During compose you set the MAC address and the QSFP+ slot (port parameter) you want to use in this specific bitstream. So currently, the MAC address is hard-coded into the bitstream. This means you always need to generate distinct bitstreams for the different FPGAs so that the same MAC address does not appear twice in the Ethernet network.

I hope my remarks are helpful for you.

KC-Kevin (Author) commented:

Hi,

Thank you so much!

The auto-reboot issue only showed up after I switched from Ubuntu to Rocky Linux (with kernel version 6.3.7). When I was experimenting with TaPaSCo under Ubuntu, the issue did not exist. It only pops up on the current Rocky Linux, and it also occurs for other non-TaPaSCo bitstreams on kernel 6.3.7. So I speculate there are some OS/kernel parameters to set in Rocky Linux to disable the auto-reboot upon bitstream programming.

Thanks for considering enhancing the bandwidth benchmark with SVM features. That will be helpful.

For the SVM implementation, I am currently working on verifying that the bitstream and software run correctly. A higher-level question: if the bitstream has 1. PCIe endpoint-to-endpoint and 2. Ethernet (with bouncing through the host as the third approach), will the driver/TaPaSCo system automatically choose a suitable mechanism (1 or 2) during execution of the user program (e.g. arrayinit), or is there a way for the user to specify the communication mechanism? The question is basically how the communication mechanism is determined given multiple devices.

Do you have any support for substituting the user IP in a more flexible way? My understanding is that the user IP (e.g. arraysum) is integrated with the system wrapper and generated as a whole. So if the bitstream is generated with arraysum and arrayupdate, and later I want to change the arrayupdate IP to arrayinit, I have to re-generate the whole bitstream. A more flexible way would be to substitute the user IP dynamically (i.e. swap arrayupdate for arrayinit).

Thanks again for all the reply/remark/help!

KC-Kevin (Author) commented:

Hi,

A little more progress. Based on the dmesg log below, did I run the svm example successfully with PCIe endpoint-to-endpoint and Ethernet over QSFP?

multi-fpga_svm.log

For example, do lines like the ones below indicate that multi-FPGA SVM successfully ran by bouncing through the host:

[ 5440.295908] tapasco device #00 [svm_migrate_to_device]: migrate 1 pages with base address 1fbf000 to device memory
[ 5440.295914] tapasco device #00 [svm_migrate_ram_to_dev]: migrate 1 pages with base address 1fbf000 to device memory

and does the line below indicate that multi-FPGA SVM successfully ran with PCIe endpoint-to-endpoint:

[ 5440.288297] tapasco device #01 [pcie_svm_user_managed_migration_to_device]: user managed migration to device with vaddr = 1fbf680, size = 400

I do not see logs indicating that Ethernet runs successfully. This is probably because I set the QSFP port to 0 for both bitstreams, while the physical connection in the machine is zig-zag (FPGA A port 0 connects to FPGA B port 1 and vice versa). I will generate new bitstreams that reflect the real connection.

Thanks!

KC-Kevin (Author) commented:

Hi,

I studied the dmesg log above together with pcie_svm.c in the tlkm driver code. Based on the dmesg log, I do not think PCIe P2P is enabled, because it should print "use PCIe P2P copy", as shown at line 1010.

Similarly, for Ethernet, I added a DEVLOG but do not see its message in dmesg. After inspecting Vivado, I found that the MIG calibration passed, but no Ethernet IP status is shown after programming. Do you have a way to check the Ethernet IP status? Because it needs the CSR setup checking the Ethernet IP status to work.

Have a good weekend!

Thanks again!


tsmk94 commented Jun 19, 2023

Hi,

please, see the answers to your different questions below. I hope I covered everything.

Auto-reboot:
We do not have a solution for the auto-reboot issue. In our experience it sometimes happens if a driver is still accessing the device. So make sure to always stop running TaPaSCo applications before loading a new bitstream (in case you did not do so anyway).

Copy method:
The runtime checks dynamically which copy method is supported by the loaded bitstreams. The precedence is 1. Ethernet, 2. PCIe E2E, 3. bounce buffer.

Dynamic exchange of IPs:
Unfortunately, we do not support any dynamic replacement or reconfiguration of parts of a bitstream. So you will need to generate new bitstreams for other PEs every time.

SVM example:
I could reproduce your issue. You are right, there is no direct migration between the FPGAs. It is probably caused by the small size of the arrays in this example. There seem to be other objects on the same memory page as the array, leading to CPU faults and ping-pong migrations. That is why there is no direct migration from one device to the other. If you make sure the array is on its own memory page, you can see the direct migration in the log. This can be enforced by replacing the following line

auto *arr = new element_type[SZ];

with an aligned allocation (from cstdlib-header):

auto *arr = static_cast<element_type*>(aligned_alloc(4096, 4096));

You will then see svm_migrate_dev_to_dev and the respective copy calls in dmesg. Maybe start with non-Ethernet bitstreams first to see it generally works in the case something with the Ethernet configuration is wrong.

Cheers!

@KC-Kevin
Copy link
Author

Hi,

Thanks for detailed response!

With the aligned allocation, I am able to see the svm_migrate_dev_to_dev message now. However, with the PCIe endpoint-to-endpoint method, the SVM example does not pass. I see this in dmesg:

[  380.999081] DMAR: [DMA Write NO_PASID] Request device [c4:00.0] fault addr 0xfffff000 [fault reason 0x05] PTE Write access is not set

I also see this from output of sudo lspci -vvv

Region 2: Memory at <unassigned> (64-bit, prefetchable)

So, there is something wrong with the PCIe BAR. Even bouncing through the host is not working. I found that disabling the IOMMU may help to resolve this issue, per this link.

So, do you have the IOMMU enabled in your current system setup? Or do you have any debugging tips for this issue?

For the auto-reboot issue, I tried unloading the driver, and the machine still auto-reboots when a new bitstream is programmed. So I guess there is no quick/easy fix for this.

Thank you!

@tsmk94
Copy link
Contributor

tsmk94 commented Jun 20, 2023

Hi,

I think these are two distinct issues. The first error message is related to the Intel IOMMU, which seems to block DMA to host memory for some reason, if I understand the error message correctly. Like you, I found some bug reports where other devices (e.g. GPUs) have similar issues with the Intel IOMMU. We use AMD servers, so I have not encountered this myself yet. But on one of our servers the AMD IOMMU is switched off, so maybe switching it off in your setup could resolve this.

The second problem is that the second PCIe BAR, which is used for direct PCIe E2E memory access, is not assigned. Maybe you can check your dmesg log for any error message related to this? The BAR is quite large at 4 GB; however, this is required. Without memory assigned, data cannot be written through the PCIe bus from one device to another.

@KC-Kevin
Copy link
Author

Hi,

Thanks for the response!

The first issue is resolved by turning the Intel IOMMU off. Ethernet also works now.

However, the second issue still exists, and it seems related to some OS parameter. I see the following messages in dmesg:

[    2.507706] pci 0000:c1:00.0: BAR 2: failed to assign [mem size 0x2000000000 64bit pref]
[    2.507794] pci_bus 0000:c0: Some PCI device resources are unassigned, try booting with pci=realloc

I tried both pci=realloc=off and pci=realloc=on, but neither worked, so I am still debugging it.

I also have some other questions from playing with the examples:

(1) If I understand correctly, the HLS code for array update is in toolflow/examples/kernel-examples/arrayupdate/, but I do not see any pragmas in it. A simple pragma like #pragma HLS unroll factor=8 would help with performance. Is it a deliberate choice not to add pragmas, or is it not HLS code at all?

(2) How do I time the code properly? In the SVM example code, I tried inserting a C++ std::chrono::high_resolution_clock timer to wrap around

auto update_job = arrayupdate_dev->launch(arrayupdate_id, arr_addr);
update_job();

However, I found that changing the array size from 16 kB to 4 MB does not change the timing result much; the variation is around 20%. For example, changing from the 16 kB array auto *arr = static_cast<element_type*>(aligned_alloc(4096, 4096*4)); to the 4 MB array
auto *arr = static_cast<element_type*>(aligned_alloc(4096, 4096*1024)); does not lead to a proportional change in latency. I attached a timing log at the end for your reference.

I noticed that the HLS code also defines the array size via SZ, so I also re-synthesized the HLS kernels with 2050 elements for arrayupdate/arraysum/arrayinit. However, I do not see a doubling of the time reported by the C++ timer.

Changing the array size in either the HLS kernel or the host code does not change the latency much. Does that mean the launch function executes asynchronously, so the timing result is not accurate? If so, how should I do the timing/benchmarking properly?

I sometimes observe up to 20% variation when timing exactly the same command; is that a common experience?

(3) Since TaPaSCo supports Ethernet between two FPGAs: do you support two FPGAs located in two different nodes, such that one FPGA in node A can talk to the other FPGA in node B through Ethernet? Based on the paper/GitHub readme, I do not see support for SVM across FPGAs in different nodes, but I just want to confirm this point.

(4) Do you support huge pages such as 2 MB? Given the HMM integration, I guess the answer is no, but I would like to get a confirmation.

Thanks again!

Appendix:
Timing with a 4 MB array in the host code (with Ethernet as P2P) and SZ in the HLS array set to 2050:

execution time (ns) of array init (cp data with user-managed from host to fpga 1): 193691
execution time (ns) of array update (cp data from fpga 1 to fpga 2 with on-demand): 322678
execution time (ns) of array update (cp data from fpga 2 to fpga 1 with on-demand): 876961
execution time (ns) of array update (no data movement, only compute time): 52914

Timing with a 16 kB array in the host code (with Ethernet as P2P) and SZ in the HLS array set to 2050:

execution time (ns) of array init (cp data with user-managed from host to fpga 1): 176050
execution time (ns) of array update (cp data from fpga 1 to fpga 2 with on-demand): 289793
execution time (ns) of array update (cp data from fpga 2 to fpga 1 with on-demand): 258337
execution time (ns) of array update (no data movement, only compute time): 52716

Similar timing results are observed with SZ=256 in the HLS code.

@tsmk94
Copy link
Contributor

tsmk94 commented Jun 22, 2023

Hi,

(1) These example kernels are not optimized in any way. They should only demonstrate how to use TaPaSCo in general.

(2) The tapasco->launch(...) and job() calls are synchronous. So if you start measuring before launch() and stop after job(), you are measuring the actual runtime of the PE including data transfers. Only if you use on-demand page migrations might you not measure the time required for the back-migration to the host, because it is only performed after a CPU page fault on the respective page.

Enlarging the allocated host buffer alone does not affect latency at all, because it does not imply the buffer is also completely migrated to device memory. The migrated size is determined by the makeWrappedPointer() or tapasco.copy_to() call, or by the data which is actually touched by the PE.

In order to see a change in latency, you have to explicitly enforce migration of a larger buffer and/or modify the SZ parameter of the HLS core so that it works on a larger data set.

However, the problem sizes of these example cores are very small. The runtime of the HLS core barely registers, and it does not really matter whether you migrate one, two or three pages, because the overhead of launching the PE, migrating the data and handling the interrupt(s) dominates. So you have to increase the problem size much further until you can neglect these effects.

(3) No, all FPGAs must be in the same node, as they all share the address space of the same host application and need to be managed by one driver.

(4) No is the correct answer. As far as I know, HMM is still limited to 4 kB pages.

I hope I could clarify your questions.

@KC-Kevin
Copy link
Author

Hi,

Thanks for the response! I will explore performance measurement following your suggestions.

For on-demand latency measurement, does that mean only bouncing through the host may not include the CPU page fault time? On the other hand, on-demand latency measurement using PCIe endpoint-to-endpoint or Ethernet should be accurate, because it does not go through the host.

As I explore the examples/documentation, I have the following two questions:

(1) All the HLS examples I saw use static arrays. Is there any support for dynamic arrays? By dynamic array, I mean something like the example provided in the Vitis kernel bandwidth example. I tried to port this example into the TaPaSCo framework. The HLS kernel synthesis is fine, but when generating the bitstream, Vivado gives an error like:

Attempting to get a license for feature 'Implementation' and/or device 'xcu250'
Running DRC as a precondition to command opt_design
Starting DRC Task
ERROR: [DRC INBB-3] Black Box Instances: Cell 'system_i/memory/mmu_sc/inst' of type 'bd_c4b6' has undefined contents and is considered a black box.  The contents of this cell must be defined for opt_design to complete successfully.
ERROR: [Vivado_Tcl 4-78] Error(s) found during DRC. Opt_design not run.

I am still debugging this issue, but it would be helpful if you have any insight on dynamic arrays in HLS.

(2) In the HLS kernel.json file, two ways are provided to pass an argument: value and reference. For passing by reference, how do I define multiple AXI interfaces properly? In the documentation on HBM, I saw a way to define different AXI interfaces. If only regular DRAM is used and the argument is passed by reference, is there a way to specify multiple AXI interfaces as is done for HBM? Also, if that is possible, where should I put those specifications?

Thanks! Have a good weekend!

@KC-Kevin
Copy link
Author

Hi

Another quick question I have is on the semantics of argument passing in kernel.json for HLS. I notice the passing field in the arguments of kernel.json is either value or reference. What is the meaning/semantics of this field? Also, what is its relationship to argument passing in the function signature of the HLS code? In HLS, for example, an array is usually passed as a pointer, and sometimes the output may be treated as a reference with a streaming interface. I am confused about the argument types used by the function signature itself versus the pass-by-reference/value used in the JSON file.

Thanks!

@tsmk94
Copy link
Contributor

tsmk94 commented Jun 26, 2023

Hi,

(1) TaPaSCo only uses the Vitis/Vivado tools. Hence, you can use everything that is supported by Vitis; please consult the Vitis documentation for dynamic-sized arrays.

(2) tapasco hls allows two different ways of passing arguments:

  • by-value is for scalar values which are passed directly to a configuration register via an AXI slave interface
  • by-reference is for array arguments. This generates an AXI master port, and the memory address is passed to a configuration register via the AXI control slave

AFAIK tapasco hls will generate one AXI master per array argument. If you want more control over this, you can also use Vitis HLS to generate your IP and import the generated ZIP file, as long as the generated IP is compatible with TaPaSCo regarding ports and configuration register layout.
As there is only one DRAM port (and only one MMU in the case of SVM), multiple AXI masters are always merged into one AXI interconnect tree. The mentioned HBM extension is a special case; however, we do not support DMA to HBM in this extension.
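To make the two passing modes concrete, a kernel.json argument list might look roughly like the following (field names and casing here are assumptions based on this thread, not the verified schema; consult the shipped example kernels for the exact format):

```json
{
  "name": "arrayupdate",
  "arguments": [
    { "name": "arr",  "passing": "reference" },
    { "name": "size", "passing": "value" }
  ]
}
```

Per the explanation above, `arr` (by-reference) would get an AXI master port with its memory address written to a configuration register, while `size` (by-value) would be written directly to a configuration register via the AXI slave.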

@KC-Kevin
Copy link
Author

Hi,

Thanks for the response!

You mentioned that everything supported by Vitis can be used, so I tried specifying the input argument as a pointer in the HLS function definition. Using pointers is common practice in HLS and should work. However, while my test code adapted from the Vitis example (which uses the pointer) can generate a bitstream and my host program runs, it does not produce the desired output (the output value is untouched).

To eliminate other changing factors and verify whether a pointer in the function definition works or not, I also changed the way the array is passed in the array update example from arrayupdate(int arr[SZ]) to arrayupdate(int* arr). The rest of the code is kept exactly the same as the given example, including passing by reference in the kernel.json file and SZ=256. The only change I made is from arrayupdate(int arr[SZ]) to arrayupdate(int* arr), and passing by pointer should work in HLS. However, I got the error below. The error still shows up even after I re-program the bitstream and do a warm reboot.

[ 631.577760] tapasco device #00 [svm_migrate_ram_to_dev]: could not find matching VMA for address 0x7ffd72cc1000
[ 631.577767] tapasco device #00 [drop_page_fault]: drop page fault: vaddr = 7ffd72cc1000
[ 631.577770] tapasco device #00 [handle_iommu_page_fault]: error during page fault, disabling IOMMU, please reload bitstream

My question is about the proper way to work with passing by pointer in the function definition, because passing by pointer can associate an internal buffer with a DDR array and enables working with dynamically sized arrays. Am I missing something in using HLS? Since you mentioned the ports and configuration register layout may be a problem, I also checked the AXI interface generated by HLS: it is AXI_MM, which should work.

The last question is on static arrays: do you have any limitation on their size? I tried to allocate a 32 MB array in the HLS code without changing anything else, but the program got stuck again. I am currently debugging this issue with a 1 MB static array in the HLS code, but your insight would be helpful.

Thanks again!

@tsmk94
Copy link
Contributor

tsmk94 commented Jun 28, 2023

Hi,

the error message indicates that your PE tries to access a virtual memory address which is not backed by a physical memory page, so it has probably not been allocated on the host before. All memory you want to access must be properly allocated on the host first; otherwise the TaPaSCo driver cannot migrate the pages to device memory.

Regarding your general issues with Vitis HLS, please consult the official AMD/Xilinx documentation.

@KC-Kevin
Copy link
Author

Hi,

Thanks! I think simple array copying from an input array to an output array is working with pass by pointer in a single-FPGA host setup.

As I am testing the user-managed and on-demand page migrations, I would like to know whether the system sets any limit on the size of arrays moved between host and device, other than the U250 board constraints. With user-managed migration, I am able to copy a 1 GB array from the host to a single device; with on-demand, I am able to copy a 32 MB array from the host to a single device. When I further double the array size, the user-managed or on-demand migration fails. The program gets stuck and the dmesg log looks like:

[17664.027224] tapasco device #01 [init_h2c_dma]: initiate H2C DMA: host addr = 0, device addr = 2089000, length = 1, clear = 1, network = 0
[17664.027231] tapasco device #01 [add_tlb_entry]: add TLB entry: vaddr = 7f6f22241000, paddr = 2089000
[17664.027234] tapasco device #01 [insert_vmem_interval]: insert vmem interval at 0x7f6f22241000 with 1 pages
[17664.027237] tapasco device #01 [svm_migrate_ram_to_dev]: migration to device memory complete
[17664.027259] tapasco device #01 [svm_migrate_to_device]: migrate 1 pages with base address 7f6f24243000 to device memory
[17664.027262] tapasco device #01 [svm_migrate_ram_to_dev]: migrate 1 pages with base address 7f6f24243000 to device memory
[17664.027267] tapasco device #01 [svm_migrate_ram_to_dev]: failed to collect all pages for migration
[17664.027272] tapasco device #01 [drop_page_fault]: drop page fault: vaddr = 7f6f24243000
[17664.027275] tapasco device #01 [handle_iommu_page_fault]: error during page fault, disabling IOMMU, please reload bitstream

Any insight/suggestion is appreciated!

Thanks!

@tsmk94
Copy link
Contributor

tsmk94 commented Jul 5, 2023

Hi,

I tried to reproduce your issue with 64 MB buffers; however, it worked on my side. The given error message indicates that the Linux kernel cannot resolve the requested page, which is why I assumed you might not allocate the memory before the migration. It is a bit confusing that it seems to work with user-managed migrations; are you using the same host software?

@KC-Kevin
Copy link
Author

KC-Kevin commented Jul 8, 2023

Hi,

Thanks for the reply!

I attached the host code and HLS code below for your reference. The host code is modified based on the given SVM example to test single-FPGA cases and multi-FPGA P2P cases with multiple RUNs. The HLS code takes an input array and copies its values into an output array. The number of elements in the array is determined by the num_block argument.

If possible, could you please run a test with your system setup to verify the correctness of the HLS and host code? It is ready to synthesize and run. I am not sure whether the OS kernel version (mine is 6.3.9-1.el9.elrepo.x86_64, yours is 6.3.2) or some other minor system setup difference could cause this strange issue. The current test is done on commit 0da497f in the develop branch.

In the tests for user-managed/on-demand with a single FPGA or multiple FPGAs, I use the table below to record the array size at which I hit the error mentioned before:

[ 1005.893974] tapasco device #01 [svm_migrate_ram_to_dev]: failed to collect all pages for migration
[ 1005.893984] tapasco device #01 [handle_iommu_page_fault]: error during page fault, disabling IOMMU, please reload bitstream
user managed (single FPGA): 128 MB
on demand (single FPGA): 32 MB
user managed (P2P PCI-E E2E): 64 MB
on demand (P2P PCI-E E2E): 16 MB

I have two questions:
(1) How do you check that user-managed PCIe endpoint-to-endpoint P2P works? I ran the user-managed P2P with PCIe endpoint-to-endpoint; however, based on the dmesg log attached below, I think the array passed to the other device is first copied to the host and then copied back to the other device. The array has num_blocks=4096, which means it is 4 pages long. I think there is no issue with PCIe E2E itself, because with on-demand P2P I see “use PCIe P2P copy” in the dmesg log.
(2) How do I time the code properly? I think this is related to Q1. With user-managed migration, when I just time the launch() function on the host, based on the dmesg log I think it includes copying the array to the device and back to the host, because there is no explicit copying and the output array can be checked directly with run_interface. Do you have any other recommendation for timing the code, e.g. inserting timing functions into pcie_svm.c?

Thanks again! Have a great weekend!

Attached files:
host code (should be given a C++ suffix)
HLS code (should be in interface_test_2.cpp)
HLS kernel.json file

dmesg log: p2p_pcie_user_managed_4page.log

@KC-Kevin
Copy link
Author

Hi,

I also did a test with a U280 in a single-FPGA host setup. The OS/system setup is the same as before (only the FPGA board changed).

The table below shows the array sizes where we see the dmesg error (attached below) during execution. RUN is 10, meaning for each array size I repeat the execution 10 times to ensure stability. The error shows up after 3-7 repetitions. So, for example, in the single-FPGA host with the user-managed method, I can run the 32 MB array test for 5 runs, but then I run into the dmesg errors below.

[  439.362084] tapasco device #00 [svm_migrate_ram_to_dev]: failed to collect all pages for migration
[  439.362092] tapasco device #00 [drop_page_fault]: drop page fault: vaddr = 7fe120505000
[  439.362095] tapasco device #00 [handle_iommu_page_fault]: error during page fault, disabling IOMMU, please reload bitstream

The table below is similar to the results obtained in the previous reply using the U250.

user managed (single FPGA): 32 MB
on demand (single FPGA): 2 MB

Any insight/comment will be helpful!

Thanks in advance!
