Add infrastructure to debug deadlocks of the AXI connections #234

Open
jahofmann opened this issue Jul 30, 2020 · 9 comments
Labels: bug (Something isn't working), Feature Request, good first issue (Good for newcomers)

Comments

@jahofmann
Contributor

Right now it is very easy to deadlock the whole system if a PE, e.g., reads from and writes to the DDR but never deals with the result of either request. Such a situation is currently quite hard to debug. Most often this scenario occurs while the DMA engine is active at the same time.

Right now I have two possible solutions:

  • Add infrastructure to the AXI ports of a PE that counts the number of in-flight requests and the number of cycles that the response channels block. This way it is quite easy to see which PE fails to handle AXI properly (see the behavioral sketch after this list).
  • Add a buffer in between the PE and the memory that can store the required number of read beats, to make sure that reads can always complete. In the write direction, this buffer would wait until all write beats have been received from the PE before forwarding them to the upstream memory.
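A behavioral sketch of the first option in plain C (not synthesizable RTL; all struct and parameter names here are hypothetical). In hardware, such a monitor would sample the AXI handshake wires once per clock cycle:

#include <stdbool.h>
#include <stdint.h>

// Hypothetical per-port monitor state; names are illustrative only.
typedef struct {
    uint32_t inflight_reads;   // AR handshakes minus completed R bursts
    uint32_t inflight_writes;  // AW handshakes minus B handshakes
    uint64_t r_stall_cycles;   // cycles with RVALID high but RREADY low
    uint64_t b_stall_cycles;   // cycles with BVALID high but BREADY low
} axi_monitor_t;

// Called once per clock cycle with the sampled handshake signals.
void monitor_tick(axi_monitor_t *m,
                  bool arvalid, bool arready,
                  bool rvalid, bool rready, bool rlast,
                  bool awvalid, bool awready,
                  bool bvalid, bool bready) {
    if (arvalid && arready)        m->inflight_reads++;
    if (rvalid && rready && rlast) m->inflight_reads--;
    if (awvalid && awready)        m->inflight_writes++;
    if (bvalid && bready)          m->inflight_writes--;
    // A PE that stops accepting responses shows up as a growing stall count.
    if (rvalid && !rready)         m->r_stall_cycles++;
    if (bvalid && !bready)         m->b_stall_cycles++;
}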
@jahofmann jahofmann added bug Something isn't working Feature Request labels Jul 30, 2020
@jahofmann jahofmann added this to the 2020.10 milestone Jul 30, 2020
@jahofmann jahofmann added the good first issue Good for newcomers label Oct 8, 2020
@jahofmann jahofmann modified the milestones: 2020.10, 2021.4 Oct 8, 2020
@cahz cahz modified the milestones: 2021.4, 2021.10 Apr 7, 2021
@forflo

forflo commented Jul 28, 2021

I just ran into this issue with a stencil PE synthesized by Vivado HLS that deadlocks the whole system.

The code in question looks like this:

void tlf(float *buf1, float *buf2, ...) {
  for (int i = 0; ...) {
    for (int j = 0; ...) {
      for (int k = 0; ...) {
        buf1[calcIdx(i,j,k)] = buf2[calcIdx(i,j,k)] + buf2[...] + buf3 + ...;
      }
    }
  }
}

Am I using the HLS wrong (i.e., is this a case of RTFM) or is it a bug in Vivado HLS that it generates AXI master bus requests that deadlock the system?

If so, could you point me to some information about how to work around this issue?

@sommerlukas
Member

Hi @forflo,

so far we have mainly observed this behavior with two HDL-implemented PEs. Usually, the problem is that one PE has an ongoing write request (e.g., AXI4 burst write) that it cannot complete until it reads further elements. At the same time, another PE has an ongoing read request (e.g., a burst read) that it cannot complete before it can store some results. This situation then leads to a deadlock, as neither PE can resume operation and complete the ongoing request.

For your case: Have you tried to (partially) buffer the inputs in some internal array (= on-chip memory) and see if the problem persists? How many PEs do you have in your design?
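As an illustration of the buffering suggestion, a minimal sketch (assuming the access pattern can be tiled; TILE, the function name, and the trivial computation are placeholders, not the actual stencil):

#define TILE 256

// n is assumed to be a multiple of TILE for brevity.
void tlf_buffered(float *out, const float *in, int n) {
    float local[TILE];  // maps to on-chip BRAM
    for (int base = 0; base < n; base += TILE) {
        // Phase 1: burst-read one tile into on-chip memory.
        for (int k = 0; k < TILE; k++)
            local[k] = in[base + k];
        // Phase 2: compute and write back; reads and writes no longer
        // interleave on the AXI memory interface.
        for (int k = 0; k < TILE; k++)
            out[base + k] = local[k] + 1.0f;  // placeholder computation
    }
}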

You should also make sure to not read or write past the boundaries of your input/output arrays; this can also cause the memory interface to stop operating.

@forflo

forflo commented Jul 29, 2021

Hi @sommerlukas,

thank you for your quick reply.

How many PEs do you have in your design?

Just one, synthesized with Vivado HLS 2018.2 and used in a TaPaSCo composition (current master). I also tried Vivado HLS 2020.1, with the same behavior.

You should also make sure to not read or write past the boundaries of your input/output arrays

I know that this could have been the reason for the problem, so I first tried to find out-of-bounds accesses with valgrind. I found none and tried again with compiler instrumentation (-fsanitize=address). I also inserted hand-written in-bounds assertions for all index calculations (i.e., assert(indexCalc(...) < size && indexCalc(...) >= 0)). I did not find any out-of-bounds accesses, and the number of bytes in both buffers I allocate on the FPGA board matches those bounds perfectly.

In the Vivado HLS log file, I noticed that it infers an AXI burst for the writes. Due to the structure of the 3D stencil, the write accesses are sequential while the reads are not. So what you describe by

an ongoing write request (e.g., AXI4 burst write) that it cannot complete until it reads further elements

might actually be the behavior of my IP.

Can it be an issue that TaPaSCo instructs the HLS to use a different port (-bundle) for each AXI master interface? From the standpoint of the AXI bus, my single IP would act like two PEs: one that reads and one that writes.

Is there a flag in TaPaSCo to bundle ports?

@sommerlukas
Member

Hi @forflo,

I think for the kind of deadlock we mainly had in mind for this issue, one would need two write ports. From the code snippet above, there only seems to be one write port. At least up to the MIG, read and write are separate signals on the AXI bus, so they should not interfere with each other, although it's hard to say definitively what the Xilinx MIG will make of that.

You could try to see if this behavior is caused by merging the ports by removing the -bundle option from the HLS TCL template.
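For reference, the same distinction at the HLS pragma level (an illustrative sketch only; the function body, the loop, and the bundle names are arbitrary): distinct bundle names give each argument its own AXI master port, while a shared name merges them onto one port.

void tlf_bundled(float *buf1, float *buf2, int n) {
#pragma HLS INTERFACE m_axi port=buf1 bundle=gmem0
#pragma HLS INTERFACE m_axi port=buf2 bundle=gmem1
    // Giving both ports the same bundle name (e.g. bundle=gmem) would
    // merge them onto a single AXI master instead.
    for (int i = 0; i < n; i++)
        buf1[i] = buf2[i];
}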

If you need to dig deeper, you could also give TaPaSCo's debug feature a try, as described in the documentation. By attaching an ILA to the master of the PE, you should be able to see the AXI transactions and maybe find the cause of the deadlock. Usage of the feature might be easier through a Job file than through the CLI.

@wirthjohannes
Collaborator

Hi @forflo,
just a quick addition to what Lukas wrote:
Some time ago I had a similar issue where a write request started sending some data beats before being interrupted and then having to wait for a DMA transfer. The DMA transfer, however, would not execute, as the MIG was blocked by the earlier write. Unfortunately, I don't remember the direction of the DMA transfer anymore and thus cannot say whether it resulted in a second write request (which would definitely be a problem, as Lukas pointed out) or a read request.
However, you should definitely try out an ILA as suggested; this helped me find the exact issue.

@forflo

forflo commented Jul 30, 2021

Thank you @wirthjohannes and @sommerlukas for your help. I will investigate that issue further and will add some of my insights here.

@forflo

forflo commented Aug 6, 2021

Okay, a short summary of what I have found out so far:

In multi-port designs, such as void tlf(float *buf1, float *buf2), where both ports are bundled onto one AXI master port, the HLS [1] creates a problematic schedule of AXI write signals. For the problematic code mentioned previously, the schedule shown in the screenshot below is generated. In the screenshot, you can see that the write request and the write response are scheduled long before the data to be written is ready. Only the actual write signal (far to the right and not visible) is scheduled after the last calculation of the assignment has finished. Since I still have to investigate this further, I am not sure whether it is a valid schedule or not, but I am now sure that it causes the deadlock in TaPaSCo.

I modified the code and forced the three events writereq, write, and writeresp (in this order) to the end of the instruction chain (which did not change otherwise). The new schedule does not deadlock the hardware.

[1]: Vivado HLS 2018.2

[Screenshot: Vivado HLS schedule viewer showing the write request and write response scheduled long before the write data is ready]

@sommerlukas
Member

Hi @forflo, thanks for investigating this, very interesting insights!

Just out of curiosity and maybe also as a future reference for other users: How did you enforce the write-request to be scheduled at the end of the instruction sequence, via a pragma or via TCL?

@forflo

forflo commented Aug 6, 2021

I first tried out pragmas and various TCL commands, but sadly none of them had the desired effect.

Then I replaced array subscripts with calls to these two functions:

void my_write(volatile float *A, int i, float val) { A[i] = val; }
float my_read(volatile float *A, int i) { return A[i]; }

That is, I converted something like A[i] = A[i] + B[i] into my_write(A, i, my_read(A, i) + my_read(B, i)), which completely sequentializes all memory accesses. my_write(A, i, A[i] + B[i]) works too, though (and then you only lose pipelining for the write access).
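Applied to a simple element-wise update, using the my_read/my_write helpers defined above (a minimal sketch; the kernel is a placeholder, not the actual stencil):

void update(float *A, float *B, int n) {
    for (int i = 0; i < n; i++) {
        // original: A[i] = A[i] + B[i];
        my_write(A, i, my_read(A, i) + my_read(B, i));
    }
}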

@cahz cahz modified the milestones: 2022.1, 2022.2 May 17, 2022
@cahz cahz removed this from the 2024.1 milestone Apr 15, 2024