AXI Transactions for DirectDMA (local memory) #359

wirthjohannes · 2023-06-01T12:29:28Z

I did some experiments using the DirectDMA implementation, which is used for transferring data to and from PE-local memory (BRAM).
I used an ILA directly at the PCIe bridge on the FPGA to look into the AXI transactions generated when calling the copy_to (and copy_from) method of DirectDMA for different sizes (64B, 128B, 192B, 256B, 320B).
The results differ from what I expected.

Firstly, the AXI transaction sizes were always 32B (on a 64B-wide interface; only the upper or lower half of the strobe bits was set; no bursts). Further experiments suggest that this is the upper bound per transfer here; I am not sure exactly where this limitation comes from.
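
To make the strobe observation concrete, here is a minimal Python sketch of which byte lanes (WSTRB bits) a single 32B write would enable on a 64B-wide data bus. The function name `wstrb_mask` is hypothetical, and the sketch assumes each write is a single beat that does not cross a bus-word boundary; it is a model of the expected pattern, not something taken from the ILA capture.

```python
def wstrb_mask(addr: int, size: int = 32, bus_bytes: int = 64) -> int:
    """Return the write-strobe mask for one single-beat write.

    Bit n of the result corresponds to byte lane n of the data bus.
    Assumption: the write fits inside one bus word (no boundary crossing).
    """
    lane = addr % bus_bytes  # first byte lane touched by this write
    if lane + size > bus_bytes:
        raise ValueError("write crosses a bus-word boundary")
    return ((1 << size) - 1) << lane


if __name__ == "__main__":
    # Addresses 0x00 and 0x40 select the lower half, 0x20 and 0x60 the upper half.
    for addr in (0x00, 0x20, 0x40, 0x60):
        print(f"addr=0x{addr:03x} wstrb=0x{wstrb_mask(addr):016x}")
```

Under these assumptions, 32B-aligned writes alternate between the lower and the upper 32 strobe bits, which matches the "upper or lower half" pattern described above.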

But even disregarding this there were other peculiarities: For the transfers >= 128B there were more 32B transactions than required. Looking at the ILA I found that some 32B words were transmitted multiple times.

From my experiments this does not affect correctness, as the data is just transferred multiple times to the same address. However, this is of course still suboptimal, e.g. with regard to performance.

Details

The following table shows the exact transfers (addresses of the 32B transactions, in the order observed) for copy_to calls of different sizes. For each size, the left column gives the actual transfers and the right column ("Exp") gives what I would have expected.

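| 64 Byte | Exp | 128 Byte | Exp | 192 Byte | Exp | 256 Byte | Exp | 320 Byte | Exp |
|---|---|---|---|---|---|---|---|---|---|
| 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x120 | 0x000 |
| 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x100 | 0x020 |
| | | 0x40 | 0x40 | 0x40 | 0x40 | 0x40 | 0x40 | 0x0e0 | 0x040 |
| | | 0x60 | 0x60 | 0x60 | 0x60 | 0x60 | 0x60 | 0x0c0 | 0x060 |
| | | 0x60 | | 0xa0 | 0x80 | 0xe0 | 0x80 | 0x0a0 | 0x080 |
| | | 0x40 | | 0x80 | 0xa0 | 0xc0 | 0xa0 | 0x080 | 0x0a0 |
| | | 0x20 | | 0x60 | | 0xa0 | 0xc0 | 0x060 | 0x0c0 |
| | | 0x00 | | 0x40 | | 0x80 | 0xe0 | 0x040 | 0x0e0 |
| | | | | | | | | 0x000 | 0x100 |
| | | | | | | | | 0x020 | 0x120 |
| | | | | | | | | 0x040 | |
| | | | | | | | | 0x060 | |
| | | | | | | | | 0x120 | |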
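
The comparison in the table above can also be checked mechanically. The following Python sketch is only an illustration: the `OBSERVED` dictionary is transcribed from the "actual" columns of the table, the helper name `summarize` is hypothetical, and the expected set assumes one 32B transaction per 32B-aligned offset.

```python
from collections import Counter

# Write addresses per copy_to size, transcribed from the table above.
OBSERVED = {
    64:  [0x00, 0x20],
    128: [0x00, 0x20, 0x40, 0x60, 0x60, 0x40, 0x20, 0x00],
    192: [0x00, 0x20, 0x40, 0x60, 0xa0, 0x80, 0x60, 0x40],
    256: [0x00, 0x20, 0x40, 0x60, 0xe0, 0xc0, 0xa0, 0x80],
    320: [0x120, 0x100, 0xe0, 0xc0, 0xa0, 0x80, 0x60, 0x40,
          0x00, 0x20, 0x40, 0x60, 0x120],
}


def summarize(size, addrs, chunk=32):
    """Compare observed transaction addresses against one write per chunk."""
    expected = set(range(0, size, chunk))
    counts = Counter(addrs)
    return {
        "observed": len(addrs),
        "expected": len(expected),
        "covers_all": set(addrs) == expected,  # every chunk written at least once
        "repeated": sorted(a for a, n in counts.items() if n > 1),
    }


if __name__ == "__main__":
    for size, addrs in OBSERVED.items():
        s = summarize(size, addrs)
        repeated = ", ".join(hex(a) for a in s["repeated"]) or "none"
        print(f"{size:3d}B: {s['observed']} observed, {s['expected']} expected, "
              f"complete={s['covers_all']}, repeated: {repeated}")
```

On the addresses as transcribed here, every size covers all expected offsets; the 128B, 192B and 320B rows contain repeated addresses, while the 256B row lists each offset exactly once.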

I also looked into the copy_from method; it behaves identically for up to 256 Bytes. For 320 Bytes (and more) it behaves differently, producing even more read transactions (e.g. 17 read transactions vs. 13 write transactions for 320 Bytes).


jahofmann · 2023-06-01T17:32:06Z

The runtime uses AVX/SSE when available. AVX registers are 32B (256 bit) on most machines. You could try an AVX-512 machine to see if you get 64B requests. I'm not aware of a faster way to copy data from the CPU over PCIe if you don't want to use an on-device DMA engine.
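
As a rough illustration of this explanation, the Python sketch below estimates how many write requests a copy would produce for different vector register widths. It assumes that each full-width vector store reaches the device as exactly one write transaction, which the observations in this thread suggest but do not prove; the names `VECTOR_BYTES` and `stores_needed` are hypothetical.

```python
# Register widths in bytes for common x86 vector extensions.
VECTOR_BYTES = {"SSE": 16, "AVX": 32, "AVX-512": 64}


def stores_needed(size: int, reg_bytes: int) -> int:
    """Number of vector stores needed to copy `size` bytes (ceiling division)."""
    return -(-size // reg_bytes)


if __name__ == "__main__":
    for size in (64, 128, 192, 256, 320):
        row = ", ".join(f"{name}: {stores_needed(size, width)}"
                        for name, width in VECTOR_BYTES.items())
        print(f"{size:3d}B -> {row}")
```

With 32B registers this gives 10 stores for a 320B copy, in line with the "Exp" column of the table, and half as many if 64B stores were issued.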

As for the extra requests: No idea where those might come from.

wirthjohannes · 2023-06-02T07:14:50Z

Yes, that makes sense regarding the 32B transfers. Thanks
