AXI Transactions for DirectDMA (local memory) #359

Open
wirthjohannes opened this issue Jun 1, 2023 · 2 comments


@wirthjohannes
Collaborator

I did some experiments using the DirectDMA implementation, which is used for transferring data to and from PE-local memory (BRAM).
I attached an ILA directly at the PCIe bridge on the FPGA to inspect the AXI transactions generated when calling the copy_to (and copy_from) methods of DirectDMA for different sizes (64 B, 128 B, 192 B, 256 B, 320 B).
The results differ from what I expected.

Firstly, the AXI transaction sizes were always 32 B (on a 64 B-wide interface; only the upper or lower half of the strobe bits was set; no bursts). Further experiments suggest that this is the upper bound per transfer here; I am not sure exactly where this limitation comes from.

But even disregarding this there were other peculiarities: For the transfers >= 128B there were more 32B transactions than required. Looking at the ILA I found that some 32B words were transmitted multiple times.

From my experiments, this does not affect correctness, as the data is simply transferred multiple times to the same address. However, it is of course still suboptimal, e.g. with regard to performance.

Details

The following table shows the exact transfers for copy_to calls of different sizes. For each size, the left column gives the actual transfers and the right column (Exp) gives what I would have expected.

64 Byte Exp 128 Byte Exp 192 Byte Exp 256 Byte Exp 320 Byte Exp
0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x120 0x000
0x20 0x20 0x20 0x20 0x20 0x20 0x20 0x20 0x100 0x020
    0x40 0x40 0x40 0x40 0x40 0x40 0x0e0 0x040
    0x60 0x60 0x60 0x60 0x60 0x60 0x0c0 0x060
    0x60   0xa0 0x80 0xe0 0x80 0x0a0 0x080
    0x40   0x80 0xa0 0xc0 0xa0 0x080 0x0a0
    0x20   0x60   0xa0 0xc0 0x060 0x0c0
    0x00   0x40   0x80 0xe0 0x040 0x0e0
                0x000 0x0100
                0x020 0x120
                0x040  
                0x060  
                0x120  
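
As a rough cross-check of the table above, the following Python sketch counts how many of the observed writes are redundant for each size. The address lists are transcribed from the "actual" columns; the helper name `summarize` and the assumption that the expected set is simply every 32 B offset from 0 up to the transfer size are illustrative, not part of the runtime.

```python
from collections import Counter

STEP = 0x20  # 32-byte transactions, as observed on the ILA

# Write addresses transcribed from the "actual" columns of the table.
observed = {
    64: [0x00, 0x20],
    128: [0x00, 0x20, 0x40, 0x60, 0x60, 0x40, 0x20, 0x00],
    192: [0x00, 0x20, 0x40, 0x60, 0xA0, 0x80, 0x60, 0x40],
    256: [0x00, 0x20, 0x40, 0x60, 0xE0, 0xC0, 0xA0, 0x80],
    320: [0x120, 0x100, 0xE0, 0xC0, 0xA0, 0x80, 0x60, 0x40,
          0x00, 0x20, 0x40, 0x60, 0x120],
}


def summarize(size, addrs):
    """Compare observed write addresses with one write per 32 B offset."""
    expected = set(range(0, size, STEP))  # assumption: one write per offset
    counts = Counter(addrs)
    return {
        "transactions": len(addrs),
        "expected": len(expected),
        "redundant": len(addrs) - len(counts),
        "covered": set(counts) == expected,
        "repeated": sorted(a for a, n in counts.items() if n > 1),
    }


for size, addrs in observed.items():
    info = summarize(size, addrs)
    repeated = [hex(a) for a in info["repeated"]]
    print(size, info["transactions"], info["expected"],
          info["redundant"], info["covered"], repeated)
```

With the values as transcribed, every expected offset is covered for all sizes; the redundant writes are 4 for 128 B, 2 for 192 B and 3 for 320 B, while the 256 B case shows the expected count in a different order.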

I also looked into the copy_from method; it behaves identically for up to 256 Bytes. For 320 Bytes (and more) it behaves differently, producing even more read transactions (e.g. 17 read transactions vs. 13 write transactions for 320 Bytes).

@jahofmann
Contributor

jahofmann commented Jun 1, 2023

The runtime uses AVX/SSE when available. AVX registers are 32 B (256 bit) wide on most machines. You could try an AVX-512 machine to see if you get 64 B requests. I'm not aware of a faster way to copy data from the CPU over PCIe if you don't want to use an on-device DMA engine.
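
To make the register-width argument concrete, here is a simplified Python model, not the runtime's actual copy loop. It assumes that each vector store of `width` bytes turns into exactly one AXI write of the same size; the function name `chunk_offsets` is hypothetical.

```python
def chunk_offsets(size, width):
    """Return (offset, length) pairs for a copy issued in `width`-byte stores.

    Assumption: one vector store maps to one AXI write of the same size.
    """
    return [(off, min(width, size - off)) for off in range(0, size, width)]


# 32-byte stores (256-bit registers) vs. 64-byte stores (512-bit registers)
for width in (32, 64):
    chunks = chunk_offsets(320, width)
    print(width, len(chunks), [hex(off) for off, _ in chunks])
```

Under this model, a 320 B copy needs 10 writes with 32 B registers and 5 with 64 B registers, which matches the "Exp" column of the issue for the 32 B case. Whether 64 B stores actually arrive as single 64 B requests at the PCIe bridge would still have to be checked on hardware.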

As for the extra requests: No idea where those might come from.

@wirthjohannes
Collaborator Author

Yes, that makes sense regarding the 32 B transfers. Thanks.
