Draft: KMD control over sysmem #17

@joelsmithTT joelsmithTT commented Apr 25, 2024

This introduces:

  • a mechanism for telling KMD about the hugepage(s) that a device will use for sysmem, intended as a one-time setup step
  • a new mapping, allowing a user application to map the sysmem buffer into its address space without needing to know about hugepages

The goals:

  • Enable system IOMMU
  • Enable sysmem to be partitioned between application and platform use

There is more detail in the commit messages.

This is used to configure address translation from NOC address (x=0,
y=4, 0 <= address <= 0xFFFD_FFFF) to the system bus.
This is used to configure address translation from NOC address (x=0,
y=3, 0x8_0000_0000 <= address <= 0x8_FFFD_FFFF) to the system bus.
@joelsmithTT joelsmithTT force-pushed the joelsmith/tell-kmd-about-hugepages branch from 3321b69 to 3c04dd5 Compare April 25, 2024 19:26
Tensix DMA refers to the ability of Tensix and Ethernet tiles on the SoC
to access the system bus.  This is distinct from DMA performed by the
PCIe IP block's DMA controller.  On Grayskull and Wormhole, there exists
an (almost) 32-bit address range in the PCIe tile for Tensix DMA.

- Grayskull:  0x0_0000_0000 to 0x0_FFFD_FFFF
- Wormhole:   0x8_0000_0000 to 0x8_FFFD_FFFF
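
The ranges above can be expressed as a small bounds check. This is only an illustrative sketch: the enum, macro, and function names are hypothetical and do not come from the driver; only the base addresses and the end offset 0xFFFD_FFFF are taken from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the driver's own identifiers. */
enum tt_arch { TT_ARCH_GRAYSKULL, TT_ARCH_WORMHOLE };

/* Offsets 0x0 .. 0xFFFD_FFFF inclusive: 4 GiB minus 128 KiB. */
#define TENSIX_DMA_WINDOW_SIZE 0xFFFE0000ULL

/* Base of the Tensix DMA window in the PCIe tile's address space. */
static uint64_t tensix_dma_base(enum tt_arch arch)
{
    return arch == TT_ARCH_WORMHOLE ? 0x800000000ULL : 0x0ULL;
}

/* True if [addr, addr + len) lies entirely inside the window. */
static bool tensix_dma_range_ok(enum tt_arch arch, uint64_t addr, uint64_t len)
{
    uint64_t base = tensix_dma_base(arch);

    if (addr < base)
        return false;

    uint64_t off = addr - base;

    /* Written to avoid overflow in off + len. */
    return len <= TENSIX_DMA_WINDOW_SIZE &&
           off <= TENSIX_DMA_WINDOW_SIZE - len;
}
```

Note that the window is 128 KiB short of a full 4 GiB, which is why the range is described as "(almost) 32-bit".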

Sysmem refers to the buffer(s) in host DRAM that are accessed via Tensix
DMA.  The convention as of Q1 2024 is to allocate one (Grayskull) or
four (Wormhole) 1G hugepages for sysmem, with cooperation between UMD,
KMD, and ARC firmware to make this memory accessible to hardware.

This scheme has drawbacks:
- Use of the system IOMMU is unsupported.
- UMD manipulates hardware address translation state at runtime, so
user applications (e.g. AI workloads) and platform software cannot
effectively share the Tensix DMA address space.

Improvement goals:
- Support the system IOMMU when it is enabled, without requiring it.
- Manage sysmem buffer(s) in the driver, in a NUMA-aware manner.

Initial work allocated the buffer in the kernel using the DMA subsystem,
without the need for huge pages.  With IOMMU enabled,
`dma_alloc_coherent` allocates memory that is contiguous in IO virtual
address space.  With IOMMU disabled, CMA configured at boot allows
`dma_alloc_coherent` to allocate physically contiguous memory.  Even
with large (i.e. 4 GiB) buffers, both scenarios work as expected under
Linux 6.5.

Unfortunately, this in-kernel approach to allocation does not work under
older kernels such as 5.4 (Ubuntu 20.04) and 5.15 (Ubuntu 22.04).
Experiments with a carveout approach (pass memmap to the kernel at boot)
were similarly unsuccessful on such older kernels with IOMMU enabled.

The fallback technique implemented in this commit is to tell the driver
about each 1G hugepage via an ioctl.  This can be regarded as a
one-time, per-device setup step.  When the ioctl is invoked, the driver:
- Checks that the page(s) are 1G huge
- Pins the page(s)
- Configures hardware address translation and IOMMU (if necessary)
- Tracks the page(s) so that user software can mmap a sysmem buffer
- Holds the page(s) until the ioctl is called again or the driver is
unloaded

The driver therefore provides both hardware and software with the
illusion that the sysmem for a device is contiguous.
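
One way to picture the contiguous view is as a translation from an offset in the sysmem buffer to a (hugepage, offset within hugepage) pair. The sketch below is an assumption about how such bookkeeping could look, not the driver's actual code; `struct sysmem_loc` and `sysmem_locate` are hypothetical names, and the pages are assumed to be presented in the order they were registered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HUGEPAGE_SIZE (1ULL << 30)  /* 1G hugepage */

/* Hypothetical: where a sysmem buffer offset lands in the backing pages. */
struct sysmem_loc {
    size_t   page_index;   /* which registered hugepage */
    uint64_t page_offset;  /* byte offset within that hugepage */
};

/*
 * Translate an offset in the (apparently contiguous) sysmem buffer into
 * the backing hugepage and the offset inside it.
 */
static struct sysmem_loc sysmem_locate(uint64_t buf_offset, size_t num_pages)
{
    assert(buf_offset < num_pages * HUGEPAGE_SIZE);

    struct sysmem_loc loc = {
        .page_index  = (size_t)(buf_offset / HUGEPAGE_SIZE),
        .page_offset = buf_offset % HUGEPAGE_SIZE,
    };
    return loc;
}
```

With four pages (the Wormhole case), offset `(1ULL << 30) + 5` falls in page 1 at offset 5, even though the four pages need not be physically adjacent.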

With the exception of this setup step, userspace is expected to access
the sysmem buffer via the driver, rather than interacting with
hugetlbfs directly.

Abandoning support for older kernels will allow the setup step to be
omitted, as the driver will be able to handle allocation itself.
First, the test code performs hugepage setup:
- Allocates 1G hugepages (one for GS; four for WH)
- Tells the driver about the hugepages
- Fills the hugepages with a pattern, then unmaps them
Now the driver owns the page(s) and hardware has been configured to have
access to them.

Next, the test code:
- Maps the sysmem buffer into its address space
- Checks that the pattern is still valid

Finally, the test code:
- Uses the device to read random addresses in the buffer
- Writes to random addresses in the buffer
In both cases, the resulting state is checked for validity.
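
The fill-then-verify step can be sketched as follows. The pattern used by the real test code is not described here, so the one below (a seed XORed with a multiple of the word index) is purely an assumption, and all function names are hypothetical. The idea is that a pattern depending on the word index catches both corruption and a buffer that maps to the wrong page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pattern: each 64-bit word depends on its index and a seed. */
static uint64_t pattern_word(size_t index, uint64_t seed)
{
    return seed ^ ((uint64_t)index * 0x9E3779B97F4A7C15ULL);
}

/* Fill `words` 64-bit words of `buf` with the pattern. */
static void pattern_fill(uint64_t *buf, size_t words, uint64_t seed)
{
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern_word(i, seed);
}

/* Return true if every word of `buf` still matches the pattern. */
static bool pattern_check(const uint64_t *buf, size_t words, uint64_t seed)
{
    for (size_t i = 0; i < words; i++) {
        if (buf[i] != pattern_word(i, seed))
            return false;
    }
    return true;
}
```

In the flow described above, `pattern_fill` would run on the hugepage mapping before it is unmapped, and `pattern_check` would run on the sysmem buffer mapped through the driver.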
@joelsmithTT joelsmithTT force-pushed the joelsmith/tell-kmd-about-hugepages branch from 3c04dd5 to b038417 Compare April 28, 2024 18:19