cxl list, no matching device found #246

Open
alexisfrjp opened this issue May 11, 2023 · 19 comments

@alexisfrjp

I have a CXL Type3 device plugged into the server.
I can see it with lspci but not with cxl-cli.

$ sudo lspci -vvv -s 70:00.0
70:00.0 CXL: Intel Corporation Device 0ddb (rev 01) (prog-if 10 [CXL Memory Device (CXL 2.x)])
	Subsystem: Intel Corporation Device 0000
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 0
	Region 0: Memory at a3200000 (32-bit, non-prefetchable) [size=2M]
	Region 2: Memory at 2080400000 (64-bit, prefetchable) [size=128K]
	Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0
			ExtTag+ RBE+ FLReset+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 512 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp+ 10BitTagReq+ OBFF Via WAKE#, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
	Capabilities: [a0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC+ UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 04080001 fc00000f 70000124 00000000
	Capabilities: [180 v1] Extended Capability ID 0x2f
	Capabilities: [588 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [5b0 v1] Transaction Processing Hints
		Extended requester support
		Steering table in TPH capability structure
	Capabilities: [714 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] Extended Capability ID 0x2b
	Capabilities: [a00 v1] Data Link Feature <?>
	Capabilities: [a20 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [a80 v1] Extended Capability ID 0x2a
	Capabilities: [ac0 v1] Lane Margining at the Receiver <?>
	Capabilities: [b20 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000010, Page Request Allocation: 00000000
	Capabilities: [b40 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 14
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [c00 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [c10 v1] Designated Vendor-Specific: Vendor=1e98 ID=0007 Rev=0 Len=20 <?>
	Capabilities: [d70 v1] Designated Vendor-Specific: Vendor=1e98 ID=0008 Rev=0 Len=36 <?>
	Capabilities: [e00 v1] Designated Vendor-Specific: Vendor=8086 ID=0050 Rev=0 Len=12 <?>
	Capabilities: [f00 v1] Designated Vendor-Specific: Vendor=1e98 ID=0000 Rev=1 Len=56: CXL
		CXLCap:	Cache- IO+ Mem+ Mem HW Init+ HDMCount 1 Viral+
		CXLCtl:	Cache- IO+ Mem- Cache SF Cov 0 Cache SF Gran 0 Cache Clean- Viral-
		CXLSta:	Viral-
	Capabilities: [f60 v1] Designated Vendor-Specific: Vendor=1e98 ID=000a Rev=0 Len=34 <?>
$ sudo cxl list -vvv
  Warning: no matching devices found

[
]

I'm very new to the CXL world and it seems a bit messy; everything my searches turn up is unclear.

@sscargal

sscargal commented May 11, 2023

@alexisfrjp

The issue occurs because no CXL device is found in sysfs. The common causes are:

  1. Your OS/Kernel doesn't support CXL (Kernel 6.3 or later has support for volatile CXL devices).
  2. Your BIOS is set to present CXL as a regular memory device (DRAM) rather than as a Special Purpose memory device. cxl-cli can manage Special Purpose memory devices only. Intel Sapphire Rapids defaults to Special Purpose, whereas AMD defaults to treating CXL devices as DRAM, hence cxl list doesn't show the devices.
  3. Your CPU/BIOS doesn't support Type 3 devices. Intel Sapphire Rapids doesn't officially support Type 3 devices, only Type 2. Depending on your server and BIOS, you may find a 'Type 3 Legacy' option that uses Type 2 flows with Type 3 devices. AMD Genoa supports Type 3 CXL devices only, but defaults to treating them as normal RAM, not Special Purpose. There is a BIOS option to change this behavior.

Can you provide the following info, please:

  • What version of cxl are you using? (cxl version). The utility version must support the Kernel version.
  • What server make/model do you have? (dmidecode -t baseboard)
  • What CPU do you have? (lscpu)
  • What Linux distro are you using? (cat /etc/os-release)
  • What Kernel version are you using? (uname -r)
  • What does numactl -H show? Do you see the cpu-less NUMA node for the CXL device?
  • What does the BIOS e820 table show during boot? (dmesg | grep e820). You ideally want to see a 'soft reserved' entry for the CXL memory. If you see 'usable' for the CXL memory range, then the BIOS/Platform is set up not to treat CXL as Special Purpose memory, hence the cxl list message (see the example below).
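
For reference, on a platform where the BIOS does expose the CXL range as Special Purpose memory, you would expect a 'soft reserved' entry roughly like this (illustrative output, not from a real system):

$ dmesg | grep -i 'soft reserved'
[    0.000000] BIOS-e820: [mem 0x000000c050000000-0x000000c44fffffff] soft reserved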

@alexisfrjp
Author

alexisfrjp commented May 29, 2023

$ cxl version
77
$ sudo dmidecode -t baseboard
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.5.0 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
	Manufacturer: ASUSTeK COMPUTER INC.
	Product Name: Pro WS W790E-SAGE SE
	Version: Rev 1.xx
	Serial Number: 230216159100164
	Asset Tag: Default string
	Features:
		Board is a hosting board
		Board is replaceable
	Location In Chassis: Default string
	Chassis Handle: 0x0003
	Type: Motherboard
	Contained Object Handles: 0
Intel(R) Xeon(R) w9-3495X
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="38 (Server Edition)"
ID=fedora
VERSION_ID=38
VERSION_CODENAME=""
PLATFORM_ID="platform:f38"
PRETTY_NAME="Fedora Linux 38 (Server Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:38"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f38/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=38
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=38
SUPPORT_END=2024-05-14
VARIANT="Server Edition"
VARIANT_ID=server
$ uname -r
6.3.4-350.vanilla.fc38.x86_64
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 0 size: 128383 MB
node 0 free: 126151 MB
node 1 cpus:
node 1 size: 8053 MB
node 1 free: 7929 MB
node distances:
node   0   1 
  0:  10  14 
  1:  14  10
$ dmesg | grep e820
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009efff] reserved
[    0.000000] BIOS-e820: [mem 0x000000000009f000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000610cdfff] usable
[    0.000000] BIOS-e820: [mem 0x00000000610ce000-0x00000000661bafff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000661bb000-0x0000000066abafff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000066abb000-0x000000006945efff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000006945f000-0x000000006f7fefff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
[    0.000000] BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fe010000-0x00000000fe010fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000227fffffff] usable
[    0.000000] e820: update [mem 0x574e7018-0x574ef057] usable ==> usable
[    0.000000] e820: update [mem 0x574e7018-0x574ef057] usable ==> usable
[    0.000000] e820: update [mem 0x54c6a018-0x54ca1a57] usable ==> usable
[    0.000000] e820: update [mem 0x54c6a018-0x54ca1a57] usable ==> usable
[    0.000000] e820: update [mem 0x54c32018-0x54c69a57] usable ==> usable
[    0.000000] e820: update [mem 0x54c32018-0x54c69a57] usable ==> usable
[    0.000000] efi: Remove mem141: MMIO range=[0x80000000-0x8fffffff] (256MB) from e820 map
[    0.000000] e820: remove [mem 0x80000000-0x8fffffff] reserved
[    0.000000] efi: Not removing mem142: MMIO range=[0xfe010000-0xfe010fff] (4KB) from e820 map
[    0.000000] efi: Remove mem143: MMIO range=[0xff000000-0xffffffff] (16MB) from e820 map
[    0.000000] e820: remove [mem 0xff000000-0xffffffff] reserved
[    0.000016] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000018] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000836] e820: update [mem 0x7e000000-0xffffffff] usable ==> reserved
[    0.007321] e820: update [mem 0x5ea13000-0x5ea13fff] usable ==> reserved
[    0.225445] e820: update [mem 0x5b27e000-0x5b2c3fff] usable ==> reserved
[    7.169767] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
[    7.169769] e820: reserve RAM buffer [mem 0x54c32018-0x57ffffff]
[    7.169770] e820: reserve RAM buffer [mem 0x54c6a018-0x57ffffff]
[    7.169771] e820: reserve RAM buffer [mem 0x574e7018-0x57ffffff]
[    7.169771] e820: reserve RAM buffer [mem 0x5b27e000-0x5bffffff]
[    7.169772] e820: reserve RAM buffer [mem 0x5ea13000-0x5fffffff]
[    7.169773] e820: reserve RAM buffer [mem 0x610ce000-0x63ffffff]
[    7.169773] e820: reserve RAM buffer [mem 0x6f800000-0x6fffffff]
$ sudo lspci -vvv -s 70:00.0
70:00.0 CXL: Intel Corporation Device 0ddb (rev 01) (prog-if 10 [CXL Memory Device (CXL 2.x)])
	Subsystem: Intel Corporation Device 0000
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 0
	Region 0: Memory at b4a00000 (32-bit, non-prefetchable) [size=2M]
	Region 2: Memory at 2280400000 (64-bit, prefetchable) [size=128K]
	Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0
			ExtTag+ RBE+ FLReset+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 512 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp+ 10BitTagReq+ OBFF Via WAKE#, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
	Capabilities: [a0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC+ UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 04080001 fc00000f 70000124 00000000
	Capabilities: [180 v1] Extended Capability ID 0x2f
	Capabilities: [588 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [5b0 v1] Transaction Processing Hints
		Extended requester support
		Steering table in TPH capability structure
	Capabilities: [714 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] Extended Capability ID 0x2b
	Capabilities: [a00 v1] Data Link Feature <?>
	Capabilities: [a20 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [a80 v1] Extended Capability ID 0x2a
	Capabilities: [ac0 v1] Lane Margining at the Receiver <?>
	Capabilities: [b20 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000010, Page Request Allocation: 00000000
	Capabilities: [b40 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 14
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [c00 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [c10 v1] Designated Vendor-Specific: Vendor=1e98 ID=0007 Rev=0 Len=20: CXL
		Revision 0 not supported
	Capabilities: [d70 v1] Designated Vendor-Specific: Vendor=1e98 ID=0008 Rev=0 Len=36: CXL
		Block1: BIR: bar0, ID: component registers, offset: 0000000000150000
		Block2: BIR: bar0, ID: CXL device registers, offset: 0000000000180000
	Capabilities: [e00 v1] Designated Vendor-Specific: Vendor=8086 ID=0050 Rev=0 Len=12 <?>
	Capabilities: [f00 v1] Designated Vendor-Specific: Vendor=1e98 ID=0000 Rev=1 Len=56: CXL
		CXLCap:	Cache- IO+ Mem+ Mem HW Init+ HDMCount 1 Viral+
		CXLCtl:	Cache- IO+ Mem+ Cache SF Cov 0 Cache SF Gran 0 Cache Clean- Viral-
		CXLSta:	Viral-
		CXLSta2:	ResetComplete- ResetError- PMComplete-
		Cache Size Not Reported
		Range1: 0000002080000000-000000227fffffff
			Valid+ Active+ Type=Volatile Class=DRAM interleave=0 timeout=4s
		Range2: 0000000000000000-ffffffffffffffff
			Valid- Active- Type=Volatile Class=DRAM interleave=0 timeout=1s
	Capabilities: [f60 v1] Designated Vendor-Specific: Vendor=1e98 ID=000a Rev=0 Len=34: CXL
		Unknown ID 000a
	Kernel modules: cxl_pci

Is that enough? Since the device is an FPGA, I can see some traffic roughly every second.

The BIOS has the 'Type 3 Legacy' option enabled.

@sscargal

@alexisfrjp Thanks for the info. You have everything needed from the Kernel and cxl-cli perspective to work with CXL Type 3 devices. However, since cxl list -vvv doesn't report any devices, the Kernel isn't detecting any supported hardware devices.

Officially, Intel doesn't support Type 3 in the Sapphire Rapids time frame, only Type 2 is supported. Unofficially, if the BIOS vendors kept the 'Type 3 Legacy Mode' BIOS feature, CXL devices do work on the platforms I'm using (Supermicro with Xeon Gold & Platinum CPUs).

Q) Do you see any DeviceDAX entries under /dev/daxX.Y, where X.Y follows a major.instance naming scheme? The first device should be /dev/dax0.0.

Q) Does daxctl list -iDMR show any devices? (It'll report those devices, if any, under /dev/dax*)

If YES to the above, then you need to convert the 'devdax' namespace to the 'system-ram' type, and you'll see a new cpu-less/memory-only NUMA node. Use:

sudo daxctl reconfigure-device --mode=system-ram --no-online all

You can read my blog entry on how this should work - How To Extend Volatile System Memory (RAM) using Persistent Memory on Linux. Ignore the early part of the blog as it's outdated. Also, ignore the references to PMem. CXL Type 3 works the same as PMem in this use case.

Q) Do you see any entries in /dev/cxl/mem*? If not, this is why cxl list isn't reporting anything.

Q) Do you see any other pci or cxl related messages in dmesg?

If the above actions don't yield much info, the next steps are

  1. Check with the FPGA Bitstream vendor to validate that the image is working as expected or if they recommend any BIOS options.
  2. Check with ASUS whether their BIOS really does implement Type 3 or whether it's a dummy placeholder with no working code behind it. (This is a vendor-specific choice and implementation detail, given Intel's decision not to officially support Type 3 yet.)
  3. Check with Intel that your CPU SKU supports CXL.

@alexisfrjp
Author

alexisfrjp commented May 30, 2023

Thanks for your reply @sscargal !

  • My FPGA device is an example design of the Agilex-I series Intel CXL IP core
    • it's just an empty, generic CXL device exposing some memory with no other function.
  • It is not Intel Optane Persistent Memory

@alexisfrjp Thanks for the info. You have everything needed from the Kernel and cxl-cli perspective to work with CXL Type 3 devices. However, since cxl list -vvv doesn't report any devices, the Kernel isn't detecting any supported hardware devices.

But lspci and numactl show the CXL device was detected. And the device has been probed by the cxl_pci driver.

Officially, Intel doesn't support Type 3 in the Sapphire Rapids time frame, only Type 2 is supported. Unofficially, if the BIOS vendors kept the 'Type 3 Legacy Mode' BIOS feature, CXL devices do work on the platforms I'm using (Supermicro with Xeon Gold & Platinum CPUs).

Exactly. Intel has confirmed that my workstation CPU supports CXL, and I got an unofficial BIOS from ASUS enabling the 'Type 3 Legacy' BIOS feature.
Without this BIOS, Mem wasn't enabled in lspci. With it, Mem+ is enabled in lspci and numactl detects the device.

I also have the Type 2 version of the project; the result is the same.

Q) Do you see any DeviceDAX entries under /dev/daxX.Y - where X.Y is a major.instance nomenclature. The first device could/should be /dev/dax0.0.

There are no /dev/dax* files.
Isn't DAX for filesystems?

Q) Does daxctl list -iDMR show any devices? (It'll report those devices, if any, under /dev/dax*)

Nothing...

Q) Do you see any entries in /dev/cxl/mem*? If not, this is why cxl list isn't reporting anything.

Nothing...

Q) Do you see any other pci or cxl related messages in dmesg?

$ dmesg |grep cxl
[   60.267893] cxl_pci 0000:70:00.0: can't derive routing for PCI INT A
[   60.267894] cxl_pci 0000:70:00.1: enabling device (0140 -> 0142)
[   60.267896] cxl_pci 0000:70:00.0: PCI INT A: no GSI
[   60.267918] cxl_pci 0000:70:00.1: Device DVSEC not present, skip CXL.mem init

If the above actions don't yield much info, the next steps are

1. Check with the FPGA Bitstream vendor to validate that the image is working as expected or if they recommend any BIOS options.

I have the source-code of the project, it was validated and all the BIOS options are enabled.

2. Check with ASUS if their BIOS really does implement Type 3 or is it a dummy placeholder with no working code behind it. (This is a vendor-specific choice and implementation detail given Intel's decision not officially to support Type 3 yet).

The special BIOS version has been provided by Intel (provided by Asus).

3. Check with Intel that your CPU SKU supports CXL.

It has been confirmed it does.


The whole CXL workflow is excessively unclear. It's a mix of everything and nothing at the same time.
I don't even know what Persistent Memory has to do with CXL.

That's only a use case, not what CXL is. My understanding is that CXL, like PCIe, just provides a memory bus to any kind of memory: DRAM, flash, Persistent Memory, or even plain registers (flip-flops on the FPGA side).

I just have a CXL.Mem-enabled device and I'd like to map its memory space so that I can use it from user space. Why is it so complicated...

All the tools give different results: lspci and numactl tell me the device is detected and enabled. I don't even know why the kernel module cxl_pci is loaded for my device; I'm pretty sure the VendorID and DeviceID aren't part of this driver. cxl list and the rest of ndctl's tools don't even find my device.

Thanks for your help!

Edit: Let me know if I'm in the wrong project for my use-case.

@alexisfrjp
Author

Which devices should cxl-cli detect? Which memory: the CXL.io BAR or CXL.Mem?

@sscargal

Which devices should cxl-cli detect? Which memory: the CXL.io BAR or CXL.Mem?

CXL.mem

@sscargal

sscargal commented May 30, 2023

If we look back at the numactl -H output, we do see a memory-only/cpu-less NUMA node (Node 1), which is your CXL.mem device. Are you mapping 8GB from the FPGA?

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 0 size: 128383 MB <<<<<< DRAM
node 0 free: 126151 MB
node 1 cpus:  <<< Note there's no CPUs listed here (this is good)
node 1 size: 8053 MB <<<<<< CXL
node 1 free: 7929 MB
node distances:
node   0   1 
  0:  10  14 
  1:  14  10   <<<<<< CXL

Since you're not getting the /dev/cxl/mem and corresponding /dev/dax0.0 device, this happens when the CXL device isn't exposing itself as Special Purpose (SP) to UEFI (EFI_MEMORY_SP), or when the BIOS is ignoring it. Special Purpose memory devices will have a 'soft reserved' e820 entry. The Kernel will then provision the CXL.mem as a memdev with a corresponding devdax device that applications can use directly, or it can be converted to a system-ram namespace type, in which case it'll appear as a memory-only NUMA node (like your output above). See the EFI Specific Purpose Memory Support Kernel patchset for more info. Kernel 6.3 has this feature, but make sure it's enabled: the distro gets to decide the CONFIG_EFI_SOFT_RESERVE policy, so check the Kernel config options in /boot/config-$(uname -r) for CONFIG_EFI_SOFT_RESERVE=y. This presentation gives a high-level overview of the flows.
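
A quick way to check both, as a sketch assuming a Fedora-style /boot layout (soft-reserved ranges, if any, also appear as 'Soft Reserved' entries in /proc/iomem):

$ grep CONFIG_EFI_SOFT_RESERVE /boot/config-$(uname -r)
$ sudo grep -i 'soft reserved' /proc/iomem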

For non-Special Purpose memory devices, the BIOS/UEFI maps the memory as 'usable' and the Kernel treats the CXL memory as main memory (DRAM), meaning it's put into ZONE_NORMAL and used as main memory with no user control (cxl-cli/daxctl). For SP memory, you should see ZONE_MOVABLE for your CXL memory.

You can look at the memory zones using lsmem -o+ZONES,NODE.

The next obvious step is to look at the ACPI tables to see what you've configured in the FPGA.
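
One way to do that is with acpica-tools, assuming the platform publishes a CEDT (the CXL Early Discovery Table); the file names here are arbitrary:

$ sudo acpidump -n CEDT -o cedt.out
$ acpixtract -a cedt.out
$ iasl -d cedt.dat

iasl -d disassembles the extracted binary into a human-readable cedt.dsl.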

I would also look at resolving this kernel warning

[   60.267918] cxl_pci 0000:70:00.1: Device DVSEC not present, skip CXL.mem init

The CXL 2.0 spec defines several DVSEC bits as mandatory. Could you check which ones you've set and which are missing, to resolve that warning? If the kernel driver isn't initialized, this would explain some of the Kernel behavior.
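
To compare the raw DVSEC headers against the spec, a full config-space hex dump can help (root required):

$ sudo lspci -s 70:00.0 -xxxx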

@alexisfrjp
Author

alexisfrjp commented May 31, 2023

If we look back at the numactl -H output, we do see a memory-only/cpu-less NUMA node (Node 1), which is your CXL.mem device. Are you mapping 8GB from the FPGA?

Exactly, I'm mapping 8GB from the FPGA via CXL.Mem.
The Intel Memory Latency Checker (MLC) gives me results on this memory-only numa node and I can see the R/W traffic (using a logic analyzer in the FPGA).

Since you're not getting the /dev/cxl/mem and corresponding /dev/dax0.0 device, this happens when the CXL device isn't exposing itself as Special Purpose (SP) to UEFI (EFI_MEMORY_SP), or when the BIOS is ignoring it. Special Purpose memory devices will have a 'soft reserved' e820 entry. The Kernel will then provision the CXL.mem as a memdev with a corresponding devdax device that applications can use directly, or it can be converted to a system-ram namespace type, in which case it'll appear as a memory-only NUMA node (like your output above). See the EFI Specific Purpose Memory Support Kernel patchset for more info. Kernel 6.3 has this feature, but make sure it's enabled: the distro gets to decide the CONFIG_EFI_SOFT_RESERVE policy, so check the Kernel config options in /boot/config-$(uname -r) for CONFIG_EFI_SOFT_RESERVE=y. This presentation gives a high-level overview of the flows.

$ cat /boot/config-$(uname -r) | grep SOFT_RESERVE
CONFIG_EFI_SOFT_RESERVE=y

Nevertheless, the user guide I'm following for the CXL example design mentions the use of efi=nosoftreserve (cf. the kernel patch link):

$ cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.3.4-350.vanilla.fc38.x86_64 root=UUID=ebbbb014-e320-4361-b90c-e70f519a0fee ro rhgb quiet efi=nosoftreserve nopat

Moreover, thank you for the links, but like a lot of CXL-related documentation, they are full of acronyms and target BIOS/UEFI/Kernel experts, not engineers who just want to work with a CXL.Mem device.

For non-Special Purpose memory devices, the BIOS/UEFI maps the memory as 'usable' and the Kernel treats the CXL memory as main memory (DRAM), meaning it's put into ZONE_NORMAL and used as main memory with no user control (cxl-cli/daxctl). For SP memory, you should see ZONE_MOVABLE for your CXL memory.

You can look at the memory zones using lsmem -o+ZONES,NODE.

$ lsmem -o+ZONES,NODE
RANGE                                  SIZE  STATE REMOVABLE BLOCK  ZONES NODE
0x0000000000000000-0x000000007fffffff    2G online       yes     0   None    0
0x0000000100000000-0x000000207fffffff  126G online       yes  2-64 Normal    0
0x0000002080000000-0x000000227fffffff    8G online       yes 65-68 Normal    1

Memory block size:         2G
Total online memory:     136G
Total offline memory:      0B

It looks like it's system-ram only, and what I would like to have is SP, correct?

I will try to compile the same kernel with CONFIG_EFI_SOFT_RESERVE=n.

The next obvious step is to look at the ACPI tables to see what you've configured in the FPGA.

I will have a look at it.

I would also look at resolving this kernel warning

[   60.267918] cxl_pci 0000:70:00.1: Device DVSEC not present, skip CXL.mem init

The CXL 2.0 spec defines several DVSEC bits as mandatory. Could you check which ones you've set and which are missing, to resolve that warning? If the kernel driver isn't initialized, this would explain some of the Kernel behavior.

Indeed, I will open a ticket at Intel. They advertised the CXL IP core as CXL 2.0.

I now understand why I can see memory traffic in the FPGA even though nothing is running; it is actually the kernel using it as system-ram. I can even see it in top/htop.

Thank you very much, it has been very fruitful and I have more directions to investigate.

@sscargal

Thanks for the update.

It looks like it's system-ram only, and what I would like to have is SP, correct?

Correct. The FPGA is working correctly in the non-SP mode, which is fine for some use cases, but you have no control or management of the CXL device. As you observed, the Kernel is free to allocate memory from either DRAM or CXL, which is not what most people want (IMHO). As an aside, what you have now is the default behavior on AMD Genoa servers, and there's a BIOS option to switch CXL to be SP memory, like Intel. Intel's default behavior is to treat CXL as Special Purpose memory and manage it with the cxl and daxctl utilities. Having the memdev and devdax gives us control over what application(s) use CXL, when, and how.

I recommend removing the efi=nosoftreserve Kernel boot option and seeing what happens, as this option is a blocker to your desired goal. There's no need to recompile the Kernel with CONFIG_EFI_SOFT_RESERVE=n, as that would force all CXL to be non-SP memory.
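
On Fedora, one way to drop the option without hand-editing the bootloader config is grubby (a sketch; adjust to your boot setup):

$ sudo grubby --update-kernel=ALL --remove-args="efi=nosoftreserve"
$ sudo reboot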

Lastly, there are some notes on CXL Kernel development here that you will find useful in your CXL journey.

@alexisfrjp
Author

Thank you Steve!

After removing efi=nosoftreserve, I can now see the /dev/cxl and /dev/dax device files.

Regarding the dmesg warning:

[ 60.267918] cxl_pci 0000:70:00.1: Device DVSEC not present, skip CXL.mem init

https://github.com/torvalds/linux/blob/929ed21dfdb6ee94391db51c9eedb63314ef6847/drivers/cxl/pci.c#L681
PCI_DVSEC_VENDOR_ID_CXL=1e98 and CXL_DVSEC_PCIE_DEVICE=0,
but my device exposes the CXL DVSEC: Capabilities: [f00 v1] Designated Vendor-Specific: Vendor=1e98 ID=0000 Rev=1 Len=56: CXL.
Therefore, I think there is a mistake in pci_find_dvsec_capability() or the cxl_pci code.

Is cxl_pci just an example driver for CXL (a base for writing our own custom driver), or the definitive driver to be used for all CXL devices?
I'll soon start porting our PCIe driver, using CXL.io first in a transparent manner, then CXL.Mem.
And why not (I haven't checked yet) use remap_pfn_range() or similar so that user space can mmap() the HDM memory without global /dev/mem or /dev/dax files. Nevertheless, I still don't know exactly what the best/common practices are for using CXL.Mem memory.
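
For illustration, the kind of user-space access I have in mind would be roughly this minimal sketch (assuming a devdax-mode /dev/dax0.0 with 2 MiB alignment):

/* Minimal sketch: map a device-DAX node and touch the CXL memory.
 * Assumes /dev/dax0.0 exists in devdax mode; length and offset must
 * be multiples of the device alignment (2 MiB here). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 2UL << 20; /* one 2 MiB chunk */
	int fd = open("/dev/dax0.0", O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

	((volatile unsigned char *)p)[0] = 0x42; /* write into the HDM range */
	munmap(p, len);
	close(fd);
	return 0;
}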

Thank you for this link, it'll definitely be useful! And thanks for your help; everything is now much clearer, and I will keep reading resources and stay tuned!

@sscargal

sscargal commented Jun 1, 2023

After removing efi=nosoftreserve, I can now see the /dev/cxl and /dev/dax device files.

Great progress! Thanks for the update.

Hopefully, this command will let you convert the devdax device into system-ram mode and see the new NUMA node that applications can use. It'll look like your first numactl -H output.

sudo daxctl reconfigure-device --mode=system-ram --no-online all

Note: This is not persistent across reboots, so you'll need to write an /etc/init.d script to run it on each boot - if that's what you want/need.
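
Since Fedora is systemd-based, a one-shot unit is the more natural fit than /etc/init.d; a minimal sketch (the unit name and daxctl path are illustrative):

# /etc/systemd/system/cxl-system-ram.service
[Unit]
Description=Reconfigure CXL devdax devices as system-ram

[Service]
Type=oneshot
ExecStart=/usr/bin/daxctl reconfigure-device --mode=system-ram --no-online all

[Install]
WantedBy=multi-user.target

Enable it once with sudo systemctl enable cxl-system-ram.service.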

With the NUMA node established, you can use numactl to allow an application to use DRAM-only, CXL-only, or interleave between the two.
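
For example (node numbers taken from your numactl -H output; ./my_app is a placeholder):

$ numactl --membind=0 ./my_app        # DRAM only (node 0)
$ numactl --membind=1 ./my_app        # CXL only (node 1)
$ numactl --interleave=0,1 ./my_app   # interleave across both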

Therefore, I think there is a mistake in pci_find_dvsec_capability() or the cxl_pci code.

I don't see the error on real CXL Type 3 devices, including some that are FPGA implementations. Looking at the code, this shouldn't impact what you need to move forward, so it's up to you whether you want to resolve it, as doing so will require implementing the missing data within the FPGA.

Your FPGA bitstream is not 100% feature-complete with respect to the DVSEC, and likely elsewhere too, which is what I'd expect. The FPGA route takes you down the 'CXL Device Vendor' path, where some input/development is required on your side to fill in the gaps.

As you said, the error originates from trying to access the DVSEC features. From https://github.com/torvalds/linux/blob/929ed21dfdb6ee94391db51c9eedb63314ef6847/drivers/pci/pci.c#L753, the comment says why:

/**
 * pci_find_dvsec_capability - Find DVSEC for vendor
 * @dev: PCI device to query
 * @vendor: Vendor ID to match for the DVSEC
 * @dvsec: Designated Vendor-specific capability ID
 *
 * If DVSEC has Vendor ID @vendor and DVSEC ID @dvsec return the capability
 * offset in config space; otherwise return 0.
 */
u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec)
{
	int pos;

	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
	if (!pos)
		return 0;

	while (pos) {
		u16 v, id;

		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &v);
		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &id);
		if (vendor == v && dvsec == id)
			return pos;

		pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
	}

	return 0;
}

Your device implements the Vendor ID (1e98) and some DVSEC info, but the problem is accessing the Extended DVSEC info:

	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);

That call looks for the fields described in this comment:

/**
 * pci_find_ext_capability - Find an extended capability
 * @dev: PCI device to query
 * @cap: capability code
 *
 * Returns the address of the requested extended capability structure
 * within the device's PCI configuration space or 0 if the device does
 * not support it.  Possible values for @cap include:
 *
 *  %PCI_EXT_CAP_ID_ERR		Advanced Error Reporting
 *  %PCI_EXT_CAP_ID_VC		Virtual Channel
 *  %PCI_EXT_CAP_ID_DSN		Device Serial Number
 *  %PCI_EXT_CAP_ID_PWR		Power Budgeting
 */
u16 pci_find_ext_capability(struct pci_dev *dev, int cap)
{
	return pci_find_next_ext_capability(dev, 0, cap);
}

You can see some of this is missing in the lspci output above:

        Capabilities: [c00 v1] Device Serial Number 00-00-00-00-00-00-00-00 <<<
	Capabilities: [c10 v1] Designated Vendor-Specific: Vendor=1e98 ID=0007 Rev=0 Len=20: CXL
		Revision 0 not supported <<<

This is an exercise for you to implement.

Is cxl_pci just an example driver for CXL (a base for writing our own custom driver), or the definitive driver to be used for all CXL devices?

The Kernel drivers are still under development relative to the CXL specifications (mainly CXL 3.0 at this point), but they have matured enough for CXL 1.1 and 2.0 to be production-ready for real CXL devices (including FPGAs). The drivers are further along the CXL roadmap than the CPUs. The drivers implement the CXL specifications, so they are vendor-neutral and should work with all CXL devices. CXL device vendors are free to write their own custom drivers to add features and functionality for their devices beyond what the CXL specs define. You have the necessary Kernel version (6.3 or newer), although if you encounter OS/Kernel problems, it's worth reaching out to the CXL Kernel community for help or to file bugs.

@alexisfrjp
Author

alexisfrjp commented Jun 5, 2023

Unfortunately, it's still loaded as system-ram.

Hopefully, this command will let you convert the devdax device into system-ram mode and see the new NUMA node that applications can use. It'll look like your first numactl -H output.

sudo daxctl reconfigure-device --mode=system-ram --no-online all

I see the dax device file but it's already in system-ram mode.

$ ls /dev | grep dax
dax0.0

$ daxctl list -iDMR
[
  {
    "path":"\/platform\/hmem.0",
    "id":0,
    "size":8589934592,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":8589934592,
        "target_node":1,
        "align":2097152,
        "mode":"system-ram"
      }
    ]
  }
]

I tried your command:

$ sudo daxctl reconfigure-device --mode=system-ram --no-online all
dax0.0: error: kernel policy will auto-online memory, aborting
error reconfiguring devices: Device or resource busy
reconfigured 0 devices

Checking lsmem, it's still "online".

$ lsmem -o+ZONES,NODE
RANGE                                  SIZE  STATE REMOVABLE BLOCK  ZONES NODE
0x0000000000000000-0x000000007fffffff    2G online       yes     0   None    0
0x0000000100000000-0x000000207fffffff  126G online       yes  2-64 Normal    0
0x0000002080000000-0x000000227fffffff    8G online       yes 65-68 Normal    1

Memory block size:         2G
Total online memory:     136G
Total offline memory:      0B

and numactl:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 0 size: 128276 MB
node 0 free: 125964 MB
node 1 cpus:
node 1 size: 8192 MB
node 1 free: 8190 MB
node distances:
node   0   1 
  0:  10  14 
  1:  14  10 

Q) Do you see any entries in /dev/cxl/mem*? If not, this is why cxl list isn't reporting anything.

Sorry, I still don't see any /dev/cxl/mem* entries.

Removing efi=nosoftreserve enabled the dax device file and nothing else.


DVSEC issues

Of course I want to fix all the issues even if they aren't blocking. The more I solve, the more I learn.
For our first move with CXL, we decided to play with something that was supposed to be already working and validated, hence my surprise. We use the encrypted Intel CXL IP core, which trains as a CXL 1.1 device and implements some CXL 2.0 features.

Moreover, my company is a member of the CXL Consortium, so I have access to all the specs. The ultimate goal is to develop our own core.

You can see some of this is missing in the lspci output above:

        Capabilities: [c00 v1] Device Serial Number 00-00-00-00-00-00-00-00 <<<
	Capabilities: [c10 v1] Designated Vendor-Specific: Vendor=1e98 ID=0007 Rev=0 Len=20: CXL
		Revision 0 not supported <<<

This is an exercise for you to implement.

Regarding the 'Revision 0 not supported' message: lspci simply doesn't have the code to parse this DVSEC. I checked lspci's source and its CXL support is very minimal. I changed the code, compiled it, and it was then able to parse the DVSEC capability.

Here is the result:

	Capabilities: [c10 v1] Designated Vendor-Specific: Vendor=1e98 ID=0007 Rev=0 Len=20: CXL
		FBCap:	Cache- IO+ Mem- 68BFlit- MltLogDev-
		FBCtl:	Cache- IO+ Mem- SynHdrByp- DrftBuf- 68BFlit- MltLogDev- RCD- Retimer1- Retimer2-
		FBSta:	Cache- IO+ Mem- SynHdrByp- DrftBuf- 68BFlit- MltLogDev-

I still think there is an issue in the kernel/driver code, since lspci is able to detect all of the device's PCIe extended DVSEC capabilities. The DVSEC ID=0000 is clearly present.

I'm definitely not a big fan of these CXL 1.1 devices with 2.0 features... It's very confusing.


Understood, thank you! It's perfect if it's production-ready for CXL 1.1 and 2.0!

@alexisfrjp
Author

I was able to force it:

$ sudo daxctl reconfigure-device --mode=system-ram --no-online -f dax0.0
dax0.0 was already in system-ram mode
[
  {
    "chardev":"dax0.0",
    "size":8589934592,
    "target_node":1,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":3,
    "total_memblocks":4
  }
]
reconfigured 1 device

It lost 2GB.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 0 size: 128276 MB
node 0 free: 124809 MB
node 1 cpus:
node 1 size: 6144 MB
node 1 free: 6142 MB
node distances:
node   0   1
  0:  10  14
  1:  14  10

@sscargal

sscargal commented Jun 6, 2023

Great progress & updates!

Regarding this message:

$ sudo daxctl reconfigure-device --mode=system-ram --no-online all
dax0.0: error: kernel policy will auto-online memory, aborting <<<<<<
error reconfiguring devices: Device or resource busy
reconfigured 0 devices

Fedora is one of the distros that compile the Kernel to auto-online hotplugged memory; this is a distro choice.

You can change this behavior using the following. The Kernel config file should include:

# grep ONLINE /boot/config-$(uname -r)
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

To disable this feature:

/// Check the current value
# cat /sys/devices/system/memory/auto_online_blocks
online 

/// Prevent the Kernel from onlining memory 
# echo offline > /sys/devices/system/memory/auto_online_blocks

/// Confirm the change was successful
# cat /sys/devices/system/memory/auto_online_blocks
offline

This change doesn't persist across system reboots.
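
To make the offline default persist across reboots, one option (per the Kernel memory hotplug documentation) is the memhp_default_state boot parameter, e.g. via grubby on Fedora:

$ sudo grubby --update-kernel=ALL --args="memhp_default_state=offline"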

See the Kernel Hot Plug documentation to learn more about how to enable/disable memory blocks and change their zone.


I love the changes to lspci. Thank you! lspci is understandably not on the radar of the CXL Kernel devs, so it's not surprising that it's missing a lot of CXL support. Please submit the patches for review and upstreaming so others can benefit.

https://lore.kernel.org/linux-cxl/ is useful for viewing the Linux Kernel CXL developer mailing list. Feel free to join the list if you want to. It's highly active. This is the list to email for support, suggestions, or patch submissions.


Regarding this action:

$ sudo daxctl reconfigure-device --mode=system-ram --no-online -f dax0.0
dax0.0 was already in system-ram mode
[
  {
    "chardev":"dax0.0",
    "size":8589934592,
    "target_node":1,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":3,
    "total_memblocks":4
  }
]
reconfigured 1 device

Since the Kernel auto-onlines memory, the --no-online option has no effect, as it can't override the Kernel's default behavior.

The reason you lost 2GB is that only 3 of the 4 memory blocks are ONLINE:

    "online_memblocks":3,
    "total_memblocks":4

The Kernel uses 2GiB memory block sizes. I suspect the loss of one block is an alignment problem caused by forcing online an already-online devdax ('dax0.0 was already in system-ram mode'). Rebooting the host will bring this back to the defaults.
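
If a reboot is inconvenient, the offline block can usually be brought online by hand through sysfs (the block number is illustrative; take the real one from lsmem):

# echo online_movable > /sys/devices/system/memory/memory68/state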

@alexisfrjp
Author

alexisfrjp commented Jun 6, 2023

Now I have a lot to read and understand, and I will come back to you later.
Of course I always submit patches. I will follow the mailing list, but I'm not sure I have enough knowledge to participate yet.
Thanks @sscargal !

(I also have to check exactly why the kernel doesn't find the PCIe extended capability for the CXL DVSEC whereas lspci does.)

@Yemaoxin

@sscargal Hello, I have read the entire conversation. The issue I am facing is that I want to use CXL as universal RAM and make it an independent NUMA node. However, I am using a QEMU-emulated DRAM device, and for some reason, when I run numactl -H, it cannot find any zNUMA nodes.

@Yemaoxin

QEMU Version: 8.0.50
Linux kernel: 6.3.4

@Yemaoxin

Your help would be greatly appreciated!

@alexisfrjp
Author

alexisfrjp commented Jun 10, 2023

@Yemaoxin Please open your own ticket; your problem is different from mine.
