



# NVIDIA GB NVL72 System Validation Guide

Rik Kisnah NVIDIA Confidential Oracle Labs - NVL  
1122395 2025-11-28 16:00:27

# Document History

VG-12022-001\_v09

| <b>Version</b> | <b>Date</b>        | <b>Authors</b> | <b>Description of Change</b>                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|----------------|--------------------|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 01             | September 11, 2024 | JS, SM         | Initial release                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| 02             | October 3, 2024    | JS, SM         | <ul style="list-style-type: none"> <li>&gt; Modified doc type from “Application Note” to “Validation Guide”</li> <li>&gt; Added Chapter 2 “Leveraging Guidelines for NVL36 Transition to NVL72”</li> <li>&gt; Updated P/F criteria for L10 and L11 box-package testing</li> <li>&gt; Updated test descriptions for items under L10 and L11 Reliability</li> <li>&gt; Added guidelines on RMII NC-SI validation</li> <li>&gt; Added guidelines for Input EDPP testing</li> </ul> |
| 03             | November 7, 2024   | JS, SM         | <ul style="list-style-type: none"> <li>&gt; Added PDB static and dynamic load validation for L10</li> <li>&gt; Added CPU JTAG validation for L10</li> <li>&gt; Updated SPI functional validation procedures</li> <li>&gt; Added timing specification for SPI</li> </ul>                                                                                                                                                                                                         |
| 04             | December 23, 2024  | JS, SM         | <ul style="list-style-type: none"> <li>&gt; Updated Section 3.4.1 for L10 level input EDPP testing</li> <li>&gt; Updated Section 3.6.8 SGPIO electrical validation for SGPIO timing parameters</li> <li>&gt; Updated Section 5.1.8 Rack Power EDPP</li> <li>&gt; Added Chapter 6 Appendix with sections on input EDPP waveform capture using HiRes mode and input EDPP post-processing guidelines</li> </ul>                                                                    |
| 05             | March 5, 2025      | JS, GP, SM     | <ul style="list-style-type: none"> <li>&gt; Added Section 1.2.4 related to NVNetPerf Tool</li> <li>&gt; Added Section 3.7 on network I/O board validation.</li> <li>&gt; Added Section 3.9.4 on information for performing L10 partner diagnostics WAT</li> <li>&gt; Added Table 3-12 on loopback test cables for BF-3, CX-7 and NVLink for partner diagnostics.</li> <li>&gt; Updated Table 3-18. EMC Tests Summary with updated test conditions</li> </ul>                    |
| 06             | April 14, 2025     | JS, SM         | <ul style="list-style-type: none"> <li>&gt; Removed line item SY.18 from Table 2-1</li> <li>&gt; Updated Section 5.1.8 “Rack Power AC Apparent Power”</li> <li>&gt; Updated test conditions in Table 5-1, Table 5-2, and Table 5-3</li> </ul>                                                                                                                                                                                                                                   |
| 07             | July 28, 2025      | GP, SM         | <ul style="list-style-type: none"> <li>&gt; Updated this validation guide to make it common for both GB200 NVL72 and GB300 NVL72.</li> <li>&gt; Updated the title to <i>NVIDIA GB NVL72 System Validation Guide</i>.</li> <li>&gt; Updated Section 3.7 to cover both CX-7 and CX-8 validation.</li> <li>&gt; Updated Table 3-12 with CX-8 loopback cable details.</li> </ul> |
| 08             | September 10, 2025 | GP, SM         | Added Chapter 7 “Validation Leverage Guidance for GB300 NVL72 Systems”. |
| 09             | October 20, 2025   | GP, SM         | Updated system package testing information in Table 3-13 and Table 5-1. |


# Table of Contents

| Section    | Title                                                                 |
|------------|-----------------------------------------------------------------------|
| Chapter 1  | Introduction                                                          |
| 1.1        | Partner Validation Playbook                                           |
| 1.2        | Partner Validation Toolset                                            |
| 1.2.1      | NVQual                                                                |
| 1.2.2      | NVSSVT                                                                |
| 1.2.3      | NVRASTool                                                             |
| 1.2.4      | NVNetPerf Tool                                                        |
| 1.3        | Minimum Hardware Configuration for Partner Validation                 |
| 1.4        | Attachments                                                           |
| Chapter 2  | Leveraging Guidelines for NVL36 Transition to NVL72                   |
| Chapter 3  | L10 Compute Tray Hardware Validation                                  |
| 3.1        | Mechanical                                                            |
| 3.2        | Thermal                                                               |
| 3.2.1      | Host BMC Shdn, Event Logging for THERM_OVERT_N Assertion              |
| 3.2.2      | Host BMC Shdn, Event Logging for FPGA_OVERT_N Assertion               |
| 3.3        | Liquid Cooling                                                        |
| 3.3.1      | Leak Detection                                                        |
| 3.4        | Power Validation                                                      |
| 3.4.1      | Input EDP Peak Testing                                                |
| 3.4.2      | GPU Power Brake                                                       |
| 3.4.3      | Power Distribution Board Static Loads                                 |
| 3.4.4      | Power Distribution Board Dynamic Loads                                |
| 3.5        | PCIe                                                                  |
| 3.5.1      | PCIe Speed and Width Configuration Check                              |
| 3.5.2      | PCIe Link Training and Status State Machine                           |
| 3.6        | I/O Interface                                                         |
| 3.6.1      | CPU JTAG Scan Check from BMC                                          |
| 3.6.2      | Dual CPU JTAG Scan Check from BMC                                     |
| 3.6.3      | CPU JTAG Electrical Validation                                        |
| 3.6.4      | I2C Functional Bus Scan                                               |
| 3.6.5      | I2C Electrical Validation                                             |
| 3.6.6      | SPI Functional Check                                                  |
| 3.6.7      | SPI Electrical Validation                                             |
| 3.6.8      | GPIO Electrical Validation                                            |
| 3.6.9      | UART Functional Check                                                 |
| 3.6.10     | UART Electrical Validation                                            |
| 3.6.11     | RMII Network Controller Sideband Interface                            |
| 3.6.12     | RMII NC-SI Electrical Validation                                      |
| 3.7        | Networking I/O Board                                                  |
| 3.7.1      | CX-7 and CX-8 Mezzanine Network Card Ethernet Validation              |
| 3.7.2      | CX-7 and CX-8 Mezzanine Network Card I/O Board InfiniBand Validation  |
| 3.8        | BMC                                                                   |
| 3.8.1      | BMC FRU Write                                                         |
| 3.8.2      | BMC Power Control (STANDBY, RUN, AUX)                                 |
| 3.9        | System                                                                |
| 3.9.1      | System Reboot and Power Cycle Stress Testing                          |
| 3.9.2      | Wide Area Test                                                        |
| 3.9.3      | Reboot WAT                                                            |
| 3.9.4      | L10 Partner Diagnostics WAT                                           |
| 3.10       | Environmental, Reliability, and Electromagnetic Compatibility         |
| 3.10.1     | L10 Package Testing                                                   |
| 3.10.2     | Shock and Vibration Test                                              |
| 3.10.3     | Environmental Reliability Test                                        |
| 3.10.4     | Electromagnetic Compatibility                                         |
| Chapter 4  | L10 Compute Tray System Software                                      |
| 4.1        | System Software Validation Guidelines                                 |
| Chapter 5  | L11 Server Rack Hardware Validation                                   |
| 5.1        | Rack Power                                                            |
| 5.1.1      | Rack Power Startup and Shutdown                                       |
| 5.1.2      | Rack Power Startup and Shutdown without PSC                           |
| 5.1.3      | Rack Power Load Voltage Regulators under Static Load                  |
| 5.1.4      | Rack Power Load Voltage Regulators under Dynamic Load                 |
| 5.1.5      | Rack Power Load Hot Swap PSU                                          |
| 5.1.6      | Rack Power Load Hot Swap PSC                                          |
| 5.1.7      | Rack Power Fault Recovery                                             |
| 5.1.8      | Rack Power AC Apparent Power                                          |
| 5.1.9      | Rack Power Noise                                                      |
| 5.1.10     | Rack Power Firmware Update                                            |
| 5.1.11     | PSC Reboot Cycles                                                     |
| 5.1.12     | Rack Power Factor                                                     |
| 5.2        | L11 Environmental, Reliability                                        |
| 5.2.1      | L11 Packaging                                                         |
| 5.2.2      | L11 Shock and Vibration Test                                          |
| 5.2.3      | L11 Environmental Reliability Test                                    |
| Chapter 6  | Appendix 1: Oscilloscope Acquisition Modes for EDPP Measurements      |
| 6.1        | Moving Average Measurement using Oscilloscope HiRes Mode              |
| 6.2        | Input EDPP Post-Processing Guideline                                  |


# List of Figures

| Figure      | Title                                                        |
|-------------|--------------------------------------------------------------|
| Figure 3-1  | CPU JTAG Electrical Validation                               |
| Figure 3-2  | SPI Electrical Validation                                    |
| Figure 3-3  | SGPIO Bus Overview                                           |
| Figure 3-4  | SGPIO Timing Diagram                                         |
| Figure 3-5  | UART T<sub>BIT</sub> Definition                              |
| Figure 3-6  | RMII NC-SI Diagram                                           |
| Figure 3-7  | RMII NC-SI DC Specifications                                 |
| Figure 3-8  | RMII NC-SI AC Specifications                                 |
| Figure 5-1  | L11 Power Measurement Block Diagram                          |
| Figure 5-2  | Power Whip Wiring Diagram                                    |
| Figure 5-3  | Current Probes and Voltage Probes Setup and Power Whip       |
| Figure 5-4  | Connected Power Cable and Current Probe Attachments Example  |
| Figure 6-1  | Acquisition Mode                                             |
| Figure 6-2  | Oscilloscope Setting 200 µs Moving Average Example           |

# List of Tables

| Table       | Title                                                        |
|-------------|--------------------------------------------------------------|
| Table 1-1   | L10 Validation Minimum Hardware Configuration                |
| Table 1-2   | L11 Validation Minimum Hardware Configuration                |
| Table 2-1   | 2RU to 1RU Compute Tray Transition Test Items                |
| Table 3-1   | Mechanical Checklist                                         |
| Table 3-2   | PCIe Link Specification and Minimum Hardware Configuration   |
| Table 3-3   | Grace CPU PCIe LTSSM                                         |
| Table 3-4   | JTAG Timing Specification                                    |
| Table 3-5   | SPI Timing Parameters                                        |
| Table 3-6   | SGPIO Timing Parameters                                      |
| Table 3-7   | UART DC Specifications                                       |
| Table 3-8   | UART AC Specifications                                       |
| Table 3-9   | List of FRU EEPROMs in GB200 and GB300 Bianca Compute Tray   |
| Table 3-10  | Compute Tray WAT                                             |
| Table 3-11  | Reboot WAT                                                   |
| Table 3-12  | BF-3, CX-7, CX-8, NVLink, Loopback Cable, and Test Cage      |
| Table 3-13  | L10 Packaging Test Summary                                   |
| Table 3-14  | Shock and Vibration Test Summary                             |
| Table 3-15  | Mechanical Shock – Half Sine (Operating)                     |
| Table 3-16  | Mechanical Shock – Trapezoidal (Non-operating)               |
| Table 3-17  | Reliability Test Summary                                     |
| Table 3-18  | EMC Tests Summary                                            |
| Table 4-1   | Reference Documentation and Collateral                       |
| Table 4-2   | System Software Validation Example                           |
| Table 5-1   | L11 Packaging Test Summary                                   |
| Table 5-2   | L11 Shock and Vibration Test Summary                         |
| Table 5-3   | L11 Reliability Test Summary                                 |
| Table 6-1   | Input EDPp and Samples per Window in Sample Mode             |

---

# Chapter 1. Introduction

This validation guide describes the tools and methodologies that NVIDIA provides to validate and test partner platforms built for the NVIDIA GB200 NVL36, GB200 NVL72, and GB300 NVL72 systems. The validation procedures and tools in this guide are provided as a reference. Partners must perform any additional qualification that applies to their own customizations.

## 1.1 Partner Validation Playbook

The Partner Validation Playbook (PVP) provides a systematic and standardized way to validate NVIDIA data center products, aiming to improve test effectiveness, product quality, and time-to-market.

The PVP is a workbook with detailed validation items for each ES, QS, and PS release milestone. This validation guide complements the PVP and provides instructions for completing these validation line items.

Refer to the *GB200 NVL System Partner Validation Playbook* (NVOnline: 1115639) and the *GB300 NVL72 System Partner Validation Playbook* (NVOnline: 1128352) for the validation line items. Partners should complete every validation line item to uncover potential issues before production.

This validation guide may not cover every test item outlined in the PVP. Partners should adapt the procedures in this guide to best fit their own validation process.

Partners should also report their validation status and issues through the NVOnline portal. Refer to the *NVOnline Partner Portal Guidelines* (NVOnline: 1106745) for best practices on validation review requests and NVBugs submissions.

## 1.2 Partner Validation Toolset

This section describes the toolset available for partner validation of GB200 and GB300 NVL systems. Refer to the latest *GB200 NVL72 Partner Enablement Deck* (NVOnline: 1114628) and *GB300 NVL72 Partner Enablement Deck* (NVOnline: 1124881) for the required toolkit version for each sampling milestone.

### 1.2.1 NVQual

NVQual is a software program that allows partners to qualify NVIDIA hardware in their systems. Partners must run NVQual and submit the results to NVIDIA for review as part of their validation process.

NVQual can be downloaded from the online posting *NVQual for NVIDIA Grace-Blackwell Platforms* (NVOnline: 1119931).

### 1.2.2 NVSSVT

NVIDIA System Software Validation Toolkit (NVSSVT) is a toolkit developed for server platforms and includes a binary that contains the system management, performance, and other validation suites. Partners must run the NVSSVT tool and submit the results along with the PVP as part of the validation process.

NVSSVT can be downloaded from the online posting *NVIDIA System Software Validation Tool (NVSSVT)* (NVOnline: 1108364).



**Note:** The validation line items mentioned in this validation guide can be validated using the NVSSVT tool.

The following procedures are for partner reference only. After referring to this validation guide, partners must update their test plans to match their own BMC software implementation.

### 1.2.3 NVRASTool

The NVIDIA Reliability, Availability, and Serviceability (RAS) Tool provides a unified way to test the RAS features of NVIDIA hardware. The tool provides test coverage for out-of-band (OOB) and in-band hardware error injection, as well as AML software injection. NVIDIA RASTool (NVRASTool) is available on NVOnline: 1112947.



**Note:** NVRASTool runs on a machine separate from the system under test. It executes the tests remotely and automatically, without user intervention, and gathers the logs.

### 1.2.4 NVNetPerf Tool

The NVNetPerf Tool is a performance tool that configures multiple nodes with DPUs and CX cards and runs workloads on them to stress the system and identify failures and configuration issues that might not be detected when testing individual DPU or CX cards. This system-level test reduces the risks associated with deployment. The NVNetPerf test also assesses both North-South (N-S) and East-West (E-W) network traffic to confirm that the system operates with the most optimized and up-to-date configurations and software.

The NVNetPerf Tool and its user guide, which details the test configuration, procedure, and performance acceptance criteria, are available in the NVNetPerf Kit (formerly DOCA Perf test), NVOnline: 1119887.

## 1.3 Minimum Hardware Configuration for Partner Validation

For faster time to market, partners can start their validation process while building out their compute tray and their multi-node server rack. Partners do not need to populate the entire compute tray or entire server rack to start their validation.

The PVP highlights the minimum hardware configuration that is required to perform each validation test item. Partners can jump-start their validation at ES with the minimum hardware configuration to uncover any issues early on.

Table 1-1 lists the minimum hardware configurations for compute tray (L10) validation. Table 1-2 lists the minimum hardware configurations for server rack (L11) validation.

**Table 1-1. L10 Validation Minimum Hardware Configuration**

| Minimum Hardware Configuration | Description |
|--------------------------------|-------------|
| Core Tray                      | <p>2x Bianca boards</p> <ul style="list-style-type: none"> <li>&gt; 1x CPU per board</li> <li>&gt; 2x GPUs per board</li> </ul> <p>Accessory boards:</p> <ul style="list-style-type: none"> <li>&gt; HMC integrated</li> <li>&gt; BMC interposer card and BMC integrated</li> <li>&gt; M.2 riser card and M.2 installed</li> <li>&gt; 1x Front I/O board</li> <li>&gt; 1x Power distribution board (PDB)</li> </ul> |
| Core Tray + GPU-Less           | <p>2x Bianca boards</p> <ul style="list-style-type: none"> <li>&gt; 1x CPU per board</li> <li>&gt; 0x GPUs per board</li> </ul> <p>Accessory boards:</p> <ul style="list-style-type: none"> <li>&gt; HMC integrated</li> <li>&gt; BMC interposer card and BMC integrated</li> <li>&gt; M.2 riser card and M.2 installed</li> <li>&gt; 1x Front I/O board</li> <li>&gt; 1x Power distribution board (PDB)</li> </ul> |
| Core Tray + NVLink loopbacks   | <p>2x Bianca boards</p> <ul style="list-style-type: none"> <li>&gt; 1x CPU per board</li> <li>&gt; 2x GPUs per board</li> </ul> <p>Accessory boards:</p> <ul style="list-style-type: none"> <li>&gt; HMC integrated</li> <li>&gt; BMC interposer card and BMC integrated</li> <li>&gt; M.2 riser card and M.2 installed</li> <li>&gt; 1x Front I/O board</li> <li>&gt; 1x Power distribution board (PDB)</li> <li>&gt; NVLink loopback cable installed</li> </ul> |
| Core Tray + TPM + CPU only     | <p>2x Bianca boards</p> <ul style="list-style-type: none"> <li>&gt; 1x CPU per board</li> <li>&gt; 0x GPUs per board</li> </ul> <p>Accessory boards:</p> <ul style="list-style-type: none"> <li>&gt; HMC integrated</li> <li>&gt; BMC interposer card and BMC integrated</li> <li>&gt; M.2 riser card and M.2 installed</li> <li>&gt; 1x Front I/O board</li> <li>&gt; 1x Power distribution board (PDB)</li> <li>&gt; 1x TPM module</li> </ul> |
| Full Tray                      | All L10 BOM fully integrated |
| Full Tray + GPU-Less           | All L10 BOM fully integrated except the GPUs |
| Full Tray + NVLink loopbacks   | All L10 BOM fully integrated with NVLink loopback cables installed |
| PDB + E-Load                   | Power distribution board with electronic load test equipment |
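As an illustration of how the core tray rows of Table 1-1 relate to each other, they can be encoded as data and used to check which configurations a partially built tray already satisfies. This is a minimal Python sketch under stated assumptions: `TrayBuild`, `CORE_ACCESSORIES`, `CONFIGS`, and `satisfied_configs` are hypothetical names, and the accessory labels are shorthand for the table entries, not an NVIDIA data format.

```python
from dataclasses import dataclass, field


# Hypothetical inventory of a partially built compute tray (illustrative only).
@dataclass
class TrayBuild:
    bianca_boards: int = 0
    cpus_per_board: int = 0
    gpus_per_board: int = 0
    accessories: set[str] = field(default_factory=set)


# Accessory boards shared by every core tray row in Table 1-1 (shorthand labels).
CORE_ACCESSORIES = {
    "HMC",
    "BMC interposer + BMC",
    "M.2 riser + M.2",
    "Front I/O board",
    "PDB",
}

# Core tray rows of Table 1-1: (boards, CPUs per board, GPUs per board, extra items).
CONFIGS = {
    "Core Tray": (2, 1, 2, set()),
    "Core Tray + GPU-Less": (2, 1, 0, set()),
    "Core Tray + NVLink loopbacks": (2, 1, 2, {"NVLink loopback cable"}),
    "Core Tray + TPM + CPU only": (2, 1, 0, {"TPM module"}),
}


def satisfied_configs(build: TrayBuild) -> list[str]:
    """Return the core tray configurations (in table order) that the build matches."""
    matches = []
    for name, (boards, cpus, gpus, extras) in CONFIGS.items():
        if (
            build.bianca_boards == boards
            and build.cpus_per_board == cpus
            and build.gpus_per_board == gpus  # GPU-less rows require exactly 0 GPUs
            and (CORE_ACCESSORIES | extras) <= build.accessories  # subset check
        ):
            matches.append(name)
    return matches


build = TrayBuild(2, 1, 2, CORE_ACCESSORIES | {"NVLink loopback cable"})
print(satisfied_configs(build))  # ['Core Tray', 'Core Tray + NVLink loopbacks']
```

Matching the GPU count exactly, rather than treating it as a minimum, keeps the GPU-less rows distinct from the populated ones. The Full Tray rows depend on the complete L10 BOM, so they are left out of this sketch.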

**Table 1-2. L11 Validation Minimum Hardware Configuration**

| Minimum Hardware Configuration      | Description |
|-------------------------------------|-------------|
| Full Rack                           | Server rack with all compute trays and switch trays populated <ul style="list-style-type: none"> <li>&gt; Bus bar integrated</li> <li>&gt; Power shelf integrated</li> <li>&gt; Manifold integrated</li> <li>&gt; Cables and rails integrated</li> </ul> |
| Full Rack + GPU-Less                | Server rack with all compute trays and switch trays populated <ul style="list-style-type: none"> <li>&gt; No GPUs are required in the compute trays</li> <li>&gt; Bus bar integrated</li> <li>&gt; Power shelf integrated</li> <li>&gt; Manifold integrated</li> <li>&gt; Cables and rails integrated</li> </ul> |
| Rack + 1 Full Tray                  | Server rack with a single fully populated compute tray <ul style="list-style-type: none"> <li>&gt; Bus bar integrated</li> <li>&gt; Power shelf integrated</li> <li>&gt; Manifold integrated</li> <li>&gt; Cables and rails integrated</li> </ul> |
| Rack with no compute or switch tray | Server rack without any compute trays or switch trays <ul style="list-style-type: none"> <li>&gt; Bus bar integrated</li> <li>&gt; Power shelf integrated</li> <li>&gt; Manifold integrated</li> <li>&gt; Cables and rails integrated</li> </ul> |
| Rack + E-Loads                      | Server rack without any compute trays or switch trays <ul style="list-style-type: none"> <li>&gt; Electronic load for load testing</li> <li>&gt; Bus bar integrated</li> <li>&gt; Power shelf integrated</li> <li>&gt; Manifold integrated</li> <li>&gt; Cables and rails integrated</li> </ul> |
| Rack + 1 Core Tray                  | Server rack with a core compute tray. See Table 1-1 for the core tray definition. |
| Full Rack Packaged on Pallet        | Server rack with all compute trays and switch trays populated, and packaged on a pallet |

## 1.4 Attachments

The following files are attached to this validation guide. All example code and test scripts are provided for partner reference only. Partners are responsible for modifying and developing their own test scripts. In addition, partners are responsible for ensuring the integrity and completeness of all their validation activities including, but not limited to, test scripts, test results, and test data analysis.

- CPU JTAG Test Script
  - The `jtag_test` script can be used to perform the JTAG validation described in Section 3.6.1 “CPU JTAG Scan Check from BMC.”
- `GB_NVL72_PVP_Validation_Leverage_Guidance.nvzip`

To access the attached files, click the **Attachment** icon on the left-hand toolbar of this PDF (in Adobe Acrobat Reader or Adobe Acrobat). Select a file and use the toolbar options (**Open**, **Save**) to retrieve it. Files with the .nvzip extension must be renamed with a .zip extension before they can be extracted using 7-Zip or other archive software.
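If a scripted workflow is preferred, the rename-and-extract step can also be done with the Python standard library. This is a minimal sketch that assumes the .nvzip attachment is an ordinary ZIP archive with a different extension, which the rename-to-.zip instruction implies. `extract_nvzip` is a hypothetical helper name, and an archive that uses a compression method that Python's `zipfile` module does not support would still need 7-Zip.

```python
import zipfile
from pathlib import Path


def extract_nvzip(nvzip_path: str, dest_dir: str) -> list[str]:
    """Rename an .nvzip attachment to .zip, extract it, and return member names.

    Assumption: the .nvzip file is a standard ZIP archive with a new extension.
    """
    src = Path(nvzip_path)
    if src.suffix.lower() != ".nvzip":
        raise ValueError(f"expected a .nvzip file, got {src.name}")

    # Step 1: rename the file so that it carries a .zip extension.
    zip_path = src.with_suffix(".zip")
    src.rename(zip_path)

    # Step 2: extract every member into dest_dir (created if it does not exist).
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

For example, `extract_nvzip("GB_NVL72_PVP_Validation_Leverage_Guidance.nvzip", "leverage_guidance")` renames the saved attachment in place and then unpacks it into a `leverage_guidance` directory.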

---

# Chapter 2. Leveraging Guidelines for NVL36 Transition to NVL72

Partners can leverage their previous qualification efforts when transitioning from a 2RU NVL36 configuration to a 1RU NVL72 configuration.

A subset of the validation test items must be retested on the 1RU compute tray as part of the compute tray transition. Refer to Table 2-1 for the test items that must be performed as part of the 2RU to 1RU compute tray transition. These test items cover areas of mechanical, thermal, power, high-speed PCIe, and I2C functional checks.

Partners transitioning from a 1RU NVL36 to a 1RU NVL72 configuration using the same compute tray, without any tray-level design changes, may leverage all of their L10 validation.

For L11 rack-level validation, partners must complete the line items outlined in the PVP on their NVL72 rack system.

For additional test descriptions for the items listed in the following table, refer to the *GB200 NVL System Partner Validation Playbook* (NVOnline: 1115639) and the *GB300 NVL System Partner Validation Playbook* (NVOnline: 1128352).

**Table 2-1. 2RU to 1RU Compute Tray Transition Test Items**

| ID    | Category      | Subcategory                            | Test Item                                                                                                                       |
|-------|---------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| ME.9  | Mechanical    | Component fit check                    | Check all components fit inside tray                                                                                            |
| LC.5  | Liquid Cooled | LC Tray Leak - Small Leak Detection    | Trigger a small liquid leak (slow dripping) in a tray using PG25                                                                |
| LC.8  | Liquid Cooled | LC Tray Leak - Disconnect Sensor Fault | Unplug the leak detection sensor from the board                                                                                 |
| TH.20 | Thermal       | Thermal Stress Test (NVQual Test #1)   | Full power thermal stress test with all external components (that is, InfiniBand, SSD) at worst case ambient (chamber required) |
| PW.1  | Power         | Power EDPp (NVQual Test #7)            | NVQual Test #7 - Input power EDP peak testing for system power supplies                                                         |
| EC.1  | PCIe          | PCIe EOM (NVQual Test #17)             | Qualification - 50x PCIe EOM Margin Test with a PASSING EOM as per vendor spec                                                  |
| EC.2  | PCIe          | PCIe hot reset (NVQual Test #24)       | LTSSM - 10000x Hot Reset - SBR with training                                                                                    |
| EC.3-EC.12 | PCIe     | PCIe Speed Change LTSSM (NVQual Test #29)                             | LTSSM - 500x Speed Change                                                                    |
| EC.13     | PCIe            | PCIe link enable disable (NVQual Test #26)                            | LTSSM - 10000x PCIe Link Enable/Disable                                                      |
| EC.14     | PCIe            | PCIe retrain (NVQual Test #27)                                        | LTSSM - 10000x PCIe retrain                                                                  |
| EC.15     | PCIe            | PCIe L1 transitions (NVQual Test #30)                                 | LTSSM - 10000x Power management L1 transition                                                |
| EC.16     | PCIe            | PCIe D3 transitions (NVQual Test #31)                                 | LTSSM - 10000x Power management D3 transition                                                |
| EC.17     | PCIe            | PCIe Tx Equalization Test (NVQual Test #25)                           | LTSSM - 10000x Tx Eq Redo                                                                    |
| IC.5      | I2C             | BMC I2C-6 and UPHY1 1P functional                                     | Test connections between BMC I2C-6 and UPHY1 from the first Bianca compute board             |
| IC.6      | I2C             | BMC I2C-6 and UPHY1 2P functional                                     | Test connections between BMC I2C-6 and UPHY1 from the second PG548                           |
| IC.7      | I2C             | BMC I2C-6 and UPHY0 1P functional                                     | Test connections between BMC I2C-6 and UPHY0 from the first PG548                            |
| IC.8      | I2C             | BMC I2C-6 and UPHY0 2P functional                                     | Test connections between BMC I2C-6 and UPHY0 from the second PG548                           |
| IC.9      | I2C             | BMC I2C-6, Mezzanine Network Board 1 I/O expander, and FRU functional | Test connections between BMC I2C-6 and Mezzanine Network Board 1 I/O expander and FRU        |
| IC.10     | I2C             | BMC I2C-6, Mezzanine Network Board 2 I/O expander, and FRU functional | Test connections between BMC I2C-6 and Mezzanine Network Board 2 I/O expander and FRU        |
| IC.11     | I2C             | BMC I2C-1 SSIF functional                                             | Test SSIF connection between FPGA and BMC I2C-1                                              |
| IC.17     | I2C             | BMC I2C-3, HMC I2C-4 and FPGA 1 functional                            | Test connections between FPGA #1, FPGA #1 ERoT, HMC I2C-4, HMC FRU and BMC I2C-3             |
| IC.19     | I2C             | BMC I2C-2, HMC I2C-1, HMC ERoT, and FPGA 2 functional                 | Test connections between BMC, HMC ERoT, HMC I2C-1, FPGA #2 ERoT and FPGA #2                  |
| IC.22     | I2C             | HMC I2C-13, BMC I2C-16 and UPHY3 2P functional                        | Test connections between HMC I2C-13, BMC I2C-16 and UPHY3 from second PG548                  |
| IC.23     | I2C             | BMC I2C-5, RTC and I/O expander functional                            | Test connections between BMC I2C-5, RTC and I/O expander                                     |
| IC.24     | I2C             | BMC I2C-6 and I2C Switches functional                                 | Test connections between BMC I2C-6 and I2C Switches on both carrier boards                   |
| IC.25     | I2C             | BMC I2C-7, PDB, Leak Sensors, and Fan controllers functional          | Test connections between BMC I2C-7, both boards' fan controllers, leak sensors, and PDB board |
| IC.26     | I2C             | BMC I2C-9, HMC-7, and M.2 functional                                  | Test connections between BMC I2C-9, HMC-7 and M.2                                            |
| IC.27     | I2C             | BMC I2C-10 and I/O expanders functional | Test connections between BMC I2C-10 and I/O expander on HMC and both I/O expanders on both carrier boards                                                   |
| IC.28     | I2C             | BMC I2C-11, TMP, and BMC FRU functional | Test connections between BMC I2C-11, temperature sensor, and BMC FRU                                                                                        |
| IC.29     | I2C             | BMC I2C-15 and UPHY3 1P functional      | Test connections between BMC I2C-15 and UPHY3 from the first PG548                                                                                          |
| SY.5      | System          | Warm reboot test                        | Test that the system can perform a warm reboot                                                                                                              |
| SY.6      | System          | AC reboot test                          | Test that the system can AC cycle (aux cycle)                                                                                                               |
| SY.7      | System          | DC reboot test                          | Test that the system can DC cycle (run power from BMC)                                                                                                      |
| SY.8      | System          | DC 2 reboot test                        | Test that the system can DC cycle (run power and standby power from BMC)                                                                                    |
| SY.15     | System          | AC reboot stress test                   | 100x AC cycles by completely removing power from tray, checking for reliable power control and ensuring OS, PCIe, MCTP, SSIF, and FPGA stability each cycle |
| SY.16     | System          | DC reboot stress test                   | 100x DC cycles (run power and standby power from BMC), check for reliable power control, and ensure OS, PCIe, MCTP, SSIF and FPGA stability each cycle      |
| SY.25     | System          | System Inventory check                  | Check the inventory (FRU list) of every node in the chassis                                                                                                 |
| SY.33     | System          | Tray hot swap                           | Unplug a tray while the system is turned on and plug it back in                                                                                             |

# Chapter 3. L10 Compute Tray Hardware Validation

This chapter describes additional recommended hardware validation items for the L10 compute tray.

## 3.1 Mechanical

Partners should perform the checks in Table 3-1 to verify the mechanical integration of the compute tray.

**Table 3-1. Mechanical Checklist**

| ID   | Subcategory                     | Test Item Description                                                                | Expected Results                                                                        |
|------|---------------------------------|--------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|
| ME.1 | Mechanical fitment of all PCBAs | Check the mechanical fit of all the PCBAs                                            | All PCBAs fit in the expected location and screw holes align with the chassis           |
| ME.2 | Mechanical Tray Weight          | Physically weigh the tray                                                            | Weight is in line with target weight                                                    |
| ME.3 | Physical Interference           | Inspect the tray for interferences or overstressed components under normal use cases | No component is overstressed and there are no mechanical interferences                  |
| ME.4 | Cable Routing Path              | Check all cable routing paths within the tray                                        | Ensure all cables are routed along the correct path                                     |
| ME.5 | Cable Routing Slack             | Check all cable lengths during normal operation in the tray                          | Ensure all cables are the correct length with no excessive slack or cables under stress |
| ME.6 | Cable Routing Radii             | Check all cable bend angles and locations in the tray                                | Ensure no cable bend exerts excessive stress on the cable                               |
| ME.7 | Cable Labeling                  | Check all cable labeling on all cables within the tray                               | Ensure all cables are correctly labeled within the tray                                 |
| ME.8 | Tray Labeling                   | Check all tray labeling within the tray and on the outside of the tray               | Ensure all labeling on and in the tray is correct                                       |
| ME.9  | Component fit check       | Check all components fit inside tray                              | All components fit as expected and can be successfully installed into the tray            |
| ME.10 | Tray screw alignment      | Check all board screw holes align with tray screw holes           | All boards and component screw holes align with tray screw holes                          |
| ME.11 | Liquid pipe routing path  | Check all liquid pipe routing paths within the tray               | Ensure all liquid pipes are routed along the correct path                                 |
| ME.12 | Liquid pipe routing slack | Check all liquid pipe lengths during normal operation in the tray | Ensure all liquid pipes are the correct length with no excessive slack or pipes under stress |
| ME.13 | Liquid pipe routing radii | Check all liquid pipe bend angles and locations in the tray       | Ensure no liquid pipe bend exerts excessive stress on the pipe                               |

## 3.2 Thermal

This section details the thermal testing and procedures for the GB200 and GB300 NVL systems.

### 3.2.1 Host BMC Shdn, Event Logging for THERM\_OVERT\_N Assertion

#### Purpose

This test verifies that the partner's BMC takes the correct action if components on the baseboard rise above their maximum operating temperature.

#### Prerequisites

- Download the error injection toolkit within the *NVIDIA RASTool (NVRASTool)* (NVOnline: 1112947) and follow the instructions in the user's guide for any prerequisites required to run the toolkit on your host machine.
- Implement a logging mechanism within the host BMC to log events, such as overtemperature, in non-volatile storage for RMA and debug analysis.

#### Test Procedure

1. Within the NVRASTool, assert hardware fault injection conditions to emulate a **THERM\_OVERT\_N** event. Injecting an overtemperature condition within the following baseboard components asserts the **THERM\_OVERT\_N** signal:
2. Ensure that **THERM\_OVERT\_N** is asserted through the Out-of-Band (OOB) path.
3. Upon triggering **THERM\_OVERT\_N**, ensure the host system shuts down the baseboard within 1 second by de-asserting **GPU\_BASE\_PWR\_EN** or **GPU\_BASE\_STBY\_EN**.
4. Log the overtemperature event in the BMC.
5. Power cycle the system chassis to reset the fault.

#### Pass or Fail Criteria

- The baseboard remains in a shutdown state until human intervention.
- The BMC logs the **THERM\_OVERT\_N** event in non-volatile storage for RMA and debug analysis.
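The 1-second shutdown requirement in step 3 can be checked from edge timestamps captured on a logic analyzer or oscilloscope. This is a minimal sketch that assumes both edges have already been exported as millisecond timestamps. `check_overt_shutdown` is a hypothetical helper. The same check applies to the **FPGA\_OVERT\_N** test in Section 3.2.2.

```shell
# Hypothetical check of the OVERT-to-shutdown latency.
# ASSUMPTION: $1 = OVERT assertion time, $2 = GPU_BASE_PWR_EN (or
# GPU_BASE_STBY_EN) de-assertion time, both in milliseconds (for example
# from a logic analyzer export, or `date +%s%3N` on GNU date).
# $3 = optional limit in ms (default 1000, the 1-second requirement).
check_overt_shutdown() {
  local assert_ms=$1 deassert_ms=$2 limit_ms=${3:-1000}
  local delta=$(( deassert_ms - assert_ms ))
  if (( delta < 0 )); then
    echo "FAIL: enable de-asserted before OVERT (check timestamps)"
    return 2
  elif (( delta <= limit_ms )); then
    echo "PASS: shutdown in ${delta} ms"
    return 0
  else
    echo "FAIL: shutdown took ${delta} ms (limit ${limit_ms} ms)"
    return 1
  fi
}
```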

### 3.2.2 Host BMC Shdn, Event Logging for FPGA\_OVERT\_N Assertion

#### Purpose

This test verifies that the partner's BMC takes the correct action if the FPGA rises above its maximum operating temperature.

#### Prerequisites

- Download the error injection toolkit within the *NVIDIA RASTool (NVRASTool)* (NVOnline: 1112947) and follow instructions within the user's guide for any prerequisites required to run the toolkit on your host machine.
- Implement a logging mechanism within the host BMC to log events, such as overtemperature, in non-volatile storage for RMA and debug analysis.

#### Test Procedure

1. Within the NVRASTool, assert hardware fault injection conditions to emulate an **FPGA\_OVERT\_N** event.
2. Ensure that **FPGA\_OVERT\_N** is asserted through the Out-of-Band (OOB) path.
3. Upon triggering **FPGA\_OVERT\_N**, ensure the host system shuts down the baseboard within 1 second by de-asserting **GPU\_BASE\_PWR\_EN** or **GPU\_BASE\_STBY\_EN**.
4. Log the FPGA overtemperature event in the BMC.
5. Power cycle the system chassis to reset the fault.

#### Pass or Fail Criteria

- The baseboard remains in a shutdown state until human intervention.
- The BMC logs the **FPGA\_OVERT\_N** event in non-volatile storage for RMA and debug analysis.

## 3.3 Liquid Cooling

This section details the liquid cooling testing and procedures for the GB200 and GB300 NVL systems.

### 3.3.1 Leak Detection

#### Purpose

Verify that the leak detection features on the compute tray detect liquid properly. Partners can verify the hardware feature through ADC readouts before checking the BMC firmware implementation. Refer to the *NVIDIA MGX Leak Detection Strategy and Remediation Application Note* (NVOnline: 1115991) for more information.

#### Prerequisites

Leak sensors are connected to the compute board. They do not need to be fully integrated into the tray chassis.

#### Test Procedure

1. Test the leak sensor ADC readings under dry, nominal conditions. Verify that the normalized ADC readings are within the defined passing range (0.49 to 0.55).
2. Stop the BMC leak detector service. An example command follows; partners will need to modify it to match their BMC implementation.

```
systemctl stop xyz.openbmc_project.leakdetectsensor.service  # This service should restart automatically on the next BMC reset
```

3. Unbind the ADC drivers, depending on which ADCs are connected.

```
# Repeat for each ADC address fitted on your board: 0x34, 0x35, 0x18, or 0x1E
echo ${ADC_ADDR} > /sys/bus/i2c/devices/i2c-6/delete_device
```

4. For example, if you have a system with MAX1363 ADC parts:

```
root@gb200nvl-bmc:/# i2cdetect -y -q 6
      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -----0c----- --
10: -- 12 -- 14 -- -----4----- --
20: UU -- -- UU -----28----- UU -- -- UU
30: -- -- -- UU UU --37----- -- -- -- --
40: -----1----- -- -- -- -- 4e --
50: UU -- -- -- ----- -- -- -- --
60: 60 61 -- ----- -- -- --
70: ----- -- -- --

root@gb200nvl-bmc:/# systemctl stop xyz.openbmc_project.leakdetectsensor.service
root@gb200nvl-bmc:/# echo 0x34 > /sys/bus/i2c/devices/i2c-6/delete_device
root@gb200nvl-bmc:/# echo 0x35 > /sys/bus/i2c/devices/i2c-6/delete_device
root@gb200nvl-bmc:/# i2cdetect -y -q 6
      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -----0c----- --
10: -- 12 -- 14 -- ----- -- -- --
20: UU -- -- UU -----28----- UU -- -- UU
30: -- -- -- 34 35 --37----- -- -- -- --
40: ----- -- -- -- -- -- 4e --
50: UU -- -- -- ----- -- -- -- --
60: 60 61 -- ----- -- -- --
70: ----- -- -- -- -- --
```

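5. Restart the BMC leak detector service and re-bind the ADCs (for example, by writing the device name and address, such as `max1363 0x34`, to `/sys/bus/i2c/devices/i2c-6/new_device`).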

```
systemctl start xyz.openbmc_project.leakdetectsensor.service
```

6. Read the ADCs before applying liquid to the leak detection sensor. Each ADC has two channels (AIN0 and AIN1).

```
# Primary Bianca AIN0
root@gb200nvl-bmc:/# i2ctransfer -y -f 6 w2@0x34 0x61 0x82 r2
0x98 0x54

# Primary Bianca AIN1
root@gb200nvl-bmc:/# i2ctransfer -y -f 6 w2@0x34 0x63 0x82 r2
0xb8 0x4b

# Secondary Bianca AIN0
root@gb200nvl-bmc:/# i2ctransfer -y -f 6 w2@0x35 0x61 0x82 r2
0x98 0x58

# Secondary Bianca AIN1
root@gb200nvl-bmc:/# i2ctransfer -y -f 6 w2@0x35 0x63 0x82 r2
0xb8 0x4a
```

7. Convert each ADC reading to a voltage and compare it with the nominal threshold range. Take the lower 12 bits of the two-byte reading from the previous step and convert them to a decimal value. Then divide by 4095 and multiply by 3.3 to get the ADC voltage.

```
# Primary Bianca AIN0
root@gb200nvl-bmc:/# i2ctransfer -y -f 6 w2@0x34 0x61 0x82 r2
0x98 0x54
```

8. To convert the sensor reading "0x98 0x54", take the lower 12 bits, 0x854, which equals 2132 in decimal. Divide this by 4095 and multiply by 3.3: $(2132/4095) \times 3.3 \approx 1.718\ \text{V}$.

   For the normalized reading, divide by 4095 only: $2132/4095 \approx 0.52$.

9. Repeat the same exercise after applying liquid droplets to the leak sensors at the GPU sensor (P4672) and the manifold sensor (P4674). Partners can choose to apply additional leak testing at other sensor locations as well.



**Note:** Take care when applying liquid droplets to the leak sensors. Partners should keep the leak sensor outside the tray assembly and apply droplets directly to the sensor to avoid damaging other electronic components.

10. Repeat the same exercise after shorting the sensor with a jumper wire, and verify that the normalized ADC reading is at or close to zero.
11. Repeat the same exercise with the sensor unplugged from the compute tray (missing sensor condition), and verify that the normalized ADC reading is at or close to one.
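The conversion in steps 7 and 8 and the ranges in steps 1, 10, and 11 can be combined into a small helper for use on the BMC. This is a minimal sketch for bash with awk available.

- `adc_raw12`, `adc_report`, and `read_leak_adcs` are hypothetical names.
- The I2C commands are copied from step 6.
- The 0.05 and 0.95 cut-offs used for "at or close to zero/one" are assumptions, not values from this guide.

```shell
# Hypothetical helpers for the leak sensor ADC checks.
# Full scale is 4095 counts at 3.3 V, as described in step 7.

# Lower 12 bits of a two-byte read-out, for example "0x98 0x54" -> 2132.
adc_raw12() {                        # $1 = first byte, $2 = second byte
  echo $(( (($1 << 8) | $2) & 0xFFF ))
}

# Print raw code, voltage, normalized value, and a classification.
# ASSUMPTION: <= 0.05 is treated as "close to zero" (short) and
# >= 0.95 as "close to one" (missing sensor); adjust to your own limits.
adc_report() {                       # $1 = first byte, $2 = second byte
  local raw
  raw=$(adc_raw12 "$1" "$2")
  awk -v raw="$raw" 'BEGIN {
    norm = raw / 4095
    volts = norm * 3.3
    if (norm >= 0.49 && norm <= 0.55) state = "DRY"
    else if (norm <= 0.05)            state = "SHORT"
    else if (norm >= 0.95)            state = "MISSING"
    else if (norm < 0.49)             state = "LEAK"
    else                              state = "OUT_OF_RANGE"
    printf "raw=%d volts=%.3f norm=%.3f state=%s\n", raw, volts, norm, state
  }'
}

# Read both channels of both ADCs with the commands from step 6
# (requires the ADCs to be accessible on I2C bus 6 of the BMC).
read_leak_adcs() {
  local addr ch
  for addr in 0x34 0x35; do
    for ch in 0x61 0x63; do
      printf '%s %s: ' "$addr" "$ch"
      # Unquoted on purpose: the two printed bytes become two arguments.
      adc_report $(i2ctransfer -y -f 6 w2@"$addr" "$ch" 0x82 r2)
    done
  done
}
```

For the step 8 example, `adc_report 0x98 0x54` prints `raw=2132 volts=1.718 norm=0.521 state=DRY`.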

#### Pass or Fail Criteria

- Sensor ADC hardware pass or fail criteria:
  - For a dry, no-leakage condition, the normalized ADC reading should be between 0.49 and 0.55.
  - For a leakage condition, the resistance across the sensor drops, resulting in a lower voltage at the ADC input. As a result, the normalized ADC reading should be lower than 0.49.
  - For a sensor short condition, the normalized ADC reading should be at or close to zero.
  - For a missing sensor condition, the normalized ADC reading should be at or close to one.
- BMC firmware implementation pass or fail criteria:
  - For any fault condition (leakage, short, missing sensor), the amber fault LED on the front panel should blink at 4 Hz.
  - The BMC should support the DMTF Redfish LeakDetection and LeakDetector schemas. Verify that the BMC issues a Redfish event log entry for the fault. The fault event log should be located in the Redfish event log directory, for example:  
`/redfish/v1/Systems/System_0/LogServices/EventLog`
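The Redfish event log check can be scripted from any host that can reach the BMC. This is a minimal sketch that assumes basic authentication, `curl`, and the example event log path above. The `Entries` sub-collection is the standard Redfish LogService layout. However, `fetch_event_log`, `find_log_messages`, and the search pattern are hypothetical, because message text depends on the partner's BMC implementation.

```shell
# Fetch the event log entries collection from the BMC.
fetch_event_log() {                  # $1 = BMC host, $2 = user:password
  curl -sk -u "$2" \
    "https://$1/redfish/v1/Systems/System_0/LogServices/EventLog/Entries"
}

# Print every "Message" value in a Redfish entries response (read on stdin)
# that matches the pattern in $1, case-insensitively. Values containing
# escaped quotes are cut short; a JSON parser such as jq is more robust.
find_log_messages() {
  grep -o '"Message" *: *"[^"]*"' | sed 's/^"Message" *: *"//; s/"$//' | grep -i -- "$1"
}

# Example:
#   fetch_event_log 192.0.2.10 root:password | find_log_messages leak
```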

## 3.4 Power Validation

This section details the power validation testing and procedures for the GB200 and GB300 NVL systems.

### 3.4.1 Input EDP Peak Testing

#### Purpose

This test validates that the system can operate under transient workload at various pulsing frequencies and that the input voltage peak-to-peak value is within the values listed in the product specification.

For board-level input EDPP specifications, refer to the *NVIDIA GB200 NVL Bianca Compute Board Product Specification* (NVOnline: 1114953) and the *NVIDIA GB300 NVL72 HPM Compute Board Product Specification* (NVOnline: 1126424).

#### Test Procedure

1. Attach a probe to the input power point on the 12 V bus bar feeding the compute board PCBA.
2. Run NVQual Test #7 (Input EDPP Test).
3. On the 12V voltage rail, capture the instantaneous voltage peak-to-peak value, and compare it against the product specification for peak-to-peak value.
4. To accurately characterize input EDPP transient current and power, there are two measurement methods:

- a. Use the oscilloscope's **High Res** mode, which oversamples and then averages over the acquisition interval. For the board-level and compute tray-level specifications, the moving average timescales are 400 µs and 50 ms. See **Section 6.1 "Moving Average Measurement using Oscilloscope HiRes Mode"** for more information.
- b. Collect data samples over the transient workload and perform post-processing for each load frequency with the recommended oscilloscope settings. See **Section 6.2 "Input EDP Post-Processing Guideline"** for more information.
  - i. Sampling rate:
    - (1) Load frequencies < 10 kHz: 100 µs/pt is sufficient
    - (2) Load frequencies ≥ 10 kHz: 40 µs/pt is recommended
  - ii. Record length and horizontal scale:
    - (1) Load frequencies < 10 kHz: 1500 seconds (150 sec/div)
    - (2) Load frequencies ≥ 10 kHz: 1200 seconds
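For method b, the post-processing can be approximated with a streaming moving average. This is a minimal sketch that assumes the oscilloscope data is exported as CSV rows of `time_in_seconds,value`. That export format is hypothetical, so adapt the field handling to your scope. The sketch computes a generic trailing moving average, not the exact procedure in Section 6.2. Samples are discarded as the window advances, so the recommended record lengths fit comfortably in memory. Those record lengths are 1500 s at 100 µs/pt (15 million points) and 1200 s at 40 µs/pt (30 million points). `edpp_summary` is a hypothetical name.

```shell
# Hypothetical post-processing helper. Input: CSV rows "time_s,value".
# Output: raw peak-to-peak value plus min/max of a trailing moving average
# over a window of $2 seconds (e.g. 0.0004 for 400 us, 0.05 for 50 ms).
edpp_summary() {                     # $1 = CSV file, $2 = window in seconds
  awk -v w="$2" 'BEGIN { exit !(w + 0 > 0) }' || { echo "window must be > 0" >&2; return 2; }
  awk -F, -v win="$2" '
    $1 !~ /^[-+0-9.eE]+$/ { next }            # skip header or non-numeric rows
    {
      t = $1 + 0; v = $2 + 0
      if (n == 0) { vmin = vmax = v; t0 = t }
      if (v < vmin) vmin = v
      if (v > vmax) vmax = v
      bt[n] = t; bv[n] = v; sum += v; n++     # append the new sample
      while (head < n && bt[head] <= t - win) { # drop samples outside the window
        sum -= bv[head]; delete bt[head]; delete bv[head]; head++
      }
      if (t - t0 >= win) {                    # report only full windows
        avg = sum / (n - head)
        if (!have || avg < amin) amin = avg
        if (!have || avg > amax) amax = avg
        have = 1
      }
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "vpp=%.4f", vmax - vmin
      if (have) printf " avg_min=%.4f avg_max=%.4f", amin, amax
      printf "\n"
    }' "$1"
}

# Example: edpp_summary capture_1khz.csv 0.0004
```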

#### Pass or Fail Criteria

Ensure that the compute tray does not crash when running the workload and that the peak-to-peak voltage noise is within the product's electrical specification.

### 3.4.2 GPU Power Brake

#### Purpose

This test ensures that the partner can assert the power brake (**PWR\_BRAKE\_N**) to reduce peak and continuous power draw during emergency situations.

#### Test Procedure

1. Ensure that the **PWR\_BRAKE\_N** signal is de-asserted during normal operation before the experiment.
2. If power brake is implemented for emergency scenarios, toggle the **PWR\_BRAKE\_N** external input.
  - a. Assert **PWR\_BRAKE\_N** for at least 250 ms on the host partner system.
3. Verify that total baseboard power is reduced while **PWR\_BRAKE\_N** is asserted.
  - a. Run tools such as the NVQual thermal test or nvidia-smi to log GPU power levels and GPU clocks.
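For step 3, nvidia-smi can log GPU power and SM clock at a fixed interval. This is a minimal sketch that assumes one log is captured during normal operation and a second while **PWR\_BRAKE\_N** is asserted. The helper names and file names are hypothetical; `--query-gpu`, `--format`, and `-lms` are standard nvidia-smi options. Averaging the power draw of all GPUs in each log is a simplification, and a per-GPU comparison may be more informative.

```shell
# Log timestamp, GPU index, power draw (W), and SM clock (MHz) every 100 ms
# until interrupted with Ctrl+C. $1 = output CSV file.
log_gpu_power() {
  nvidia-smi --query-gpu=timestamp,index,power.draw,clocks.sm \
             --format=csv,noheader,nounits -lms 100 > "$1"
}

# Average the power.draw column (field 3) over all rows and GPUs in a log.
# Rows with non-numeric power (for example "[N/A]") are skipped.
avg_power() {                        # $1 = CSV from log_gpu_power
  awk -F', *' '$3 ~ /^[0-9.]+$/ { s += $3; n++ }
               END { if (n) printf "%.2f\n", s / n; else exit 1 }' "$1"
}

# PASS if the average power with the brake asserted is lower than without it.
check_brake() {                      # $1 = normal-operation log, $2 = brake-asserted log
  local before after
  before=$(avg_power "$1") || return 2
  after=$(avg_power "$2") || return 2
  if awk -v b="$before" -v a="$after" 'BEGIN { exit !(a + 0 < b + 0) }'; then
    echo "PASS: average power ${before} W -> ${after} W"
  else
    echo "FAIL: average power ${before} W -> ${after} W"
    return 1
  fi
}
```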

#### Pass or Fail Criteria

- **PWR\_BRAKE\_N** is de-asserted during normal operation.
- **PWR\_BRAKE\_N** is asserted by the host platform for all emergency scenarios.

### 3.4.3 Power Distribution Board Static Loads

#### Purpose

The purpose of this test is to verify that the partner-designed power distribution board (PDB) can supply and sustain power under various static load conditions. For more information, refer to the *GB200 NVL72 Compute Tray Power Distribution Board Design Specification* (NVOnline: 1115461) and the *NVIDIA GB300 NVL72 Power Distribution Board Design Specification* (NVOnline: 1136956).

#### Prerequisites

A PDB with an electronic load to simulate an active load.

Some of the GB NVL72 compute tray components are partner designed, and partners are responsible for validating against the power budget of their compute tray. The following test condition and procedure are for partner reference only.

#### Test Procedure

1. Connect an electronic load to all power rails (**12V\_RUN\_1**, **12V\_RUN\_2**, and **12V\_STBY**) on the PDB, with 54 V input power applied to the PDB.
2. Start from 0% static load and draw power from the PDB.
3. Sweep static load conditions in increasing steps (partner defined, a minimum of 4 to 5 steps) until the PDB hot swap controller (HSC) overcurrent protection (OCP) trips.
4. Record input voltage, output voltage, and current for each step.
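Step 4 produces one row per load step. This sketch assumes each rail's measurements are recorded as CSV rows of `load_percent,vin,vout,iout`, which is a hypothetical format. It reports output power and the output voltage droop relative to the first (lowest-load) step. `pdb_static_report` is a hypothetical name. Droop is an extra derived metric for review, not a pass or fail limit from this guide.

```shell
# Hypothetical report for one PDB rail. Input CSV: load_percent,vin,vout,iout
# The first data row (normally the 0% step) is the reference for droop.
pdb_static_report() {                # $1 = CSV file
  awk -F, '
    $1 !~ /^[0-9.]+$/ { next }              # skip header or blank rows
    {
      if (!seen) { v0 = $3 + 0; seen = 1 }
      droop = (v0 - $3) / v0 * 100          # percent drop from reference Vout
      printf "load=%s%% vin=%.2fV vout=%.3fV iout=%.1fA pout=%.1fW droop=%.2f%%\n",
             $1, $2, $3, $4, $3 * $4, droop
    }' "$1"
}
```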

#### Pass or Fail Criteria

- The PDB sustains all static load conditions up to the OCP trip point without any issues.
- The PDB sustains no permanent damage after OCP.

### 3.4.4 Power Distribution Board Dynamic Loads

#### Purpose

This test's purpose is to verify that the partner-designed power distribution board (PDB) can supply and sustain power under various dynamic load conditions. For more information, refer to the *GB200 NVL72 Compute Tray Power Distribution Board Design Specification* (NVOnline: 1115461) and the *GB300 NVL72 Power Distribution Board Design Specification* (NVOnline: 1136956).

#### Prerequisites

A PDB with an electronic load to simulate an active load.

Some of the GB NVL72 compute tray components are partner designed, and partners are responsible for validating against the power budget of their compute tray. The following test condition and procedure are for partner reference only.

#### Test Procedure

1. Connect an electronic load to all power rails (**12V\_RUN\_1**, **12V\_RUN\_2**, and **12V\_STBY**) on the PDB, with 54 V input power applied to the PDB.
2. Set the electronic load to 100% load at a 50% duty cycle, at 10 ms, 20 ms, and 50 ms.
3. Test at a 10% duty cycle at the maximum load, up to the point at which the board hot swap controller (HSC) overcurrent protection (OCP) trips.
4. Record input voltage, output voltage, and current for each step.
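The load timing in steps 2 and 3 can be tabulated before programming the electronic load. This sketch assumes that the 10 ms, 20 ms, and 50 ms values are the pulse periods. If your load tool defines them differently (for example, as on-time), adjust the calculation accordingly. `pulse_timing` is a hypothetical helper.

```shell
# Hypothetical helper: on-time, off-time, and frequency for a pulsed load.
# ASSUMPTION: $1 is the full pulse period in ms; $2 is the duty cycle in %.
pulse_timing() {
  awk -v p="$1" -v d="$2" 'BEGIN {
    on = p * d / 100
    off = p - on
    printf "period=%gms duty=%g%% on=%gms off=%gms freq=%gHz\n", p, d, on, off, 1000 / p
  }'
}

# Example: the 50% duty-cycle cases from step 2
#   for p in 10 20 50; do pulse_timing "$p" 50; done
```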

#### Pass or Fail Criteria

- The PDB sustains all load conditions up to the OCP trip point without any issues.
- The PDB sustains no permanent damage after OCP.

## 3.5 PCIe

This section details the PCIe testing and procedures for the GB200 and GB300 NVL systems.

### 3.5.1 PCIe Speed and Width Configuration Check

#### Purpose

This test verifies that all NVIDIA Grace™ PCIe lanes are configured to their expected speed and width.

#### Prerequisites

The following table shows the minimum required hardware configuration for verifying the CPU UPHY PCIe width and speed. Partners can prioritize validating PCIe links that do not require GPUs to ensure faster time to market.

**Table 3-2. PCIe Link Specification and Minimum Hardware Configuration**

| # | PCIe Link                           | Width | Speed | Minimum Hardware Configuration |
|---|-------------------------------------|-------|-------|--------------------------------|
| 1 | Grace 0 UPHY 0 (#1 CX-7/CX-8) (x16) | x16   | Gen5  | Full Tray + GPU-Less           |
| 2 | Grace 0 UPHY 1 (#2 CX-7/CX-8) (x16) | x16   | Gen5  | Full Tray + GPU-Less           |
| 3 | Grace 0 UPHY 2 (PEX SW Gen 3) (x1)  | x1    | Gen3  | Full Tray + GPU-Less           |
| 4 | Grace 0 UPHY 2 (1G NIC) (x1)        | x1    | Gen3  | Full Tray + GPU-Less           |
| 5 | Grace 0 UPHY 2 (USB Bridge) (x1)    | x1    | Gen2  | Full Tray + GPU-Less           |
| 6 | Grace 0 UPHY 2 (DC-SCI) (x1)        | x1    | Gen2  | Full Tray + GPU-Less           |
| 7 | Grace 0 UPHY 3 (BF-3) (x16)         | x16   | Gen5  | Full Tray + GPU-Less           |
| 8 | Grace 0 UPHY 3 (4 x E1.S) (x4)      | x4    | Gen4  | Full Tray + GPU-Less           |
| 9 | Grace 0 UPHY 4 (GPU/CX-8) (x1)      | x1    | Gen4  | Core Tray                      |
| 10 | Grace 0 UPHY 5 (GPU/CX-8) (x1)      | x1    | Gen4  | Core Tray                      |
| 11 | Grace 1 UPHY 0 (#1 CX-7/CX-8) (x16) | x16   | Gen5  | Full Tray + GPU-Less           |
| 12 | Grace 1 UPHY 1 (#2 CX-7/CX-8) (x16) | x16   | Gen5  | Full Tray + GPU-Less           |
| 13 | Grace 1 UPHY 2 (M.2 SSD) (x4)       | x4    | Gen4  | Full Tray + GPU-Less           |
| 14 | Grace 1 UPHY 3 (BF-3) (x16)         | x16   | Gen5  | Full Tray + GPU-Less           |
| 15 | Grace 1 UPHY 3 (4 x E1.S) (x4)      | x4    | Gen4  | Full Tray + GPU-Less           |
| 16 | Grace 1 UPHY 4 (GPU/CX-8) (x1)      | x1    | Gen4  | Core Tray                      |
| 17 | Grace 1 UPHY 5 (GPU/CX-8) (x1)      | x1    | Gen4  | Core Tray                      |

#### Test Procedure

Run the following from the host and check the speed and width of each device in the output:

```
sudo lspci -vvv
```

#### Pass or Fail Criteria

Verify each link conforms to the speed and width listed in Table 3-2.
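To compare many links against Table 3-2 at once, the `LnkSta` lines of the `lspci -vvv` output can be summarized. This is a minimal sketch that assumes the pciutils output format: each device header line starts with its address, and the negotiated link appears as `LnkSta: Speed <rate>GT/s, Width x<n>`. Run lspci as root, because without root it cannot read the capability registers. `pcie_link_summary` is a hypothetical name. The speed-to-generation mapping follows the PCIe transfer rates of 2.5, 5, 8, 16, and 32 GT/s for Gen1 through Gen5.

```shell
# Hypothetical summary of negotiated PCIe links from "lspci -vvv" (stdin).
# Prints: <address> <generation> <width> <device description>
pcie_link_summary() {
  awk '
    /^[0-9a-fA-F]/ { bdf = $1; name = $0; sub(/^[^ ]+ /, "", name) }
    /LnkSta:/ {
      speed = $0; sub(/.*Speed /, "", speed); sub(/[ ,].*/, "", speed)
      width = $0; sub(/.*Width /, "", width); sub(/[ ,].*/, "", width)
      gen = "?"
      if (speed == "2.5GT/s")     gen = "Gen1"
      else if (speed == "5GT/s")  gen = "Gen2"
      else if (speed == "8GT/s")  gen = "Gen3"
      else if (speed == "16GT/s") gen = "Gen4"
      else if (speed == "32GT/s") gen = "Gen5"
      printf "%s %s %s %s\n", bdf, gen, width, name
    }'
}

# Example (-D always prints the PCI domain):
#   sudo lspci -D -vvv | pcie_link_summary
```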

### 3.5.2 PCIe Link Training and Status State Machine

PCIe Link Training and Status State Machine (LTSSM) testing should be carried out on all PCIe links from the Grace CPU. Partners may be able to test certain links before the Bianca board is fully populated with GPUs. PCIe LTSSM testing includes the following tests, which should be performed on all Grace PCIe links.
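Table 2-1 maps these subtests to NVQual tests. For a quick scripted check, the following minimal sketch covers two of the subtests, hot reset through SBR and retrain, using `setpci` and `lspci` from pciutils. It makes these assumptions:

- It runs with root access.
- `RP` is the Grace root port above the link under test.
- Any driver bound to the downstream device can tolerate the reset; unbinding the driver first is safer.

All helper names and the example addresses are hypothetical. The register bits follow the PCI and PCIe specifications: Bridge Control bit 6 is Secondary Bus Reset, and Link Control (offset 0x10 in the PCI Express capability) bit 5 is Retrain Link.

```shell
# Hypothetical LTSSM helpers. Run as root; setpci values are hexadecimal.
set_bits()   { printf '%04x\n' $(( $1 | $2 )); }           # $1 = value, $2 = mask
clear_bits() { printf '%04x\n' $(( $1 & ~$2 & 0xFFFF )); }

# One hot reset: set Secondary Bus Reset (Bridge Control bit 6), then clear it.
sbr_once() {                         # $1 = root port address
  local ctl
  ctl=0x$(setpci -s "$1" BRIDGE_CONTROL.w)
  setpci -s "$1" BRIDGE_CONTROL.w="$(set_bits "$ctl" 0x40)"
  sleep 0.01                          # hold reset for about 10 ms
  setpci -s "$1" BRIDGE_CONTROL.w="$(clear_bits "$ctl" 0x40)"
  sleep 1                             # give the link time to retrain
}

# One retrain: set Retrain Link (Link Control bit 5); hardware clears it.
retrain_once() {                     # $1 = root port address
  local lnkctl
  lnkctl=0x$(setpci -s "$1" CAP_EXP+10.w)
  setpci -s "$1" CAP_EXP+10.w="$(set_bits "$lnkctl" 0x20)"
  sleep 0.1
}

# Negotiated speed and width of a device, for example "32GT/s x16".
link_state() {                       # $1 = device address
  lspci -s "$1" -vv | awk '/LnkSta:/ {
    sp = $0; sub(/.*Speed /, "", sp); sub(/[ ,].*/, "", sp)
    w = $0; sub(/.*Width /, "", w); sub(/[ ,].*/, "", w)
    print sp, w; exit }'
}

# Count set ("+") AER status flags in UESta/CESta lines of "lspci -vvv" (stdin).
aer_error_count() {
  awk '/UESta:|CESta:/ { n += gsub(/[+]/, "+") } END { print n + 0 }'
}

# Example loop for the hot reset subtest (10000 iterations per Table 3-3):
#   RP=0008:00:00.0 EP=0008:01:00.0          # hypothetical addresses
#   expected=$(link_state "$EP")
#   for i in $(seq 1 10000); do
#     sbr_once "$RP"
#     [ "$(link_state "$EP")" = "$expected" ] || { echo "link mismatch at $i"; break; }
#     [ "$(lspci -s "$EP" -vvv | aer_error_count)" -eq 0 ] || { echo "AER error at $i"; break; }
#   done
```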

**Table 3-3. Grace CPU PCIe LTSSM**

| Test               | Subtest                    | Iterations | Pass or Fail Criteria                                                                       |
|--------------------|----------------------------|------------|---------------------------------------------------------------------------------------------|
| Link Interoperability | Hot Reset - SBR         | 10000      | Train to device maximum Speed and Width with no Uncorrectable or Correctable Errors logged. |
|                    | Speed Change (Gen1 - Gen2) | 500        | Train to Gen2 x16 No errors                                                                 |
|                    | Speed Change (Gen1 - Gen3) | 500        | Train to Gen3 x16 No errors                                                                 |
|                    | Speed Change (Gen1 - Gen4) | 500        | Train to Gen4 x16 No errors                                                                 |
|                    | Speed Change (Gen1 - Gen5) | 500        | Train to Gen5 x16 No errors                                                                 |
|                    | Speed Change (Gen2 - Gen3) | 500        | Train to Gen3 x16 No errors                                                                 |
|                    | Speed Change (Gen2 - Gen4) | 500        | Train to Gen4 x16 No errors                                                                 |
|                    | Speed Change (Gen2 - Gen5) | 500        | Train to Gen5 x16 No errors                                                                 |
|                    | Speed Change (Gen3 - Gen4) | 500        | Train to Gen4 x16 No errors                                                                 |
|                    | Speed Change (Gen3 - Gen5) | 500        | Train to Gen5 x16 No errors                                                                 |

| Test | Subtest                        | Iterations | Pass or Fail Criteria                                                                       |
|------|--------------------------------|------------|---------------------------------------------------------------------------------------------|
|      | Speed Change (Gen4 - Gen5)     | 500        | Train to Gen5 x16 No errors                                                                 |
|      | Link Disable/Enable            | 10000      | Train to device maximum Speed and Width with no Uncorrectable or Correctable Errors logged. |
|      | Retrain                        | 10000      | Train to device maximum Speed and Width with no Uncorrectable or Correctable Errors logged. |
|      | PowerManagement L1 Transitions | 10000      | Train to device maximum Speed and Width with no Uncorrectable or Correctable Errors logged. |
|      | PowerManagement D3 Transitions | 10000      | Train to device maximum Speed and Width with no Uncorrectable or Correctable Errors logged. |
|      | Tx Eq Redo                     | 10000      | Train to device maximum Speed and Width with no Uncorrectable or Correctable Errors logged. |

### Prerequisites

Partners must use NVQual vG1.0.4 or later.

### Test Procedure

1. Download the *NVQual for Grace-Blackwell Product* (NVOnline: 1119931) and follow the instructions within the *NVQual User's Guide*.
2. Select and run the PCIe Link Training test suite under NVQual.
3. Verify the NVQual test results, and check each link against the minimum hardware configuration in Table 3-2 to confirm that the PCIe LTSSM tests can run successfully on that configuration.
4. Update the test results in the Partner Validation Playbook.
5. Combine the test logs into a single ZIP file for the final NVQual submission.
6. Refer to the *Introduction to NVIDIA Partner Validation Playbook (PVP)* (NVOnline: 1106745) for instructions on test results submission.

### Pass or Fail Criteria

All PCIe LTSSM tests pass in the NVQual test tool, meeting the criteria in Table 3-3.

## 3.6 I/O Interface

This section details the I/O interface testing and procedures for the GB200 and GB300 NVL systems.

## 3.6.1 CPU JTAG Scan Check from BMC

### Purpose

This test checks that the debug JTAG scan of a single Grace CPU from the BMC functions as expected.

### Prerequisites

Partners should use, at minimum, the QS firmware package dev1 drop and a C05 Bianca board.

### Test Procedure

1. Set the **JTAG\_MUX\_SEL** to active low on the BMC  

```
root@gb200nvl-bmc:~# gpioset `gpiofind "JTAG_MUX_SELECT-0"`=1
```
2. Set the **JTAG\_BYPASS\_B2B** to 1 from the HMC via SMBPBI  

```
i2cset -y 1 0x60 0x5c 0x04 0xb4 0x99 0x56 0x80 i # unlock hidden devices via SMBPBI
```
3. Set the I/O expander (4th bit) to 1 from the HMC  

```
root@gb200nvl-hmc:~# i2ctransfer -y 1 w3@0x0f 0x38 0x02 0x08
```
4. Run the `jtag_test` code attached to this validation guide:

```
root@gb200nvl-bmc:~# ./jtag_test -j /dev/jtag1 -i 0x6ba00477 -f 1000000 -v
```
5. Verify that the number of devices found is 1 and that the ID code of the found device is 0x6ba00477:

```
Found number of devices:1
Found ir length devices:4
idcode:0x6ba00477
```



**Note:** To access the attached test code “`jtag_test`,” click the **Attachment** icon on the left-hand toolbar of this PDF (using Adobe Acrobat Reader or Adobe Acrobat). Select the file, and use the toolbar options (**Open**, **Save**) to retrieve the attachment. Files with the `.nvzip` extension must be renamed to `.zip`; they can then be extracted using 7-Zip or other archive software.

### Pass or Fail Criteria

CPU JTAG can successfully complete the boundary scan from the BMC. The number of devices and the device ID code match the expected values.
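To check the scan result in a script, the `jtag_test` output can be parsed as sketched below. This assumes the output format shown in the examples in this section (`Found number of devices:`, `Found ir length devices:`, and one `idcode:` line per device); the helper names are hypothetical. The same check covers the dual-CPU case in Section 3.6.2 by passing `expected_devices=2`.

```python
import re

def parse_jtag_test(output):
    """Extract (device count, total IR length, list of ID codes) from jtag_test output."""
    count = re.search(r"Found number of devices:\s*(\d+)", output)
    ir_len = re.search(r"Found ir length devices:\s*(\d+)", output)
    idcodes = [int(v, 16) for v in re.findall(r"idcode:\s*(0x[0-9a-fA-F]+)", output)]
    return (
        int(count.group(1)) if count else None,
        int(ir_len.group(1)) if ir_len else None,
        idcodes,
    )

def jtag_scan_passes(output, expected_devices, expected_idcode=0x6BA00477):
    """Pass when the device count and every reported ID code match expectations."""
    devices, _, idcodes = parse_jtag_test(output)
    return (
        devices == expected_devices
        and len(idcodes) == expected_devices
        and all(code == expected_idcode for code in idcodes)
    )
```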

## 3.6.2 Dual CPU JTAG Scan Check from BMC

### Purpose

This test checks that the debug JTAG scan of two Grace CPUs from the BMC functions as expected.

### Prerequisites

Partners should use, at minimum, the QS firmware package dev1 drop and a C05 Bianca board.

### Test Procedure

1. Set the **JTAG\_MUX\_SEL** to active low on the BMC.

```
root@gb200nvl-bmc:~# gpioset `gpiofind "JTAG_MUX_SELECT-0"`=1
```

2. Set the **JTAG\_BYPASS\_B2B** to 1 from the HMC via SMBPBI on both the primary and secondary boards.

```
i2cset -y 1 0x60 0x5c 0x04 0xb4 0x99 0x56 0x80 i # unlock hidden devices via SMBPBI on Primary Board
i2cset -y 2 0x60 0x5c 0x04 0xb4 0x99 0x56 0x80 i # unlock hidden devices via SMBPBI on Secondary Board
```

3. Set the I/O expander (4th bit) to 1 on the secondary board from the HMC.

```
root@gb200nvl-hmc:~# i2ctransfer -y 2 w3@0x0f 0x38 0x02 0x08
```

4. Run the `jtag_test` code attached to this validation guide:

```
root@gb200nvl-bmc:~# ./jtag_test -j /dev/jtag1 -i 0x6ba00477 -f 1000000 -v
```

5. Verify that the number of devices found is 2 and that the ID code of each found device is 0x6ba00477:

```
Found number of devices:2
Found ir length devices:8
idcode:0x6ba00477
idcode:0x6ba00477
```



**Note:** To access the attached test code “`jtag_test`,” click the **Attachment** icon on the left-hand toolbar of this PDF (using Adobe Acrobat Reader or Adobe Acrobat). Select the file, and use the toolbar options (**Open**, **Save**) to retrieve the attachment. Files with the `.nvzip` extension must be renamed to `.zip`; they can then be extracted using 7-Zip or other archive software.

### Pass or Fail Criteria

CPU JTAG can successfully complete the boundary scan from the BMC. The number of devices and the device ID codes match the expected values.

## 3.6.3 CPU JTAG Electrical Validation

### Purpose

This test checks that the debug JTAG electrical signals meet the JTAG timing requirements referenced in the Arm RealView ICE documentation.

**Figure 3-1.** CPU JTAG Electrical Validation

**Table 3-4.** JTAG Timing Specification

| Symbol | Parameter                               | Min.    | Max.        |
|--------|-----------------------------------------|---------|-------------|
| Tbscl  | TCK LOW period                          | 50 ns   | 500 µs      |
| Tbsch  | TCK HIGH period                         | 50 ns   | 500 µs      |
| Tbsod  | TDI and TMS valid from TCK falling edge | --      | 6.0 ns      |
| Tbsis  | TDO setup to TCK (rising)               | 15.0 ns | --          |
| Tbsih  | TDO hold from TCK (rising)              | 6.0 ns  | --          |

### Prerequisites

Partners should use, at minimum, the QS firmware package dev1 drop and a C05 Bianca board.

### Test Procedure

1. Prepare the CPU for the JTAG scan chain as described in Section 3.6.1.
2. Install probe points on the BMC board for the JTAG signals (TDI, TDO, TCK, and TMS).
3. Capture waveforms while the `jtag_test` code is executing.

### Pass or Fail Criteria

Verify that the captured JTAG signals meet the timing specification listed in Table 3-4.
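Table 3-4 also bounds the TCK frequency: with TCK LOW and HIGH periods of at least 50 ns each, the TCK period must be at least 100 ns, that is, at most 10 MHz. At the 1 MHz used by `jtag_test -f 1000000`, a 50% duty cycle gives nominal 500 ns high and low times, well within limits. The Python sketch below encodes Table 3-4 and compares scope measurements against it; the function names are hypothetical, and the measured values in the test are illustrative, not real data.

```python
# Table 3-4 limits in nanoseconds: symbol -> (min, max); None means no limit.
JTAG_LIMITS_NS = {
    "Tbscl": (50.0, 500_000.0),  # TCK LOW period (500 us max)
    "Tbsch": (50.0, 500_000.0),  # TCK HIGH period (500 us max)
    "Tbsod": (None, 6.0),        # TDI and TMS valid from TCK falling edge
    "Tbsis": (15.0, None),       # TDO setup to TCK rising edge
    "Tbsih": (6.0, None),        # TDO hold from TCK rising edge
}

def max_tck_hz():
    """Highest TCK frequency allowed by the minimum LOW and HIGH periods."""
    low_min, _ = JTAG_LIMITS_NS["Tbscl"]
    high_min, _ = JTAG_LIMITS_NS["Tbsch"]
    return 1e9 / (low_min + high_min)

def nominal_tck_times_ns(freq_hz, duty=0.5):
    """Nominal (HIGH, LOW) TCK times in ns for a frequency and duty cycle."""
    period_ns = 1e9 / freq_hz
    return period_ns * duty, period_ns * (1 - duty)

def jtag_violations(measured_ns):
    """Return (symbol, value) pairs whose measured value violates Table 3-4."""
    violations = []
    for symbol, value in measured_ns.items():
        low, high = JTAG_LIMITS_NS[symbol]
        if (low is not None and value < low) or (high is not None and value > high):
            violations.append((symbol, value))
    return violations
```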

## 3.6.4 I2C Functional Bus Scan

### Purpose

This test ensures that the host BMC can access all other I2C buses in the compute tray, as outlined in the *NVIDIA GB200 NVL Bianca Compute Board Product Specification* (NVOnline: 1114953).

### Prerequisites

- Set up the partner's host BMC, or an equivalent I2C host device, with access to the I2C buses.
- Identify the I2C addresses that the host I2C device needs to access.
- Ensure that your I2C host device has the correct parameters set:
  - Bitrate
  - Slave address
  - Internal I2C pull-up resistors disabled within the host I2C device

### Test Procedure

From the host I2C device connected to the SDA and SCL lines, perform a read scan of each bus. For example, on a BMC or HMC with the Linux `i2c-tools` package installed, use `i2cdetect` to scan the buses:

```
# $I2C1_BUS and $I2C2_BUS are the Linux bus numbers of I2C1 and I2C2 on your host
i2cdetect -y -q $I2C1_BUS   # I2C1 scan
i2cdetect -y -q $I2C2_BUS   # I2C2 scan
```

### Pass or Fail Criteria

All device I2C addresses are visible through the I2C bus scan.
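For a scripted version of this check, the Python sketch below parses the grid printed by `i2cdetect` and reports expected addresses that did not respond. It relies on the fixed-width layout of `i2cdetect` output (a two-digit row label and colon, then 16 three-character cells) and treats both hex entries and `UU` (an address already claimed by a kernel driver, which `i2cdetect` does not probe) as present. The list of expected addresses must come from the product specification; the values in the test are placeholders.

```python
def parse_i2cdetect(output):
    """Return the set of 7-bit addresses shown as present in i2cdetect output."""
    found = set()
    for line in output.splitlines():
        # Data rows start with a two-digit hex row base and a colon, e.g. "50: ".
        if len(line) < 4 or line[2] != ":":
            continue
        try:
            row_base = int(line[:2], 16)
        except ValueError:
            continue
        for col in range(16):
            # Each cell is two characters followed by one space.
            cell = line[4 + 3 * col:6 + 3 * col]
            if cell == "UU":
                found.add(row_base + col)  # claimed by a driver, not probed
            elif len(cell) == 2 and cell not in ("--", "  "):
                try:
                    found.add(int(cell, 16))  # device acknowledged this address
                except ValueError:
                    pass
    return found

def missing_addresses(output, expected):
    """Expected addresses that did not show up in the scan, in ascending order."""
    return sorted(set(expected) - parse_i2cdetect(output))
```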

## 3.6.5 I2C Electrical Validation

### Purpose

This test ensures that the I2C management interface meets the I2C electrical specifications in the NXP *I2C-bus specification and user manual* (UM10204, <https://www.nxp.com/docs/en/user-guide/UM10204.pdf>).

### Prerequisites

- Set up an oscilloscope with passive probes (1 MHz bandwidth per channel or higher).
- Ensure direct probing access to the I2C buses highlighted in the *NVIDIA GB200 NVL Bianca Compute Board Product Specification* (NVOnline: 1114953). This may require soldering break-out wires to the net name locations outside of the baseboard.
- Set up the host BMC or an equivalent I2C host device.
  - Establish a method of generating I2C traffic from the host BMC. For example, set up polling on the I2C bus, or develop a script that reads data from an I2C endpoint through OOB methods.
- Ensure that your I2C host device has the correct parameters set:
  - Bitrate
  - Slave address
  - Register address
  - Internal I2C pull-up resistors disabled within the host I2C device

### Test Procedure

1. Connect the oscilloscope and two passive probes to the SDA and SCL lines of the I2C bus.
2. Initiate I2C read/write traffic with the host I2C device.
3. Collect I2C waveforms for each I2C electrical validation parameter.

### Pass or Fail Criteria

The measurements performed on the I2C buses must meet the Philips/NXP I2C electrical specifications. Verify the following I2C parameters:

- V<sub>OL</sub> (Output Low Voltage)
- V<sub>OH</sub> (Output High Voltage)
- Clock Frequency (f<sub>SCL</sub>)
- Rise Time (t<sub>r</sub>)
- Fall Time (t<sub>f</sub>)
- High Time (t<sub>HIGH</sub>)
- Low Time (t<sub>LOW</sub>)
- Setup Time (t<sub>SU;DAT</sub>)
- Hold Time (t<sub>HD;DAT</sub>)
- Hold Time – Start (t<sub>HD;STA</sub>)
- Setup Time – Stop (t<sub>SU;STO</sub>)
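As a sketch of how captured measurements can be compared against the specification, the Python below encodes a subset of the Standard-mode (up to 100 kHz) and Fast-mode (up to 400 kHz) limits from UM10204. Which mode applies to each bus is an assumption to confirm against the product specification. The sketch omits V<sub>OH</sub>, the input thresholds, and the minimum rise and fall times, which depend on the bus supply voltage and pull-up design; verify all limits against the current revision of UM10204 before relying on them.

```python
# Subset of UM10204 limits in SI units: parameter -> (min, max); None = no limit.
UM10204_LIMITS = {
    "standard": {                  # Standard-mode, up to 100 kHz
        "f_SCL": (None, 100e3),    # Hz
        "t_LOW": (4.7e-6, None),   # s
        "t_HIGH": (4.0e-6, None),
        "t_r": (None, 1000e-9),
        "t_f": (None, 300e-9),
        "t_SU;DAT": (250e-9, None),
        "t_HD;DAT": (0.0, None),
        "t_HD;STA": (4.0e-6, None),
        "t_SU;STO": (4.0e-6, None),
        "V_OL": (None, 0.4),       # V, at 3 mA sink current
    },
    "fast": {                      # Fast-mode, up to 400 kHz
        "f_SCL": (None, 400e3),
        "t_LOW": (1.3e-6, None),
        "t_HIGH": (0.6e-6, None),
        "t_r": (None, 300e-9),
        "t_f": (None, 300e-9),
        "t_SU;DAT": (100e-9, None),
        "t_HD;DAT": (0.0, None),
        "t_HD;STA": (0.6e-6, None),
        "t_SU;STO": (0.6e-6, None),
        "V_OL": (None, 0.4),
    },
}

def i2c_violations(mode, measured):
    """Return (parameter, value) pairs that fall outside the limits for `mode`."""
    limits = UM10204_LIMITS[mode]
    violations = []
    for name, value in measured.items():
        low, high = limits[name]
        if (low is not None and value < low) or (high is not None and value > high):
            violations.append((name, value))
    return violations
```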

## 3.6.6 SPI Functional Check

### Purpose

Verify that SPI devices can be accessed and that communication with them works correctly.

### Test Procedure

1. This example procedure checks the BMC connection to the BMC ERoT (EID 0) through MCTP over the SPI bus.
2. Use the following command to check that the BMC ERoT is listed in the MCTP tree:

```
root@dev:~# busctl tree xyz.openbmc_project.MCTP.Control.SPI
`-/xyz
  `-/xyz/openbmc_project
    `-/xyz/openbmc_project/mctp
      `-/xyz/openbmc_project/mctp/0
        `-/xyz/openbmc_project/mctp/0/0
```
3. SPI functionality is verified by successfully retrieving the MCTP information (EIDs) from these ERoTs.
4. To check the BMC to HMC ERoT path, perform an HMC recovery from the BMC over SPI, and then verify that the SPI2 MCTP service is available:

```
## From BMC Terminal ##
systemctl start hmc-recovery.target

busctl list | grep -i mctp    ## Verify SPI2 is available
```

```
:1.179                                    2347 mctp-ctrl root :1.179 mctp-spi0-ctrl.service  -
:1.182                                    2403 mctp-ctrl root :1.182 mctp-i2c2-ctrl.service  -
:1.358                                    9600 mctp-ctrl root :1.358 mctp-i2c14-ctrl.service -
:1.359                                    9601 mctp-ctrl root :1.359 mctp-i2c15-ctrl.service -
:1.360                                    9602 mctp-ctrl root :1.360 mctp-i2c5-ctrl.service  -
:1.502                                   10310 mctp-ctrl root :1.502 mctp-spi2-ctrl.service  -
xyz.openbmc_project.MCTP.Control.SMBus14  9600 mctp-ctrl root :1.358 mctp-i2c14-ctrl.service -
xyz.openbmc_project.MCTP.Control.SMBus15  9601 mctp-ctrl root :1.359 mctp-i2c15-ctrl.service -
xyz.openbmc_project.MCTP.Control.SMBus2   2403 mctp-ctrl root :1.182 mctp-i2c2-ctrl.service  -
xyz.openbmc_project.MCTP.Control.SMBus5   9602 mctp-ctrl root :1.360 mctp-i2c5-ctrl.service  -
xyz.openbmc_project.MCTP.Control.SPI0     2347 mctp-ctrl root :1.179 mctp-spi0-ctrl.service  -
xyz.openbmc_project.MCTP.Control.SPI2    10310 mctp-ctrl root :1.502 mctp-spi2-ctrl.service  -
```

5. Verify that EID 1 exists in the SPI2 tree:

```
busctl tree xyz.openbmc_project.MCTP.Control.SPI2
`- /xyz
  `- /xyz/openbmc_project
    `- /xyz/openbmc_project/mctp
      |- /xyz/openbmc_project/mctp/0
      | `- /xyz/openbmc_project/mctp/0/1
      `- /xyz/openbmc_project/mctp/SPI
```

### Pass or Fail Criteria

Expected device EIDs show up on the MCTP SPI tree confirming the functionality of the SPI buses.
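The EID check can be scripted by extracting the numeric leaf paths from `busctl tree` output, as in the Python sketch below. It assumes, consistent with the examples above (EID 0 for the BMC ERoT on SPI0, EID 1 on SPI2), that endpoint objects appear as `/xyz/openbmc_project/mctp/<network>/<eid>`; the helper names are hypothetical.

```python
import re

# Endpoint objects look like /xyz/openbmc_project/mctp/<network>/<eid>.
EID_PATH_RE = re.compile(r"/xyz/openbmc_project/mctp/(\d+)/(\d+)\s*$")

def eids_from_busctl_tree(tree_output):
    """Return the set of (network, eid) pairs listed in `busctl tree` output."""
    pairs = set()
    for line in tree_output.splitlines():
        match = EID_PATH_RE.search(line)
        if match:
            pairs.add((int(match.group(1)), int(match.group(2))))
    return pairs

def has_eid(tree_output, eid):
    """True if any MCTP network in the tree lists the given EID."""
    return any(found == eid for _, found in eids_from_busctl_tree(tree_output))
```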

## 3.6.7 SPI Electrical Validation

### Purpose

Verify that the SPI lines (MISO, MOSI, SCLK, SS/CS) on the partner BMC boards meet the SPI electrical specifications in terms of voltage levels, timing, and signal integrity. The SPI electrical signal should be verified against its timing specification outlined in Table 3-5.

**Figure 3-2. SPI Electrical Validation**



**Table 3-5. SPI Timing Parameters**

| Symbol           | Parameter               | Unit | Min.            | Max.                              |
|------------------|-------------------------|------|-----------------|-----------------------------------|
| t<sub>CKH</sub>  | SCLK High Time          | ns   | 8.6             | t<sub>CK</sub> – t<sub>CKL</sub>  |
| t<sub>CKL</sub>  | SCLK Low Time           | ns   | 8.6             | t<sub>CK</sub> – t<sub>CKH</sub>  |
| t<sub>O</sub>    | Data Output Propagation | ns   | t<sub>DOH</sub> | 13                                |
| t<sub>DOH</sub>  | Data Output Hold Time   | ns   | 2               | --                                |
| t<sub>DIS</sub>  | Data Input Setup Time   | ns   | 7.6             | 3000                              |
| t<sub>DIH</sub>  | Data Input Hold Time    | ns   | 4               | --                                |
| t<sub>CSS</sub>  | CS Input Setup Time     | ns   | 19.6            | --                                |
| t<sub>CSHR</sub> | CS Input Hold Time      | ns   | 9.8             | --                                |

### Prerequisites

Equipment: Oscilloscope, signal generator, logic analyzer, and appropriate probes.

### Test Procedure

1. For the compute tray, the SPI voltage levels ($V_{IH}$, $V_{IL}$, $V_{OH}$, and $V_{OL}$) use a VDD of 3.3 V. Refer to the device-specific data sheet for the DC electrical characteristics.
2. Connect the oscilloscope probes to the SPI lines (DIN, DOUT, SCLK, SS/CS).
   - To check the BMC to ERoT SPI path on the reference design, use the QSPIO lines from the BMC to the Microchip CEC1736 ERoT.
   - To check the BMC to HMC ERoT SPI path on the reference design, use the J21 connector on the HMC and measure the following bus nets: **EROT\_OOB\_BMC\_CLK\_MC**, **EROT\_OOB\_BMC\_CS0\_MC**, **EROT\_OOB\_BMC\_MOSI\_MC**, and **EROT\_OOB\_MISO\_MC**.



3. Monitor the voltage levels for logic high and logic low on each line.
4. Compare with the required voltage levels as per SPI specifications.
5. Capture the timing of the SPI signals using the oscilloscope.
6. Measure setup time, hold time, clock frequency, and clock-to-output delay.
7. Verify that the timing parameters meet the specifications in Table 3-5.
8. Observe the SPI signals for any anomalies such as noise, overshoot, or ringing.

### Pass or Fail Criteria

- Signals must be clean, without significant distortion or deviation from the expected waveform.
- Voltage levels must fall within the specified range for both logic high and logic low.
- Timing parameters must meet the specifications in Table 3-5.
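The minimum SCLK high and low times in Table 3-5 imply a maximum SCLK frequency of 1 / (8.6 ns + 8.6 ns), approximately 58.1 MHz, and the data output propagation window t<sub>O</sub> is bounded below by the measured hold time t<sub>DOH</sub>. The Python sketch below applies both relationships along with the fixed limits; the dictionary keys are hypothetical names for the Table 3-5 symbols, and all values are in nanoseconds.

```python
# Fixed limits from Table 3-5, in ns: symbol -> (min, max); None means no limit.
SPI_LIMITS_NS = {
    "t_CKH": (8.6, None),
    "t_CKL": (8.6, None),
    "t_DOH": (2.0, None),
    "t_DIS": (7.6, 3000.0),
    "t_DIH": (4.0, None),
    "t_CSS": (19.6, None),
    "t_CSHR": (9.8, None),
}
T_O_MAX_NS = 13.0

def max_sclk_hz(t_ckh_min_ns=8.6, t_ckl_min_ns=8.6):
    """Highest SCLK frequency permitted by the minimum high and low times."""
    return 1e9 / (t_ckh_min_ns + t_ckl_min_ns)

def check_spi_timing(measured_ns):
    """Return the symbols whose measured value (ns) violates Table 3-5."""
    failures = []
    for symbol, (low, high) in SPI_LIMITS_NS.items():
        value = measured_ns.get(symbol)
        if value is None:
            continue
        if (low is not None and value < low) or (high is not None and value > high):
            failures.append(symbol)
    # t_O must lie between the measured t_DOH (or its 2 ns minimum) and 13 ns.
    if "t_O" in measured_ns:
        t_o_min = measured_ns.get("t_DOH", SPI_LIMITS_NS["t_DOH"][0])
        if not t_o_min <= measured_ns["t_O"] <= T_O_MAX_NS:
            failures.append("t_O")
    return failures
```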

## 3.6.8 SGPIO Electrical Validation

The SGPIO communication protocol allows the FPGA on the Bianca compute board to communicate with its accessory boards (BMC, HMC, and IPEX). It is also used to feed various serialized signals to the FPGA using a parallel-to-serial shift buffer topology. For more information on SGPIO implementation on the compute board, refer to the *NVIDIA GB200 NVL Bianca Compute Board Product Specification* (NVOnline: 1114953).

The SGPIO bus consists of four signal lines and is shared between two devices, usually called the initiator and the target. It uses an open-collector configuration and typically has 2.0 kΩ pull-up resistors.

**Figure 3-3. SGPIO Bus Overview**



Figure 3-4 shows the timing parameters for the SGPIO interface between the FPGA and the BMC management module.

**Figure 3-4. SGPIO Timing Diagram**



These are the signals that make up the SGPIO bus:

- **SClock:** Clock signal. The SGPIO interface from the Bianca FPGA runs at 12.5 MHz.
- **SLoad:** Indicates a new frame of data and is synchronous to the clock. A new SGPIO frame is indicated by SLoad being high at a rising clock edge after having been low for at least five clock cycles. The four falling clock edges after a start condition carry a 4-bit value from initiator to target, or vice versa.
- **SDataOut:** Used by the initiator to send data to the target. Consists of 4-bit data packets.
- **SDataIn:** Used by the target to send data back to the initiator. Consists of 4-bit data packets.
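The SLoad start condition described above can be expressed as a small state machine. The Python sketch below assumes the SLoad level has already been sampled once per rising SClock edge (for example, exported from a logic analyzer) and returns the indices of the rising edges that start a new frame; the function name and input format are hypothetical.

```python
def find_frame_starts(sload_samples, min_low_cycles=5):
    """Return indices of rising clock edges at which a new SGPIO frame starts.

    sload_samples[i] is the SLoad level (0 or 1) sampled at rising SClock edge i.
    A frame starts when SLoad is high after having been low for at least
    `min_low_cycles` consecutive clock cycles.
    """
    starts = []
    low_run = 0  # number of consecutive low samples seen so far
    for index, level in enumerate(sload_samples):
        if level:
            if low_run >= min_low_cycles:
                starts.append(index)
            low_run = 0
        else:
            low_run += 1
    return starts
```

At 12.5 MHz, one SClock cycle is 80 ns, so the five-cycle low period before a start condition lasts at least 400 ns.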

**Table 3-6. SGPIO Timing Parameters**

| Symbol          | Parameter            | Minimum | Maximum | Unit |
|-----------------|----------------------|---------|---------|------|
| f<sub>ck</sub>  | Clock frequency      | -       | 12.5    | MHz  |
| t<sub>ldv</sub> | LD valid delay       | -       | 5       | ns   |
| t<sub>ov</sub>  | DATA_OUT valid delay | -       | 5       | ns   |
| t<sub>IS</sub>  | Input setup time     | 10      | -       | ns   |
| t<sub>IH</sub>  | Input hold time      | 1       | -       | ns   |
| T<sub>R</sub>   | Input rise time      | -       | 6       | ns   |
| T<sub>F</sub>   | Input fall time      | -       | 6       | ns   |

### Test Procedure

1. Find the SGPIO signals on the compute tray (Bianca) schematic.

   *Schematic note: place R83 and R5896 close to the FPGA; place R5897 close to U739.*

2. Determine a probing spot for each signal. See the following examples on the reference design.
   - The first set of signals already has good test points. Measure LD and CLK near the FPGA at R83 and R5896, respectively. Measure DATAIN near the parallel-to-serial shift buffer, in this case at R5897, which is near U739.
   - For the second set of signals, measure CLK, LD, and DATAOUT near the FPGA; the schematic in Step 1 shows appropriate probing spots for these. Measure DATAIN as close as possible to the HMC. In this case, the HMC is the initiator, and the FPGA is the target.
3. Locate the probing spots in the layout file. Rework the board to add probes at the necessary spots, or acquire probe holders for testing the signals.

   **Note:** If probes are added, keep them as short as possible to avoid introducing stubs on the physical layer of the bus, and ensure proper grounding.

4. Set up an oscilloscope on the appropriate signals to measure the timing and electrical properties of the bus, and confirm that they are within specification. Using the CLK or LD line as the oscilloscope trigger works best.
5. Generate traffic, and record multiple screenshots and waveform files. Record all necessary measurements, such as clock frequency, clock high and low times, rise and fall times, and data setup and hold times.

### Pass or Fail Criteria

Check the captured signals against the PHY layer and protocol layer characteristics, and verify them against the SGPIO timing parameters in Table 3-6.

## 3.6.9 UART Functional Check

### Purpose

UART is crucial for debugging the compute tray. When the UART is functional, engineers and developers can connect directly to a device with a UART controller from a host PC. Partners should validate that the UART interfaces are functional and meet the electrical specifications.

### Test Procedure

- Identify the UART interface on the BMC module. This UART interface connects directly to the Grace CPU for debugging purposes.
- Identify the UART signals (TX, RX, and GND lines) on the BMC, and connect the UART interface to a lab PC using a USB to TTL UART cable (example: <https://www.adafruit.com/product/954>).
- Verify that serial communication can be established between the UART controller and the lab PC. Partners can use a serial console terminal (for example, PuTTY, Tera Term, or MobaXterm) on the lab PC to confirm.
- Set the terminal baud rate to match the baud rate of the UART controller (921600 bps) to ensure functional data transmission.



### Pass or Fail Criteria

Serial communication over the UART is established between the lab PC and the BMC.

## 3.6.10 UART Electrical Validation

### Purpose

Verify that the UART interface between the BMC and the HMC CPLD meets the electrical specifications in the device data sheet.

**Table 3-7.** UART DC Specifications

| Parameter                            | Symbol   | Min  | Typ | Max | Unit |
|--------------------------------------|----------|------|-----|-----|------|
| Input low voltage                    | $V_{IL}$ | -0.3 | -   | 0.8 | V    |
| Input high voltage                   | $V_{IH}$ | 2.0  | -   | 3.6 | V    |
| Output low voltage @ $I_{OL}$ (min)  | $V_{OL}$ | -    | -   | 0.4 | V    |
| Output high voltage @ $I_{OH}$ (min) | $V_{OH}$ | 2.4  | -   | -   | V    |

**Figure 3-5.** UART $T_{BIT}$ Definition

**Table 3-8.** UART AC Specifications

| Parameter              | Symbol    | Min | Typ | Max    | Unit      |
|------------------------|-----------|-----|-----|--------|-----------|
| Baud Rate              | -         | -   | -   | 3.6864 | Mbit/s    |
| Bit Time               | $T_{BIT}$ | 271 | -   | -      | ns        |
| Input/Output Rise Time | $T_R$     | -   | -   | 0.05   | $T_{BIT}$ |
| Input/Output Fall Time | $T_F$     | -   | -   | 0.05   | $T_{BIT}$ |



**Note:** The UART parameters in this procedure are based on the AST2600, the BMC and HMC device on the reference design. Refer to the data sheet for your specific SoC for the corresponding parameters.

### Prerequisites

- Oscilloscope: Tektronix TDS684C or faster
- At least one (preferably four) Tektronix P6245 Single-ended Probes (with dynamic range of 7+ volts)
- USB to TTL UART cable

### Test Procedure

1. Identify the UART TX and RX pin locations on the MICO connector between the HMC and BMC boards.
2. Send UART traffic from the BMC to the HMC CPLD, and capture the waveforms on the oscilloscope.
3. Check the captured waveforms against the DC and AC electrical specifications in the respective device data sheet, or as outlined in Table 3-7 and Table 3-8.

### Pass or Fail Criteria

The captured UART waveforms meet both the DC and AC electrical specifications outlined above.
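The AC limits in Table 3-8 scale with the bit time, $T_{BIT}$ = 1 / baud rate. At the 921600 bps used in Section 3.6.9, $T_{BIT}$ is about 1085 ns, so the rise and fall times must not exceed about 54 ns (0.05 × $T_{BIT}$); at the 3.6864 Mbit/s maximum, $T_{BIT}$ is about 271 ns, matching the minimum bit time in the table. The Python sketch below computes these limits and checks the DC output levels from Table 3-7; it is an illustrative helper for interpreting scope measurements, with hypothetical function names, not part of any NVIDIA tool.

```python
MAX_BAUD = 3_686_400  # 3.6864 Mbit/s maximum from Table 3-8

def uart_ac_limits(baud):
    """Return (T_BIT, maximum rise/fall time) in ns for a baud rate."""
    if baud > MAX_BAUD:
        raise ValueError("baud rate exceeds the 3.6864 Mbit/s maximum in Table 3-8")
    t_bit_ns = 1e9 / baud
    return t_bit_ns, 0.05 * t_bit_ns

def uart_dc_ok(v_ol, v_oh):
    """Check output levels against Table 3-7 (V_OL <= 0.4 V, V_OH >= 2.4 V)."""
    return v_ol <= 0.4 and v_oh >= 2.4
```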

## 3.6.11 RMII Network Controller Sideband Interface

The Network Controller Sideband Interface (NC-SI) is an electrical interface and protocol defined by the Distributed Management Task Force (DMTF). NC-SI enables the connection of a baseboard management controller (BMC) to one or more network interface controllers (NICs) in a server computer system to enable out-of-band system management. This allows the BMC to use the network connections of the NIC ports for management traffic, in addition to the regular host traffic.

NC-SI is based on the RMII specification with some modifications that allow the connection of multiple network controllers to a single BMC.

The NC-SI port allows OOB management of the BMC and BF-3 from the BF-3's QSFP network. Although this link may appear to let the BMC communicate directly with the BF-3 over IP, it does not: the BMC's eth1 link is forwarded directly to the external QSFP network.

### Purpose

Verify that RMII NC-SI communication between the BMC and BF-3 is functional without any packet loss.

### Prerequisites

Minimum hardware requirements:

- 1x compute board (Bianca or Arial)
- 1x BF-3
- 1x device with an Ethernet-capable QSFP port

**Figure 3-6. RMII NC-SI Diagram**

**Note:** The diagram in Figure 3-6 omits the physical paths from the BMC to the HMC, the BMC to the compute motherboard, and the BMC to the front I/O board. The 10.20.45.XX IP addresses shown are examples and will differ from setup to setup.

### Test Procedure

#### Step 1: Hardware Configuration

1. Set up two compute trays, each having a BMC, HMC, BF-3, and I/O board.
2. Confirm that the BMCs and HMCs are properly connected.
3. Confirm that the compute tray motherboard to the BF-3 cable is present on both systems.
4. Connect a QSFP cable between the two BF-3s. The QSFP ports will be assigned an IP address. This validation guide assumes that the QSFP port closest to the RJ45 port on the BF-3 is used. Other ports can be used, but the network interface name will then be different when assigning the IPs later.
5. Turn on RUN power.
6. For the QSFP ports with a cable installed, confirm their status LEDs are solid green.

#### Step 2: Connect to the BF-3 Host OS

The BF-3 has two hosts:

- Host OS
- Host BMC

For testing NC-SI, the Host OS is used.

There are two paths for connecting the BF-3 to the Host OS:

- Through the TH500 Grace CPU
- Through the BF-3 OOB network interface (RJ45 port).

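Use one of the following connection methods to execute commands on the BF-3 Host OS.

**Method 1: Connect to the BF-3 Host OS through the TH500 Grace CPU**

1. SSH to the Grace OS.
2. Display the IPv6 address of the `tmfifo_net` interface to the BF-3 (`-A1` also prints the `inet6` line that follows each interface line):

```
ip -6 addr | grep -A1 tmfifo_net
```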

Example output:

```
7: tmfifo_net0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UNKNOWN qdisc UNKNOWN
    inet6 fe80::21a:caff:fe:ff02/64 scope link
```

When two BF-3s are installed, there are two `tmfifo_net` interfaces, one for each BF-3.

3. Convert the IPv6 address to the address for the Host OS by replacing the last nibble with a '1'. For example:

```
fe80::21a:caff:fe:ff02  ->  fe80::21a:caff:fe:ff01
```

4. From the TH500, SSH to the BF-3 Host OS using the following format:

```
ssh -6 ubuntu@<BF3 Host OS IPv6 address>%<BF3 network interface>
```

Example command:

```
ssh -6 ubuntu@fe80::21a:caff:fe:ff01%tmfifo_net0
```

If you are using the second BF-3, then use tmfifo\_net1 as the interface.
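The address conversion in step 3 can be scripted with Python's standard `ipaddress` module, as sketched below. It forces the last 4 bits of the address to 1 and appends the interface as the IPv6 zone, matching the example above; the function name is hypothetical.

```python
import ipaddress

def bf3_host_target(tmfifo_addr, interface="tmfifo_net0"):
    """Derive the BF-3 Host OS address (last nibble set to 1) with its zone suffix."""
    # Accept values copied from `ip -6 addr`, which include a prefix length such as /64.
    addr = ipaddress.IPv6Address(tmfifo_addr.split("/")[0])
    host = ipaddress.IPv6Address((int(addr) & ~0xF) | 0x1)
    return f"{host}%{interface}"
```

The result can be used directly in the SSH command from step 4, for example `ssh -6 ubuntu@` followed by the returned string.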

**Method 2: Connect to the BF-3 Host OS through the OOB interface**

- Take a picture of the BF-3's part and serial number label. Record the MAC address labeled "OOB"; it belongs to the BF-3 Host OS.
- Open a shell on a system on the same subnet as the BF-3. You must be able to install nmap on this system; the TH500's OS is recommended.
- Confirm that nmap is installed on the system:

```
sudo apt install nmap
```
- Using the OOB MAC address, run the following commands:

```
MAC='9c:63:c0:7d:21:00' # REPLACE WITH YOUR MAC IN THE SAME FORMAT
# enP5p9s0 is the subnet-facing interface on the example system; replace it as needed
NETWORK=$(ip -4 addr show enP5p9s0 | awk '/inet /{print $2}')
sudo nmap -sn "$NETWORK" | grep -iB2 "$MAC"
```
- The output should contain the IP address  

```
Nmap scan report for 10.20.45.97
Host is up (0.000077s latency).
MAC Address: 9c:63:c0:7d:21:00 (Unknown)
```
- SSH to this IP address.

#### Step 3: Network Configuration

1. Connect to the BMC shell, and check the status of the NC-SI link:

```
dmesg | grep -i ncsi
```
2. If the output matches the following, reboot the BMC by running `reboot`, and then repeat the previous step.

```
[    1.877450] ftgmac100 1e670000.ftgmac: Using NCSI interface
[   48.578827] ftgmac100 1e670000.ftgmac eth1: NCSI: No channel found to configure!
[   49.621217] ftgmac100 1e670000.ftgmac eth1: Wrong NCSI state 0x100 in workqueue
```
3. If the output includes the following, it is okay to continue.

```
[    1.877659] ftgmac100 1e670000.ftgmac: Using NCSI interface
[   25.683782] ftgmac100 1e670000.ftgmac eth1: NCSI: Handler for packet type 0x82 returned -19
[   41.289146] ftgmac100 1e670000.ftgmac eth1: NCSI: 'bad' packet ignored for type 0x8b
```
4. Run `ifconfig eth1`, and confirm that the RX and TX packet counts are non-zero:

```
root@gb200nvl-bmc:~# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 02:7A:4E:32:03:A7
          inet addr:169.254.150.148  Bcast:169.254.255.255  Mask:255.255.0.0
          inet6 addr: fe80::7a:4eff:fe32:3a7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:82 errors:0 dropped:0 overruns:0 frame:0
          TX packets:149 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6897 (6.7 KiB)  TX bytes:9588 (9.3 KiB)
          Interrupt:40
```

5. On the first system's BMC, assign the eth1 interface to a new IP on a new network:

```
ifconfig eth1 192.168.1.6 netmask 255.255.255.0
```

6. On the second system's BMC, assign the eth1 interface to a new IP on the same network as the previous step:

```
ifconfig eth1 192.168.1.7 netmask 255.255.255.0
```

7. On the first system's BF-3 Host OS, assign the interface on the QSFP link a new IP on the same network as the previous step:

```
sudo ifconfig enp3s0f0s0 192.168.1.2 netmask 255.255.255.0
```

8. On the second system's BF-3 Host OS, assign the interface on the QSFP link with a new IP on the same network as the previous step:

```
sudo ifconfig enp3s0f0s0 192.168.1.3 netmask 255.255.255.0
```

These IP addresses are configured temporarily. If a system is rebooted, its IP address for this setup is reassigned, and the corresponding step must be repeated.
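The NC-SI link checks in Step 3 (the driver messages and the eth1 packet counters) can be scripted as in the Python sketch below. The failure signatures are taken from the example `dmesg` output above, and the packet-count pattern accepts both the `RX packets:82` form shown above and the `RX packets 82` form printed by newer net-tools releases; both are assumptions about the log format on your BMC image, and the helper names are hypothetical.

```python
import re

# Messages from the example dmesg output that mean NC-SI did not come up.
NCSI_FAILURE_SIGNATURES = (
    "No channel found to configure",
    "Wrong NCSI state",
)

def ncsi_needs_reboot(dmesg_text):
    """True if the NC-SI driver log matches the failure case that calls for a BMC reboot."""
    return any(sig in dmesg_text for sig in NCSI_FAILURE_SIGNATURES)

def packet_counts(ifconfig_text):
    """Return (rx_packets, tx_packets) parsed from ifconfig output (0 if absent)."""
    rx = re.search(r"RX packets[:\s]+(\d+)", ifconfig_text)
    tx = re.search(r"TX packets[:\s]+(\d+)", ifconfig_text)
    return (int(rx.group(1)) if rx else 0, int(tx.group(1)) if tx else 0)

def eth1_link_ok(ifconfig_text):
    """Pass when both the RX and TX packet counters are non-zero."""
    rx, tx = packet_counts(ifconfig_text)
    return rx > 0 and tx > 0
```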

#### Step 4: Test the Links

1. On the first BF-3 Host OS, ping the second BMC using the 192.168.1.X network: `ping 192.168.1.7`
2. On the second BF-3 Host OS, ping the first BMC using the 192.168.1.X network: `ping 192.168.1.6`
3. Both ping commands should report 0% packet loss.

### Pass or Fail Criteria

Both BF-3s can successfully ping the BMCs with 0% packet loss.
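When the ping results are collected in a script, the loss percentage can be read from the summary line that `ping` prints when it exits, as sketched below. The pattern assumes the usual `... received, 0% packet loss` summary of Linux ping implementations; running `ping` with a fixed count (for example, `-c 100`) makes it exit on its own. The helper names are hypothetical.

```python
import re

def packet_loss_percent(ping_output):
    """Return the packet loss percentage from a ping summary line."""
    match = re.search(r"([\d.]+)% packet loss", ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    return float(match.group(1))

def ping_passes(ping_output):
    """Pass only when the summary reports 0% packet loss."""
    return packet_loss_percent(ping_output) == 0.0
```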

## 3.6.12 RMII NC-SI Electrical Validation

### Purpose

Verify that the RMII NC-SI interface between the BMC and the BF-3 or NIC meets the RMII NC-SI electrical specifications, which are outlined in the following figures.

**Figure 3-7. RMII NC-SI DC Specifications**

| Parameter                                                  | Symbol                   | Conditions                                                 | Minimum | Typical | Maximum   | Units   |
|------------------------------------------------------------|--------------------------|------------------------------------------------------------|---------|---------|-----------|---------|
| IO reference voltage                                       | $V_{ref}$ <sup>[a]</sup> |                                                            | 3.0     | 3.3     | 3.6       | V       |
| Signal voltage range                                       | $V_{abs}$                |                                                            | -0.300  |         | 3.765     | V       |
| Input low voltage                                          | $V_{il}$                 |                                                            |         |         | 0.8       | V       |
| Input high voltage                                         | $V_{ih}$                 |                                                            | 2.0     |         |           | V       |
| Input high current                                         | $I_{ih}$                 | $V_{in} = V_{ref} = V_{ref,max}$                           | 0       |         | 200       | $\mu A$ |
| Input low current                                          | $I_{il}$                 | $V_{in} = 0 V$                                             | -20     |         | 0         | $\mu A$ |
| Output low voltage                                         | $V_{ol}$                 | $I_{ol} = 4 mA, V_{ref} = \text{min}$                      | 0       |         | 400       | mV      |
| Output high voltage                                        | $V_{oh}$                 | $I_{oh} = -4 mA, V_{ref} = \text{min}$                     | 2.4     |         | $V_{ref}$ | V       |
| Clock midpoint reference level                             | $V_{ckm}$                |                                                            |         |         | 1.4       | V       |
| Leakage current for output signals in high-impedance state | $I_z$                    | $0 \leq V_{in} \leq V_{ref}$<br>at $V_{ref} = V_{ref,max}$ | -20     |         | 20        | $\mu A$ |

[a]  $V_{ref}$  = Bus high reference level (typically the NC-SI logic supply voltage). This parameter replaces the term *supply voltage* because actual devices may have internal mechanisms that determine the operating reference for the NC-SI that are different from the devices' overall power supply inputs.

$V_{ref}$  is a reference point that is used for measuring parameters (such as overshoot and undershoot) and for determining limits on signal levels that are generated by a device. In order to facilitate system implementations, a device shall provide a mechanism (for example, a power supply pin, internal programmable reference, or reference level pin) to allow  $V_{ref}$  to be set to within 20 mV of any point in the specified  $V_{ref}$  range. This approach enables a system integrator to establish an interoperable  $V_{ref}$  level for devices on the NC-SI.
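To make the DC comparison repeatable, the limits from Figure 3-7 can be encoded and checked against oscilloscope readings. The Python sketch below assumes the readings have already been exported from the scope as named values in volts; the key names (such as `V_signal_min`) are hypothetical. It covers only the reference level, the absolute signal excursion, and the driver output levels; the input thresholds V_il/V_ih and the current limits are left out because they characterize the receiver and leakage rather than waveform levels, and the table's test conditions (for example, I_ol = 4 mA at minimum V_ref) are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Limit:
    """Inclusive limit; None means the table gives no bound on that side."""
    minimum: Optional[float]
    maximum: Optional[float]

    def contains(self, value: float) -> bool:
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


def check_dc(measured: Dict[str, float]) -> Dict[str, bool]:
    """Compare scope readings (volts) with Figure 3-7 limits.

    Keys are hypothetical names for this sketch; "V_ref" must be present
    because it sets the upper bound of V_oh.
    """
    v_ref = measured["V_ref"]
    limits = {
        "V_ref": Limit(3.0, 3.6),
        # V_abs applies to both the lowest and the highest signal excursion.
        "V_signal_min": Limit(-0.300, 3.765),
        "V_signal_max": Limit(-0.300, 3.765),
        "V_ol": Limit(0.0, 0.400),  # 400 mV maximum
        "V_oh": Limit(2.4, v_ref),  # upper bound is the measured V_ref
    }
    # Readings with no matching limit are ignored rather than failed.
    return {name: limits[name].contains(value)
            for name, value in measured.items() if name in limits}
```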

**Figure 3-8. RMII NC-SI AC Specifications**

[Figure 18 – AC measurements]

Table 121 – AC specifications

| Parameter                                                                                | Symbol                              | Minimum | Typical | Maximum    | Units |
|------------------------------------------------------------------------------------------|-------------------------------------|---------|---------|------------|-------|
| REF_CLK Frequency                                                                        |                                     |         | 50      | 50+100 ppm | MHz   |
| REF_CLK Duty Cycle                                                                       |                                     | 35      |         | 65         | %     |
| Clock-to-out <sup>[a]</sup><br>(10 pF ≤ C <sub>load</sub> ≤ 50 pF)                       | T <sub>co</sub>                     | 2.5     |         | 12.5       | ns    |
| Skew between clocks                                                                      | T <sub>skew</sub>                   |         |         | 1.5        | ns    |
| TXD[1:0], TX_EN, RXD[1:0], CRS_DV, RX_ER, and ARB_IN data setup to REF_CLK rising edge   | T <sub>su</sub>                     | 3       |         |            | ns    |
| TXD[1:0], TX_EN, RXD[1:0], CRS_DV, RX_ER, and ARB_OUT data hold from REF_CLK rising edge | T <sub>hd</sub>                     |         |         |            | ns    |
| Signal Rise/Fall Time                                                                    | T <sub>r</sub> /T <sub>f</sub>      | 0.5     |         | 6          | ns    |
| REF_CLK Rise/Fall Time                                                                   | T <sub>ckr</sub> /T <sub>ckf</sub>  | 0.5     |         | 3.5        | ns    |
| Interface Power-Up High-Impedance Interval                                               | T <sub>pwrz</sub>                   | 2       |         |            | μs    |
| Power Up Transient Interval<br>(recommendation)                                          | T <sub>pwt</sub>                    |         |         | 100        | ns    |
| Power Up Transient Level (recommendation)                                                | V <sub>pwt</sub>                    | -200    |         | 200        | mV    |
| Interface Power-Up Output Enable Interval                                                | T <sub>pwe</sub>                    |         |         | 10         | ms    |
| EXT_CLK Startup Interval                                                                 | T <sub>clkstr</sub>                 |         |         | 100        | ms    |

[a] This timing relates to the output pins, while T<sub>su</sub> and T<sub>hd</sub> relate to timing at the input pins.
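The REF_CLK upper limit in Figure 3-8 works out as follows: 100 ppm of 50 MHz is 50,000,000 × 100 / 1,000,000 = 5,000 Hz, so the maximum is 50.005 MHz. The Python sketch below encodes that value together with the other AC limits that have numeric bounds in the table. The parameter keys are hypothetical, and rows whose minimum is blank in the table (REF_CLK frequency, T_hd) are left unbounded on that side rather than guessed.

```python
from typing import Dict, Optional, Tuple

NOMINAL_REF_CLK_HZ = 50_000_000
# +100 ppm: 50 MHz * 100 / 1,000,000 = 5 kHz, giving 50.005 MHz.
REF_CLK_MAX_HZ = NOMINAL_REF_CLK_HZ + NOMINAL_REF_CLK_HZ * 100 // 1_000_000

# (minimum, maximum) from Figure 3-8; None means the table lists no bound.
AC_LIMITS: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "ref_clk_hz": (None, REF_CLK_MAX_HZ),
    "ref_clk_duty_pct": (35.0, 65.0),
    "t_co_ns": (2.5, 12.5),
    "t_skew_ns": (None, 1.5),
    "t_su_ns": (3.0, None),
    "signal_rise_fall_ns": (0.5, 6.0),
    "ref_clk_rise_fall_ns": (0.5, 3.5),
}


def check_ac(measured: Dict[str, float]) -> Dict[str, bool]:
    """Return pass/fail per measured parameter; unknown keys raise KeyError."""
    results = {}
    for name, value in measured.items():
        low, high = AC_LIMITS[name]
        results[name] = ((low is None or value >= low)
                         and (high is None or value <= high))
    return results
```

Integer arithmetic is used for the frequency limit so that the 50.005 MHz bound is exact rather than subject to floating-point rounding.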

#### **Test Procedure**

1. Set up two compute trays, each having a BMC, HMC, BF-3, and I/O board.
2. Confirm that the BMCs and HMCs are properly connected.

Determine a probing point for each NC-SI signal. For example, the NC-SI signals can be identified on the BMC schematics (NVOnline: 1115463 BMC Management Board Design Kit). Partners should check their own BMC designs to identify these signals and cross-check the layout file for an appropriate test point.



| Signal   | Description                                                                       |
|----------|-----------------------------------------------------------------------------------|
| REF_CLK  | 50 MHz clock reference for receive, transmit and control interface                |
| CRS_DV   | Carrier sense and receive data validity for the traffic sent from one of the NICs |
| RXD[1:0] | Receive data (from the NIC to the BMC)                                            |
| TX_EN    | Transmit enable and data validity for the traffic sent from the BMC               |
| TXD[1:0] | Transmit data (from the BMC to the NIC)                                           |
| RX_ER    | Receive error signal, sent from the NIC to the BMC (optional)                     |
| ARB_IN   | Input data hardware arbitration (optional)                                        |
| ARB_OUT  | Output data hardware arbitration (optional)                                       |

3. Submit a rework to add probe points at the required locations, or acquire probe holders for probing the signals.
4. Connect an oscilloscope to the appropriate signals, measure the DC electrical and AC timing properties of the NC-SI bus, and confirm that they are within the DC and AC specifications.

#### **Pass or Fail Criteria**

The captured NC-SI waveforms meet both the DC (Figure 3-7) and AC (Figure 3-8) electrical specifications.

## 3.7 Networking I/O Board

This section details the networking I/O board for the NVIDIA GB200 and GB300 NVL72 systems.

### 3.7.1 CX-7 and CX-8 Mezzanine Network Card Ethernet Validation

This section contains steps to set up and run the I/O stress write test over the Ethernet protocol through the OSFP ports of the CX-7 and CX-8 on the networking I/O board.

Partners are recommended to use the **ib\_write\_bw** tool from the **PerfTest** package.

#### Test Procedure

Follow these steps for the initial setup:

1. Run **mst start**.
2. Check **mst status -v**.
3. Verify that **mlx5\_0** and **mlx5\_1** are listed with MST device **mt4129**, as follows:

| DEVICE_TYPE       | MST                         | PCI          | RDMA   | NET             | NUMA |
|-------------------|-----------------------------|--------------|--------|-----------------|------|
| BlueField3(rev:1) | /dev/mst/mt41692_pciconf0.1 | 0006:01:00.1 | mlx5_3 | net-ibP6s6f1    | 0    |
| BlueField3(rev:1) | /dev/mst/mt41692_pciconf0   | 0006:01:00.0 | mlx5_2 | net-ibP6s6f0    | 0    |
| ConnectX7(rev:0)  | /dev/mst/mt4129_pciconf1    | 0002:03:00.0 | mlx5_1 | net-enP2p3s0np0 | 0    |
| ConnectX7(rev:0)  | /dev/mst/mt4129_pciconf0    | 0000:03:00.0 | mlx5_0 | net-enp3s0np0   | 0    |
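When the initial setup is repeated across many trays, step 3 can be checked programmatically. The Python sketch below parses the whitespace-separated rows that `mst status -v` prints (the table above is a rendering of that output) and confirms that `mlx5_0` and `mlx5_1` are both backed by `mt4129` devices. It assumes the DEVICE_TYPE, MST, PCI, RDMA, NET, NUMA column order shown in this guide; the function names are hypothetical.

```python
from typing import Dict, List, Sequence

_COLUMNS = ("device_type", "mst", "pci", "rdma", "net", "numa")


def parse_mst_status(output: str) -> List[Dict[str, str]]:
    """Return one dict per PCI device row of `mst status -v`.

    Assumes the six whitespace-separated columns shown in this guide; rows
    with a missing column (for example, no NET interface) are skipped.
    """
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == len(_COLUMNS) and fields[1].startswith("/dev/mst/"):
            rows.append(dict(zip(_COLUMNS, fields)))
    return rows


def rdma_devices_on(output: str, mst_id: str = "mt4129",
                    expected: Sequence[str] = ("mlx5_0", "mlx5_1")) -> bool:
    """True if every expected RDMA device is backed by an `mst_id` device."""
    found = {row["rdma"] for row in parse_mst_status(output)
             if f"/{mst_id}_" in row["mst"]}
    return set(expected) <= found
```

Matching on `/mt4129_` rather than `mt4129` alone keeps BlueField-3 paths such as `/dev/mst/mt41692_pciconf0` from being counted.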

Follow these steps for the Ethernet test:

1. Assign IP addresses to **mlx5\_0** and **mlx5\_1**.

```
sudo ifconfig enP2p3s0np0 11.11.11.7 # mlx5_1; any arbitrary IP
sudo ifconfig enp3s0np0 11.11.11.8 # mlx5_0; any arbitrary IP
```
2. Run **ibstat** and verify that the port state is "Active" and the link layer is "Ethernet."
3. Start the server and client with the following commands, and verify that the average throughput is approximately 740 Gb/s (the BW average[Gb/sec] column in the following output).

```
nvidia@localhost:~$ ib_write_bw -d mlx5_0 --report_gbits -b --run_infinitely &
[1] 7860
nvidia@localhost:~$ WARNING: BW peak won't be measured in this run.

*****
* Waiting for client to connect... *
*****
```

```
nvidia@localhost:~$ ib_write_bw -d mlx5_1 --report_gbits -b --run_infinitely 11.11.11.8 &
nvidia@localhost:~$ WARNING: BW peak won't be measured in this run.
```

```
                    RDMA_Write Bidirectional BW Test
Dual-port       : OFF          Device         : mlx5_0
Number of qps   : 1            Transport type : IB
Connection type : RC           Using SRQ      : OFF
PCIe relax order: ON
ibv_wr* API     : ON
TX depth        : 128
CQ Moderation   : 1
Mtu             : 1024[B]
Link type       : Ethernet
GID index       : 3
Max inline data : 0[B]
rdma_cm QPs     : OFF
Data ex. method : Ethernet
```

```
                    RDMA_Write Bidirectional BW Test
Dual-port       : OFF          Device         : mlx5_1
Number of qps   : 1            Transport type : IB
Connection type : RC           Using SRQ      : OFF
PCIe relax order: ON
ibv_wr* API     : ON
TX depth        : 128
CQ Moderation   : 1
Mtu             : 1024[B]
Link type       : Ethernet
GID index       : 3
Max inline data : 0[B]
rdma_cm QPs     : OFF
Data ex. method : Ethernet
```

```
local address: LID 0000 QPN 0x0129 PSN 0x6032f RKey 0x203ebd VAddr 0x00c9bfb01f0000
GID: 00:00:00:00:00:00:00:00:255:255:11:11:11:07
local address: LID 0000 QPN 0x0129 PSN 0x4b191c RKey 0x203ebd VAddr 0x00b9b586c20000
GID: 00:00:00:00:00:00:00:00:255:255:11:11:11:08
remote address: LID 0000 QPN 0x0129 PSN 0x6032f RKey 0x203ebd VAddr 0x00c9bfb01f0000
GID: 00:00:00:00:00:00:00:00:255:255:11:11:11:07
remote address: LID 0000 QPN 0x0129 PSN 0x4b191c RKey 0x203ebd VAddr 0x00b9b586c20000
GID: 00:00:00:00:00:00:00:00:255:255:11:11:11:08
```

| #bytes | #iterations | BW peak[Gb/sec] | BW average[Gb/sec] | MsgRate[Mpps] |
|--------|-------------|-----------------|--------------------|---------------|
| 65536  | 3525074     | 0.00            | 739.26             | 1.410017      |
| 65536  | 3525051     | 0.00            | 739.25             | 1.410008      |
| 65536  | 3525101     | 0.00            | 739.26             | 1.410018      |
| 65536  | 3525299     | 0.00            | 739.26             | 1.410018      |
| 65536  | 3525085     | 0.00            | 739.26             | 1.410026      |
| 65536  | 3525086     | 0.00            | 739.26             | 1.410027      |
| 65536  | 3525077     | 0.00            | 739.26             | 1.410029      |
| 65536  | 3525071     | 0.00            | 739.25             | 1.410016      |
| 65536  | 3525072     | 0.00            | 739.26             | 1.410021      |
| 65536  | 3525087     | 0.00            | 739.26             | 1.410029      |
| 65536  | 3525084     | 0.00            | 739.25             | 1.410013      |
| 65536  | 3525085     | 0.00            | 739.26             | 1.410025      |
| 65536  | 3525078     | 0.00            | 739.26             | 1.410025      |
| 65536  | 3525071     | 0.00            | 739.25             | 1.410017      |
| 65536  | 3525075     | 0.00            | 739.26             | 1.410018      |
| 65536  | 3525086     | 0.00            | 739.26             | 1.410028      |
| 65536  | 3525083     | 0.00            | 739.26             | 1.410029      |
| 65536  | 3525085     | 0.00            | 739.26             | 1.410027      |
| 65536  | 3525077     | 0.00            | 739.26             | 1.410018      |

4. Verify the PCIe link speed and width with **lspci**.

```
root@localhost:/home/nvidia# lspci -vvvs 0002:03:00.0 | grep -i "lnk\|dev"
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbs... <MAbort- >SERR- <PERR- INTx-
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM not supported
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
        LnkSta: Speed 32GT/s (ok), Width x16 (ok)
        DevCap2: Completion Timeout: Range ABC, TimeoutDist+ NROPrPrP- LTR-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
        LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
        VF offset: 1, stride: 1, Device ID: 101e
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
root@localhost:/home/nvidia# lspci -vvvs 0000:03:00.0 | grep -i "lnk\|dev"
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbs... <MAbort- >SERR- <PERR- INTx-
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM not supported
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
        LnkSta: Speed 32GT/s (ok), Width x16 (ok)
        DevCap2: Completion Timeout: Range ABC, TimeoutDist+ NROPrPrP- LTR-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
        LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
        VF offset: 1, stride: 1, Device ID: 101e
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
```

5. Verify that **dmesg** reports no system error messages.
6. Verify that the temperature sensors on the CX-7 and CX-8 I/O board are reporting properly.

```
nvidia@localhost:~$ sensors|grep -i mlx5 -A2
mlx5-pci-60100
Adapter: PCI adapter
asic:      +56.0°C  (crit = +105.0°C, highest = +57.0°C)
--
mlx5-pci-0300
Adapter: PCI adapter
asic:      +48.0°C  (crit = +105.0°C, highest = +51.0°C)
--
mlx5-pci-60101
Adapter: PCI adapter
asic:      +56.0°C  (crit = +105.0°C, highest = +57.0°C)
--
mlx5-pci-20300
Adapter: PCI adapter
asic:      +49.0°C  (crit = +105.0°C, highest = +52.0°C)
```

7. Use **mlxlink** to check the port and cable status, including the FOM (eye opening) and error counters.

```
nvidia@localhost:~$ sudo mlxlink -d /dev/mst/mt4129_pciconf0 -c -e -m

Operational Info
-----
State : Active
Physical state : LinkUp
Speed : 400G
Width : 4x
FEC : Standard_RS-FEC - (544, 514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info
-----
Enabled Link Speed (Ext.) : 0x0001bff2
(400G_4X, 400G_8X, 200G_2X, 200G_4X, 100G_1X, 100G_2X, 100G_4X, 50G_1X, 50G_2X, 40G, 25G, 10G, 1G)
Supported Cable Speed (Ext.) : 0x0009bffe
(800G_8X, 400G_4X, 400G_8X, 200G_2X, 200G_4X, 100G_1X, 100G_2X, 100G_4X, 50G_1X, 50G_2X, 40G, 25G, 10G, 5G, 2.5G, 1G)

Troubleshooting Info
-----
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed.
Time to Link Up : 0.360 sec

Tool Information
-----
Firmware Version : 28.98.9122
amBER Version   : 3.2
MFT Version     : mft 4.28.0-97

Physical Counters and BER Info
-----
Time Since Last Clear [Min] : 12.4
Effective Physical Errors  : 0
Effective Physical BER     : 15E-255
Raw Physical Errors Per Lane: 475, 871, 1009, 774
Raw Physical BER Per Lane  : 2E-11, 4E-11, 5E-11, 4E-11
Link Down Counter          : 0
Link Error Recovery Counter: 0
Raw Physical BER           : 1E-11

EYE Opening Info
-----
FOM Mode : SLRG_FOM_MODE_EYEO
Lane     : 0, 1, 2, 3
Initial FOM : 109, 102, 99, 101
Last FOM   : 102, 108, 105, 108
Upper Grades : 102, 104, 104, 106
Mid Grades  : 123, 119, 116, 123
Lower Grades : 103, 102, 104, 105

Module Info
-----
Identifier : OSFP
Compliance : 800G-ETC-CR8 or 800GBASE-CR8
Cable Technology : Copper cable unequalized
Cable Type    : Passive copper cable
OUI          : Other
Vendor Name   : Amphenol
Vendor Part Number : NNDDDA-0006
Vendor Serial Number : APF243700691DN
Rev          : 1
Wavelength [nm] : N/A
Transfer Distance [m] : 0.5
Attenuation (5g,7g,12g,25g) [dB] : 3, 4, 6, 12
FW Version    : N/A
Digital Diagnostic Monitoring : No
Power Class   : N/A
CDR RX        : N/A
CDR TX        : N/A
LOS Alarm     : N/A
Temperature [C] : N/A
Voltage [mV]   : N/A
Bias Current [mA] : N/A
Rx Power Current [dBm] : N/A
Tx Power Current [dBm] : N/A
Intra-ASIC Latency [ns] : N/A
Module Datapath Latency [ns] : N/A
Round Trip Latency [ns] : N/A
SNR Media Lanes [dB] : N/A
SNR Host Lanes [dB] : N/A
IB Cable Width : 1x,2x,4x,8x
Memory Map Revision : 81
Linear Direct Drive : 0
Cable Breakout : OSFP to OSFP
SMF Length : N/A
MAX Power : 1
Cable Rx AMP : N/A
Cable Rx Emphasis (Pre) : N/A
Cable Rx Post Emphasis : N/A
Cable Tx Equalization : N/A
Wavelength Tolerance : N/A
Module State : N/A
DataPath state [per lane] : N/A
Rx Output Valid [per lane] : N/A
Nominal bit rate : N/A
Rx Power Type : OMA
Manufacturing Date : 21_09_24
Active Set Host Compliance Code : 800G-ETC-CR8 or 800GBASE-CR8
Active Set Media Compliance Code : N/A
Error Code Response : ConfigUndefined
Module FW Fault : N/A
DataPath FW Fault : N/A
Tx Fault [per lane] : N/A
Tx LOS [per lane] : N/A
Tx CDR LOL [per lane] : N/A
Rx LOS [per lane] : N/A
Rx CDR LOL [per lane] : N/A
Tx Adaptive EQ Fault [per lane] : N/A
```

```
nvidia@localhost:~$ sudo mlxlink -d /dev/mst/mt4129_pciconf1 -c -e -m

Operational Info
-----
State : Active
Physical state : LinkUp
Speed : 400G
Width : 4x
FEC : Standard_RS-FEC - (544, 514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info
-----
Enabled Link Speed (Ext.) : 0x0001bff2
(400G_4X, 400G_8X, 200G_2X, 200G_4X, 100G_1X, 100G_2X, 100G_4X, 50G_1X, 50G_2X, 40G, 25G, 10G, 1G)
Supported Cable Speed (Ext.) : 0x0009bffe
(800G_8X, 400G_4X, 400G_8X, 200G_2X, 200G_4X, 100G_1X, 100G_2X, 100G_4X, 50G_1X, 50G_2X, 40G, 25G, 10G, 5G, 2.5G, 1G)

Troubleshooting Info
-----
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed.
Time to Link Up : 0.365 sec

Tool Information
-----
Firmware Version : 28.98.9122
amBER Version : 3.2
MFT Version : mft 4.28.0-97

Physical Counters and BER Info
-----
Time Since Last Clear [Min] : 12.5
Effective Physical Errors : 0
Effective Physical BER : 1.5E-255
Raw Physical Errors Per Lane : 1233, 1773, 871, 726
Raw Physical BER Per Lane : 6E-11, 1E-10, 4E-11, 4E-11
Link Down Counter : 0
Link Error Recovery Counter : 0
Raw Physical BER : 1E-11

EYE Opening Info
-----
FOM Mode : SLRG_FOM_MODE_EYEO
Lane : 0, 1, 2, 3
Initial FOM : 103, 102, 93, 105
Last FOM : 107, 98, 107, 104
Upper Grades : 103, 97, 105, 100
Mid Grades : 117, 108, 109, 115
Lower Grades : 106, 100, 101, 108

Module Info
-----
Identifier : OSFP
Compliance : 800G-ETC-CR8 or 800GBASE-CR8
Cable Technology : Copper cable unequalized
Cable Type : Passive copper cable
OUI : Other
Vendor Name : Amphenol
Vendor Part Number : NNDDDA-0006
Vendor Serial Number : APF243700691DN
Rev : 1
Wavelength [nm] : N/A
Transfer Distance [m] : 0.5
Attenuation (5g,7g,12g,25g) [dB] : 3, 4, 6, 12
FW Version : N/A
Digital Diagnostic Monitoring : No
Power Class : N/A
CDR RX : N/A
CDR TX : N/A
LOS Alarm : N/A
Temperature [C] : N/A
Voltage [mV] : N/A
Bias Current [mA] : N/A
Rx Power Current [dBm] : N/A
Tx Power Current [dBm] : N/A
Intra-ASIC Latency [ns] : N/A
Module Datapath Latency [ns] : N/A
Round Trip Latency [ns] : N/A
SNR Media Lanes [dB] : N/A
SNR Host Lanes [dB] : N/A
IB Cable Width : 1x, 2x, 4x, 8x
Memory Map Revision : 81
Linear Direct Drive : 0
Cable Breakout : OSFP to OSFP
SMF Length : N/A
MAX Power : 1
Cable Rx AMP : N/A
Cable Rx Emphasis (Pre) : N/A
Cable Rx Post Emphasis : N/A
Cable Tx Equalization : N/A
Wavelength Tolerance : N/A
Module State : N/A
DataPath state [per lane] : N/A
Rx Output Valid [per lane] : N/A
Nominal bit rate : N/A
Rx Power Type : OMA
Manufacturing Date : 21_09_24
Active Set Host Compliance Code : 800G-ETC-CR8 or 800GBASE-CR8
Active Set Media Compliance Code : N/A
Error Code Response : ConfigUndefined
Module FW Fault : N/A
DataPath FW Fault : N/A
Tx Fault [per lane] : N/A
Tx LOS [per lane] : N/A
Tx CDR LOL [per lane] : N/A
Rx LOS [per lane] : N/A
Rx CDR LOL [per lane] : N/A
Tx Adaptive EQ Fault [per lane] : N/A
```
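Steps 4 and 7 above can also be scripted. The following Python sketch is a hypothetical helper, not an NVIDIA or MFT tool: it checks that the `lspci` LnkSta speed and width match LnkCap, and it reads the `Name : value` lines of `mlxlink -c -e -m` output to confirm that the port is Active/LinkUp with zero effective physical errors, link-down events, and error-recovery events. It relies on the text formats shown in this section, which may differ between lspci and MFT versions.

```python
import re
from typing import Dict

# "Speed 32GT/s ... Width x16" as printed on lspci LnkCap/LnkSta lines.
_LINK_RE = re.compile(r"Speed (\d+(?:\.\d+)?)GT/s.*?Width x(\d+)")


def pcie_link_at_capability(lspci_output: str) -> bool:
    """True if the negotiated PCIe speed/width (LnkSta) equals LnkCap."""
    cap = sta = None
    for line in lspci_output.splitlines():
        text = line.strip()
        if text.startswith("LnkCap:"):
            cap = _LINK_RE.search(text)
        elif text.startswith("LnkSta:"):
            sta = _LINK_RE.search(text)
    if cap is None or sta is None:
        raise ValueError("LnkCap/LnkSta speed and width not found")
    return cap.groups() == sta.groups()


def parse_mlxlink(output: str) -> Dict[str, str]:
    """Map each 'Name : value' line of mlxlink output to a dict entry.

    Banner lines and '-----' separators have no colon and are skipped;
    if a name repeats, the last value wins.
    """
    fields = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() and value.strip():
            fields[name.strip()] = value.strip()
    return fields


def mlxlink_healthy(fields: Dict[str, str]) -> bool:
    """Active/LinkUp with zero effective errors, link downs, and recoveries."""
    counters = ("Effective Physical Errors", "Link Down Counter",
                "Link Error Recovery Counter")
    return (fields.get("State") == "Active"
            and fields.get("Physical state") == "LinkUp"
            and all(fields.get(name) == "0" for name in counters))
```

Comparing LnkSta against LnkCap, rather than against a hard-coded 32GT/s x16, lets the same check flag a downgraded link on any slot.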

#### Pass or Fail Criteria

Ethernet link throughput, error counters, and temperature sensor readings are all within their expected ranges.
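For the throughput and temperature parts of these criteria, the raw `ib_write_bw` and `sensors` text can be parsed as in the Python sketch below. The `min_avg_gbps` threshold and the optional `crit_margin_c` margin are hypothetical parameters that the tester must choose; this guide states only that the Ethernet average should be around 740 Gb/s. The temperature check compares each ASIC reading against the `crit` value that `sensors` reports.

```python
import re
from typing import List, Tuple

# A perftest result row: #bytes #iterations BW-peak BW-average MsgRate.
_BW_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")
# An lm-sensors line such as "asic:  +56.0°C  (crit = +105.0°C, highest = +57.0°C)".
_ASIC_TEMP = re.compile(r"([+-]?\d+(?:\.\d+)?)°C\s+\(crit = ([+-]?\d+(?:\.\d+)?)°C")


def bw_averages(perftest_output: str) -> List[float]:
    """Collect the BW average[Gb/sec] column from every result row."""
    averages = []
    for line in perftest_output.splitlines():
        match = _BW_ROW.match(line)
        if match:
            averages.append(float(match.group(4)))
    return averages


def asic_temps(sensors_output: str) -> List[Tuple[float, float]]:
    """Return (current, crit) pairs in degrees Celsius."""
    return [(float(m.group(1)), float(m.group(2)))
            for m in _ASIC_TEMP.finditer(sensors_output)]


def io_board_passes(perftest_output: str, sensors_output: str,
                    min_avg_gbps: float, crit_margin_c: float = 0.0) -> bool:
    """True if every BW sample meets min_avg_gbps and every ASIC is below crit."""
    averages = bw_averages(perftest_output)
    temps = asic_temps(sensors_output)
    if not averages or not temps:
        return False  # nothing parsed: treat as a failure, not a pass
    bw_ok = min(averages) >= min_avg_gbps
    temp_ok = all(now < crit - crit_margin_c for now, crit in temps)
    return bw_ok and temp_ok
```

Using the minimum of all samples, rather than their mean, means a single throughput dip during the run is enough to fail the check.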

### 3.7.2 CX-7 and CX-8 Mezzanine Network Card I/O Board InfiniBand Validation

#### Purpose

This section contains steps to set up and run the I/O stress write test over the InfiniBand protocol through the OSFP ports of the CX-7 and CX-8 on the networking I/O board.

Partners are recommended to use the **ib\_write\_bw** tool from the **PerfTest** package.

#### Test Procedure

Follow these steps for the initial setup:

1. Run **mst start**
2. Check **mst status -v**
3. Verify that **mlx5\_0** and **mlx5\_1** are listed with MST device **mt4129**.

Follow these steps for the InfiniBand test:

1. Run **ibstat** and verify that the port state is "Active" and the link layer is "InfiniBand."
2. Assign IP addresses to **mlx5\_0** and **mlx5\_1**.

```
sudo ifconfig ibp3s0 11.11.11.7 # mlx5_0; any arbitrary IP
sudo ifconfig ibP2p3s0 11.11.11.8 # mlx5_1; any arbitrary IP
```

3. Arm and activate the ports.

```
sudo ibportstate -C mlx5_0 -P 1 -D 0 1 arm;
sudo ibportstate -C mlx5_1 -P 1 -D 0 1 arm;
sudo ibportstate -C mlx5_1 -P 1 -D 0 1 active;
sudo ibportstate -C mlx5_0 -P 1 -D 0 1 active
```

4. Start OpenSM bound to the port GUID, which **ibstat** reports as the "Port GUID."

```
opensm -g <guid>
```

5. Check **mst status -v** for InfiniBand ports.

```
root@localhost:/home/nvidia# mst status -v
MST modules:
-----
      MST PCI module is not loaded
      MST PCI configuration module loaded
PCI devices:
-----
DEVICE_TYPE      MST          PCI          RDMA          NET          NUMA
BlueField3(rev:1) /dev/mst/mt41692_pciconf0.1 0006:01:00.1  mlx5_3    net-ibP6s6f1  0
BlueField3(rev:1) /dev/mst/mt41692_pciconf0     0006:01:00.0  mlx5_2    net-ibP6s6f0  0
ConnectX7(rev:0)  /dev/mst/mt4129_pciconf1     0002:03:00.0  mlx5_1    net-ibP2p3s0  0
ConnectX7(rev:0)  /dev/mst/mt4129_pciconf0     0000:03:00.0  mlx5_0    net-ibp3s0   0
```

6. Use the **ib\_write\_bw** command to verify the throughput on **mlx5\_0**.

```
ib_write_bw -d mlx5_0 -p 11000 -F --report_gbits -D 3 -b --run_infinitely 11.11.11.7 &
```

```
                    RDMA_Write Bidirectional BW Test
Dual-port       : OFF          Device         : mlx5_0
Number of qps   : 1            Transport type : IB
Connection type : RC           Using SRQ      : OFF
PCIe relax order: ON
ibv_wr* API     : ON
TX depth        : 128
CQ Moderation   : 1
Mtu             : 4096[B]
Link type       : IB
Max inline data : 64B
rdma_cm QPs     : OFF
Data ex. method : Ethernet
local address: LID 0x81 QPN 0x00047 PSN 0xc2640a RKey 0x1ffffbe VAddr 0x90c8e475b800000
remote address: LID 0x82 QPN 0x00047 PSN 0x3f003e RKey 0x1ffffbe VAddr 0x90bbe8d0c90e000
```

| #bytes | #iterations | BW peak[Gb/sec] | BW average[Gb/sec] | MsgRate[Mpps] |
|--------|-------------|-----------------|--------------------|---------------|
| 65536  | 2211987     | 0.00            | 773.13             | 1.474618      |
| 65536  | 2212022     | 0.00            | 773.14             | 1.474619      |
| 65536  | 2212037     | 0.00            | 773.14             | 1.474620      |
| 65536  | 2212017     | 0.00            | 773.15             | 1.474642      |
| 65536  | 2212068     | 0.00            | 773.14             | 1.474652      |
| 65536  | 2212033     | 0.00            | 773.15             | 1.474667      |
| 65536  | 2212015     | 0.00            | 773.15             | 1.474649      |
| 65536  | 2212026     | 0.00            | 773.15             | 1.474648      |
| 65536  | 2212028     | 0.00            | 773.16             | 1.474676      |
| 65536  | 2212060     | 0.00            | 773.17             | 1.474699      |
| 65536  | 2212031     | 0.00            | 773.15             | 1.474673      |
| 65536  | 2211979     | 0.00            | 773.13             | 1.474633      |
| 65536  | 2212011     | 0.00            | 773.14             | 1.474644      |
| 65536  | 2212029     | 0.00            | 773.15             | 1.474666      |
| 65536  | 2211979     | 0.00            | 773.14             | 1.474642      |
| 65536  | 2211976     | 0.00            | 773.14             | 1.474641      |
| 65536  | 2212063     | 0.00            | 773.18             | 1.474632      |
| 65536  | 2212179     | 0.00            | 773.20             | 1.474759      |
| 65536  | 2212163     | 0.00            | 773.20             | 1.474767      |
| 65536  | 2212175     | 0.00            | 773.20             | 1.474759      |
| 65536  | 2212176     | 0.00            | 773.20             | 1.474761      |
| 65536  | 2212131     | 0.00            | 773.19             | 1.474739      |
| 65536  | 2212066     | 0.00            | 773.17             | 1.474784      |
| 65536  | 2212179     | 0.00            | 773.21             | 1.474775      |
| 65536  | 2212178     | 0.00            | 773.20             | 1.474758      |
| 65536  | 2212173     | 0.00            | 773.20             | 1.474758      |
| 65536  | 2212163     | 0.00            | 773.19             | 1.474748      |

7. Use the **ib\_write\_bw** command to verify the throughput on **mlx5\_1**.

```
ib_write_bw -d mlx5_1 -p 11000 -F --report_gbits -D 3 -b --run_infinitely &
```

```
root@localhost:/home/nvidia# ib_write_bw -d mlx5_1 -p 11000 -F --report_gbits -D 3 -b --run_infinitely &
[1] 4924
root@localhost:/home/nvidia: WARNING: Rx peak won't be measured in this run.

=====
* Waiting for client to connect...
=====

                               RDMA_Write Bidirectional BW Test
Dual-port      : OFF          Device       : mlx5_1
Number of qps  : 1           Transport type : IB
Connection type: RC          Using SRQ    : OFF
PCIe relax order: ON
ibv_wr* API   : ON
TX depth      : 128
CQ Moderation : 1
Mtu           : 4096(B)
Link type     : IB
Max inline data: 8(B)
rdma_cm QPs   : OFF
Data ex. method: Ethernet

Local address: LID 0x82 QPN 0x00047 PSN 0x3f803a RKey 0xffffffff VAddr 0x0000000000000000
remote address: LID 0x91 QPN 0x00047 PSN 0xc2000a RKey 0xffffffff VAddr 0x0000004700000000

#bytes    #iterations    Rx peak[Gb/sec]    Rx average[Gb/sec]    MsgRate[Mpps]
65536    2212967        0.00              773.33              1.474625
65536    2212916        0.00              773.34              1.474639
65536    2212918        0.00              773.34              1.474659
65536    2212918        0.00              773.35              1.474659
65536    2212931        0.00              773.35              1.474664
65536    2212929        0.00              773.35              1.474663
65536    2212936        0.00              773.35              1.474665
65536    2212924        0.00              773.35              1.474659
65536    2212917        0.00              773.35              1.474648
65536    2212978        0.00              773.37              1.474709
65536    2212947        0.00              773.35              1.474649
65536    2212981        0.00              773.33              1.474623
65536    2212914        0.00              773.35              1.474663
65536    2212917        0.00              773.35              1.474678
65536    2212989        0.00              773.34              1.474645
65536    2212997        0.00              773.34              1.474655
65536    2212987        0.00              773.34              1.474655
65536    2212177        0.00              773.30              1.474767
65536    2212168        0.00              773.38              1.474758
65536    2212174        0.00              773.38              1.474754
65536    2212179        0.00              773.38              1.474778
65536    2212931        0.00              773.37              1.474749
65536    2212955        0.00              773.37              1.474708
```

8. Verify the PCIe link speed and width with **lspci**, using the same commands as in the Ethernet test section:

```
lspci -vvvs 0002:03:00.0 | grep -i "lnk\|dev"
lspci -vvvs 0000:03:00.0 | grep -i "lnk\|dev"
```

9. Verify the temperature sensors using the same command as in the Ethernet test section:

```
sensors|grep -i mlx5 -A2
```

10. Use **mlxlink** to check the PCIe and cable links for FOM and errors, using the same commands as in the Ethernet test section:

```
sudo mlxlink -d /dev/mst/mt4129_pciconf0 -c -e -m
sudo mlxlink -d /dev/mst/mt4129_pciconf1 -c -e -m
```

#### Pass or Fail Criteria

Ensure that the InfiniBand link throughput, error counters, and temperature sensor readings are all within the expected ranges.
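To apply the throughput criterion the same way on every run, the data rows printed by ib_write_bw can be checked mechanically. The following is a minimal POSIX shell and awk sketch, not part of any NVIDIA tool. It assumes the five-column data rows shown in step 7 (message size, iterations, peak, average in Gb/sec, message rate), and `MIN_AVG_GBPS` is a hypothetical placeholder threshold that must be set to the value expected for your link configuration.

```shell
#!/bin/sh
# Minimum acceptable average bandwidth in Gb/sec (hypothetical placeholder).
MIN_AVG_GBPS=${MIN_AVG_GBPS:-700}

# Read ib_write_bw output on stdin and check the average bandwidth of every
# data row. Data rows are assumed to be five numeric columns:
#   #bytes  #iterations  peak[Gb/sec]  average[Gb/sec]  MsgRate[Mpps]
# Prints "rows=<n> min_avg=<x>"; returns 1 if any row is below MIN_AVG_GBPS,
# 2 if no data rows were found, and 0 otherwise.
check_ib_bw() {
  awk -v min="$MIN_AVG_GBPS" '
    NF == 5 && $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ {
      avg = $4 + 0
      rows++
      if (rows == 1 || avg < lowest) lowest = avg
    }
    END {
      if (rows == 0) { print "rows=0"; exit 2 }
      printf "rows=%d min_avg=%.2f\n", rows, lowest
      exit (lowest < min + 0 ? 1 : 0)
    }
  '
}
```

Because `--run_infinitely` never exits on its own, capture the output to a file (for example, by piping it through `tee bw.log`), stop the test, and then run `check_ib_bw < bw.log`.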

## 3.8 BMC

This section details the BMC hardware testing and procedures for the GB200 and GB300 NVL systems.

### 3.8.1 BMC FRU Write

#### Purpose

Verify that the BMC can access and program the FRU EEPROMs in the Bianca compute tray.

#### Test Procedure

Use the following steps to program a FRU EEPROM and verify that the write is successful.

1. Copy the <fru.bin> file to the BMC root directory using SSH (for example, with scp) or SFTP.
2. In the commands below, replace <fru.bin> with the FRU binary file you want to program.
3. Replace <i2c\_bus\_nr> with the I2C bus number of the FRU you want to program. See Table 3-9 for the I2C bus number and I2C address of each FRU.
4. Replace <i2c\_addr> with the I2C address of the FRU you want to program (see Table 3-9). In the sysfs path, write the address as two hexadecimal digits without the 0x prefix; for example, address 0x50 on bus 1 gives the device directory 1-0050.
5. Write the EEPROM contents with the following command:  
`dd if=<fru.bin> of=/sys/class/i2c-dev/i2c-<i2c_bus_nr>/device/<i2c_bus_nr>-00<i2c_addr>/eeprom`
6. Verify that the write was successful with the command:  
`i2cdump -f -y <i2c_bus_nr> <i2c_addr>`

**Table 3-9. List of FRU EEPROMs in GB200 and GB300 Bianca Compute Tray**

| #  | FRU Name                        | Device   | I2C Logical Bus (BMC) | I2C Address | Device Type |
|----|---------------------------------|----------|-----------------------|-------------|-------------|
| 1  | Bianca Board1 NVLink Cable3 FRU | AT24C02C | 1                     | 0x55        | 24c02       |
| 2  | Bianca Board1 NVLink Cable2 FRU | AT24C02C | 1                     | 0x54        | 24c02       |
| 3  | Bianca Board1 FRU               | 24AA64   | 1                     | 0x50        | 24c64       |
| 4  | HMC FRU                         | AT24C02D | 2                     | 0x57        | 24c02       |
| 5  | Bianca Board0 NVLink Cable1 FRU | AT24C02C | 2                     | 0x55        | 24c02       |
| 6  | Bianca Board0 NVLink Cable0 FRU | AT24C02C | 2                     | 0x54        | 24c02       |
| 7  | Bianca Board0 FRU               | 24AA64   | 2                     | 0x50        | 24c64       |
| 8  | PDB FRU                         | M24C02   | 6                     | 0x50        | 24c02       |
| 9  | Bianca DC-SCM_1 FRU             | 24AA64   | 9                     | 0x50        | 24c64       |
| 10 | Bianca DC-SCM_0 FRU             | 24AA64   | 9                     | 0x51        | 24c64       |
| 11 | BMC FRU              | CAT34C02    | 10                    | 0x50        | 24c02       |
| 12 | IPEX Bridge 0 FRU    | M24128      | 14                    | 0x55        | 24c128      |
| 13 | E1.s Backplane 0 FRU | M24128      | 14                    | 0x56        | 24c128      |
| 14 | BF3 Board0 FRU       | BR24G128NUX | 14                    | 0x50        | 24c128      |
| 15 | IPEX Bridge 1 FRU    | M24128      | 15                    | 0x55        | 24c128      |
| 16 | E1.s Backplane 1 FRU | M24128      | 15                    | 0x56        | 24c128      |
| 17 | BF3 Board1 FRU       | BR24G128NUX | 15                    | 0x50        | 24c128      |
| 18 | OSFP Board0 FRU      | M24128      | 21                    | 0x52        | 24c128      |
| 19 | CX7 IO Board0 FRU    | P24C04C     | 21                    | 0x50        | 24c64       |
| 20 | Front IO FRU         | M24128      | 23                    | 0x57        | 24c128      |
| 21 | 1G NIC FRU           | M24128      | 25                    | 0x51        | 24c128      |
| 22 | OSFP Board1 FRU      | M24128      | 33                    | 0x52        | 24c128      |
| 23 | CX7 IO Board1 FRU    | P24C04C     | 33                    | 0x50        | 24c64       |
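Steps 5 and 6 can be wrapped in a small helper so that the sysfs path is always built with the correct bus number and a four-digit address. This is a minimal POSIX shell sketch. It assumes the EEPROM is bound to the Linux at24 driver (which exposes the `eeprom` file) and the kernel's usual `<bus>-<4-digit hex address>` I2C device naming; the function names `fru_eeprom_path` and `program_fru` are hypothetical. As an alternative to reading i2cdump output by eye, it reads the EEPROM back through sysfs and compares it byte for byte with the image.

```shell
#!/bin/sh
# Print the at24 sysfs EEPROM path for a FRU on the given bus and address.
# The address may be given as 0x55 or 55 (hexadecimal in both cases).
fru_eeprom_path() {
  bus=$1
  addr=${2#0x}
  addr=${addr#0X}
  # Linux names I2C client devices "<bus>-<address as 4 hex digits>", e.g. 1-0055.
  printf '/sys/class/i2c-dev/i2c-%d/device/%d-%04x/eeprom\n' "$bus" "$bus" "0x$addr"
}

# Write a FRU image to an EEPROM file, then read it back and compare.
# Returns 0 only if the read-back bytes match the image exactly.
program_fru() {
  bin=$1
  eeprom=$2
  size=$(wc -c < "$bin" | tr -d ' ')
  dd if="$bin" of="$eeprom" || return 1
  # Compare only the first $size bytes; the EEPROM may be larger than the image.
  head -c "$size" "$eeprom" | cmp -s - "$bin"
}
```

For example, to program the HMC FRU (bus 2, address 0x57 in Table 3-9): `program_fru /root/fru.bin "$(fru_eeprom_path 2 0x57)"`.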

#### Pass or Fail Criteria

Verify that the BMC can access and write all the FRU EEPROMs listed in Table 3-9.

### 3.8.2 BMC Power Control (STANDBY, RUN, AUX)

#### Purpose

Verify that the BMC firmware can control power to the compute tray. Partners must implement the power-control features that support Standby power control, RUN power control, and AUX power cycling.

#### Test Procedure

The following are the three methods of controlling power to the Bianca compute tray:

- > AUX power cycle (the compute tray equivalent of an AC power cycle) is executed through the BMC AUX command to the PDB (power distribution board). The PDB turns off power to the Bianca compute boards, cutting both RUN power (**12V\_RUN**) and Standby power (**12V\_STBY**). For example, from the BMC, run the following:

```
stbypowerctrl.sh aux_cycle
```

- > DC power cycle is executed through the BMC to turn on or off only the RUN power (**12V\_RUN**). For example, from the BMC run the following commands:

```
powerctrl.sh power_on      # Turn on RUN power
powerctrl.sh power_off     # Turn off RUN power
powerctrl.sh power_status  # Get RUN power status
powerctrl.sh power_cycle   # DC power cycle
```

- > Standby power control is executed through the BMC to turn on or off the STBY power (**12V\_STBY**). For example, from the BMC run the following commands:

```
stbypowerctrl.sh power_on      # Turn on STBY power
stbypowerctrl.sh power_off     # Turn off STBY power
stbypowerctrl.sh power_status  # Get STBY power status
```



**Note:** Depending on the BMC firmware implementation, partners may need to modify these commands for RUN, STBY, and AUX power control.

#### Pass or Fail Criteria

- > Verify that the BMC commands are implemented correctly and that, when executed, they control the compute tray STANDBY and RUN power as expected.
- > Verify that the BMC can successfully execute an AUX power cycle.
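A pass/fail check for RUN power control can poll the status command after each transition instead of relying on fixed delays. This is a minimal POSIX shell sketch built on the example `powerctrl.sh` commands above. The text printed by `power_status` is not specified in this guide, so matching the words "on" and "off" is an assumption to adapt to your BMC firmware, as are the timeouts; the function names are hypothetical.

```shell
#!/bin/sh
# Power-control script to call; defaults to the example BMC script name.
POWERCTRL=${POWERCTRL:-powerctrl.sh}

# Poll "$POWERCTRL power_status" once per second until its output contains
# the whole word $1 (case-insensitive), or until $2 seconds have elapsed.
# ASSUMPTION: the status output contains the word "on" or "off".
wait_for_run_power() {
  want=$1
  timeout_s=${2:-60}
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    if "$POWERCTRL" power_status 2>/dev/null | grep -qiw -- "$want"; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# One checked DC cycle: RUN power off, confirm, RUN power on, confirm.
dc_cycle_checked() {
  "$POWERCTRL" power_off || return 1
  wait_for_run_power off 60 || { echo "RUN power did not report off" >&2; return 1; }
  "$POWERCTRL" power_on || return 1
  wait_for_run_power on 120 || { echo "RUN power did not report on" >&2; return 1; }
  echo "DC cycle OK"
}
```

The same polling pattern applies to `stbypowerctrl.sh power_status` for Standby power.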

## 3.9 System

This section details the system testing and procedures for the GB200 and GB300 NVL systems.

### 3.9.1 System Reboot and Power Cycle Stress Testing

#### Purpose

This test ensures that the GB200 and GB300 NVL72 compute trays do not deviate from their expected behavior across different power-cycle and reboot modes. Refer to Section 3.8.2 “BMC Power Control (STANDBY, RUN, AUX)” for BMC-controlled power cycling.

#### Prerequisites

Ensure that the test setup can log the dmesg and lspci output for every reboot cycle so that faulty output can be detected.

#### Test Procedure

1. Perform AC power cycles on the GB200 and GB300 NVL72 compute trays.
  - a. Ensure that all PCIe devices seen on the host system are enumerated with the expected speed and width using lspci. Examples of PCIe devices to check: GPU, CX-7, CX-8, BF-3, SSD, PCIe switch, 1G NIC, and so on.
  - b. Check dmesg for any unexpected PCIe AER errors.
  - c. Check that the full power sequence has completed.
  - d. Check that all I2C endpoints are present on the bus using an I2C scan.
  - e. Check that the USB path from the HMC to the BMC is up.

2. Perform DC power cycles on the Bianca compute trays.
  - a. Ensure that all PCIe devices seen on the host system are enumerated with the expected speed and width using lspci. Examples of PCIe devices to check: GPU, CX-7, CX-8, BF-3, SSD, PCIe switch, 1G NIC, and so on.
  - b. Check dmesg for any unexpected PCIe AER errors.
  - c. Check that the full power sequence has completed.
  - d. Check that all I2C endpoints are present on the bus using an I2C scan.
  - e. Check that the USB path from the HMC to the BMC is up.

3. Perform warm reboots on the Bianca compute trays.
  - a. Ensure that all PCIe devices seen on the host system are enumerated with the expected speed and width using lspci. Examples of PCIe devices to check: GPU, CX-7, CX-8, BF-3, SSD, PCIe switch, 1G NIC, and so on.
  - b. Check dmesg for any unexpected PCIe AER errors.
  - c. Check that the full power sequence has completed.
  - d. Check that all I2C endpoints are present on the bus using an I2C scan.
  - e. Check that the USB path from the HMC to the BMC is up.
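Checks a and b in each step lend themselves to automation. The POSIX shell sketch below extracts the negotiated speed and width from the `LnkSta:` line of `lspci -vvv` and counts AER-style lines in kernel log text. The function names and the expected values passed to `check_pcie_link` are hypothetical, and the AER patterns (`AER:` and `PCIe Bus Error`) are a heuristic based on common Linux kernel message text, not an exhaustive list.

```shell
#!/bin/sh
# Print "<speed> <width>" (for example "32GT/s x16") from the first "LnkSta:"
# line of `lspci -vvv` output read on stdin. "LnkSta2:" lines are not matched.
lnksta_speed_width() {
  sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\).*Width \(x[0-9]*\).*/\1 \2/p' | head -n 1
}

# Count kernel log lines (read on stdin) that look like PCIe AER reports.
count_aer_lines() {
  grep -ciE 'AER:|PCIe Bus Error' || true
}

# Compare one device's negotiated link with the expected speed and width.
# Usage: check_pcie_link <BDF> <expected speed> <expected width>
check_pcie_link() {
  got=$(lspci -vvv -s "$1" | lnksta_speed_width)
  if [ "$got" = "$2 $3" ]; then
    echo "PASS $1 $got"
  else
    echo "FAIL $1 got '$got', expected '$2 $3'"
    return 1
  fi
}
```

After each cycle, run `check_pcie_link` for every device BDF with its expected values, run `dmesg | count_aer_lines`, and log both results with the cycle number.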

#### Pass or Fail Criteria

- > For each AC cycle, there are no issues with PCIe, InfiniBand, power sequencing, I2C bus scan, and USB checks.
- > For each DC cycle, there are no issues with PCIe, InfiniBand, power sequencing, I2C bus scan, and USB checks.
- > For each Warm Reboot cycle, there are no issues with PCIe, InfiniBand, power sequencing, I2C bus scan, and USB checks.

### 3.9.2 Wide Area Test

#### Purpose

The wide area test (WAT) is a series of 24-hour stress tests on the fully assembled compute tray. WAT is typically run on multiple units to ensure that the hardware functions correctly when scaled to a larger number of units.

#### Prerequisites

Run WAT on fully featured hardware samples, that is, with full-featured firmware on a fully populated compute tray. The firmware does not need to be optimized for this test.

#### Test Procedure

Partners can use the NVQual package and the Partner Manufacturing Diagnostics, selecting the corresponding tests and looping them for 24 hours.

Refer to the *NVQual User's Guide* (NVOnline: 1119931) for instructions on creating the test loops for WAT.

**Table 3-10. Compute Tray WAT**

| <b>ID</b> | <b>Subcategory</b>     | <b>Test Item Description</b>                                                                                                     | <b>Expected Results</b>            |
|-----------|------------------------|----------------------------------------------------------------------------------------------------------------------------------|------------------------------------|
| SY.14     | Tray CLink WAT         | Run multiple loops of NVQual CLink test for 24 hours (NVQual Test #8)                                                            | Compute tray passes without errors |
| SY.15     | Tray NVLink WAT        | Run multiple loops of NVQual NVLink test for 24 hours (NVQual Test #9)                                                           | Compute tray passes without errors |
| SY.16     | Tray PCIe WAT          | Run multiple loops of NVQual PCIe tests for 24 hours (NVQual Test #16 - #21)                                                     | Compute tray passes without errors |
| SY.17     | Tray Thermal/Power WAT | Run extended continuous thermal/power stress for 24 hours (run the power/thermal stress test continuously). Stress at the module maximum specification. | Compute tray passes without errors |

#### Pass or Fail Criteria

For each WAT test, partners should verify that the NVQual tests pass without failures. Partners are not required to submit the WAT test logs back to NVIDIA.

### 3.9.3 Reboot WAT

#### Purpose

The wide area test (WAT) is a series of 24-hour stress tests on the fully assembled compute tray. WAT is typically run on multiple units to ensure that the hardware functions correctly when scaled to a larger number of units.

#### Prerequisites

Run WAT on fully featured hardware samples, that is, with full-featured firmware on a fully populated compute tray. The firmware does not need to be optimized for this test.

#### Test Procedure

Partners should run each reboot stress test for 1000 cycles to ensure that the system reboots reliably on every cycle.

**Table 3-11. Reboot WAT**

| <b>ID</b> | <b>Subcategory</b>      | <b>Test Item Description</b>                                                                                       | <b>Expected Results</b>                                                           |
|-----------|-------------------------|--------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| SY.14     | Warm reboot stress test | 1000x Warm reboots, check for reliable power control and ensure OS, PCIe, MCTP, SSIF and FPGA stability each cycle | Ensure reliable OS boot and constant PCIe and MCTP enumeration and FPGA stability |
| SY.15     | AC reboot stress test   | 1000x AC cycles by completely removing power from tray, check for reliable power control and ensure OS, PCIe, MCTP, SSIF and FPGA stability each cycle | Ensure reliable OS boot and constant PCIe and MCTP enumeration and FPGA stability |
| SY.16     | DC reboot stress test   | 1000x DC cycles (run power and standby power from BMC), check for reliable power control and ensure OS, PCIe, MCTP, SSIF and FPGA stability each cycle | Ensure reliable OS boot and constant PCIe and MCTP enumeration and FPGA stability |
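For the 1000x warm reboot test (SY.14), the host can drive its own loop by keeping a cycle counter on persistent storage. The POSIX shell sketch below assumes that `reboot_wat_step` is launched once per boot (for example, from a boot-time service, not shown), that `STATE_FILE` is on a persistent filesystem, and that `run_checks` is replaced with the per-cycle PCIe, MCTP, SSIF, and FPGA checks; all names and the default path are hypothetical. AC and DC cycles (SY.15, SY.16) remove host power, so their loop has to be driven from outside the host, for example from the BMC with the commands in Section 3.8.2 or from an external controller.

```shell
#!/bin/sh
# Hypothetical defaults; override through the environment.
STATE_FILE=${STATE_FILE:-/var/lib/reboot-wat/count}
TARGET_CYCLES=${TARGET_CYCLES:-1000}
REBOOT_CMD=${REBOOT_CMD:-reboot}

# Placeholder for the per-cycle checks (lspci speed/width, dmesg AER,
# MCTP/SSIF/FPGA health). Must return non-zero on any failure.
run_checks() {
  return 0
}

# Run once per boot: verify the boot that just completed, then either stop
# (target reached or checks failed) or record the next cycle and reboot.
reboot_wat_step() {
  count=0
  [ -f "$STATE_FILE" ] && count=$(cat "$STATE_FILE")
  if ! run_checks; then
    echo "cycle $count: checks FAILED; stopping" >&2
    return 1
  fi
  if [ "$count" -ge "$TARGET_CYCLES" ]; then
    echo "completed $count cycles"
    return 0
  fi
  count=$((count + 1))
  mkdir -p "$(dirname "$STATE_FILE")"
  echo "$count" > "$STATE_FILE"
  sync                  # make sure the counter is on disk before rebooting
  echo "starting cycle $count of $TARGET_CYCLES"
  $REBOOT_CMD           # unquoted on purpose so "systemctl reboot" also works
}
```

Stopping on the first failed check keeps the failing boot's logs intact for analysis instead of overwriting them with further cycles.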

### 3.9.4 L10 Partner Diagnostics WAT

#### Purpose

WAT is typically run on multiple units to ensure that the hardware functions correctly when scaled to a larger number of units.

Verify the compute tray functionality and performance by running an extended loop of the L10 partner diagnostics for the compute tray.

#### Prerequisites

Partners must use the latest test fixtures and loopback cables for CX-7, CX-8, and BF-3, as listed in the *GB200 NVL72 Manufacturing Product Engineering and Test Playbook* (NVOnline: 1123054) and the *GB300 NVL72 Mfg Product Engg Test Playbook* (NVOnline: 1131118).

**Table 3-12. BF-3, CX-7, CX-8, and NVLink Loopback Cables and Test Cage**

| Item                             | Supplier           | Part Number                             |
|----------------------------------|--------------------|-----------------------------------------|
| BF-3 Loopback cable <sup>1</sup> | Amphenol           | NAAAFQ-N906 (0.5 m)                     |
| CX-7 Loopback cable <sup>2</sup> | NVIDIA<br>Amphenol | MCP4Y10-N00A-FLT<br>NALLFQ-N906 (0.5 m) |
| CX-8 Loopback cable              | Amphenol           | NALLG3-N906                             |
| Test cage (Mini cable cartridge) | EPD5               | 1A724D100-600-G                         |
| NVLink Loopback cable            | Amphenol           | HS32838-001                             |

**Notes:**

<sup>1</sup>NAAAFQ-N906: 0.5m, QSFP 112G PAM4, 32 AWG cable with FRU EEPROM for IB and Ethernet in CMIS format, compatible with BF3 QSFP Ports.

<sup>2</sup>NALLFQ-N906: 0.5m, OSFP RHS-RHS, 112G PAM4, 32 AWG cable with FRU EEPROM for IB and Ethernet in CMIS format, compatible with CX-7 I/O.

#### Test Procedure

1. Set up the CX-7, CX-8, and BF-3 loopback cables as outlined in the *GB200 NVL72 Manufacturing Product Engineering and Test Playbook* (NVOnline: 1123054) and the *GB300 NVL72 Mfg Product Engg Test Playbook* (NVOnline: 1131118).
  - a. Follow the recommended cabling to minimize cable stress on the OSFP contacts and interference between MTF (manufacturing test fixture) slots.
2. Run the L10 Partner Diagnostics 10 times and ensure that the system passes every iteration.
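The guide does not prescribe how to repeat the diagnostics, so the following is only a minimal POSIX shell sketch of a loop that runs a command a fixed number of times and records the result of each iteration. The function name `run_iterations` is hypothetical, and the diagnostic command in the usage example is a placeholder for your actual L10 partner diagnostics entry point.

```shell
#!/bin/sh
# Run a command a fixed number of times and report pass/fail per iteration.
# Usage: run_iterations <count> <command> [args...]
# Returns 0 only if every iteration exits with status 0.
run_iterations() {
  total=$1
  shift
  passed=0
  i=1
  while [ "$i" -le "$total" ]; do
    if "$@"; then
      rc=0
    else
      rc=$?
    fi
    if [ "$rc" -eq 0 ]; then
      passed=$((passed + 1))
      echo "iteration $i/$total: PASS"
    else
      echo "iteration $i/$total: FAIL (exit status $rc)"
    fi
    i=$((i + 1))
  done
  echo "passed $passed of $total"
  [ "$passed" -eq "$total" ]
}
```

For example, `run_iterations 10 ./l10_partner_diag.sh`, where `l10_partner_diag.sh` is a hypothetical name for the diagnostics launcher. The loop keeps going after a failure so that the summary shows how many of the 10 iterations passed.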

#### Pass or Fail Criteria

All iterations of the L10 partner diagnostics pass.

## 3.10 Environmental, Reliability, and Electromagnetic Compatibility

NVIDIA recognizes the diversity of potential use cases that system partners may encounter. The following test lists are NVIDIA's guidance for areas concerning environmental factors, reliability, and robustness.

Partners should modify these tests, or conduct additional testing, as necessary for their specific use cases, and ensure that all potential scenarios are addressed appropriately.

### 3.10.1 L10 Package Testing

#### Purpose

The purpose of performing package testing is to ensure that the packaged compute tray can withstand the physical stress it may encounter during transportation, storage, and handling.

#### Test Procedure

Partners should test to the ISTA (International Safe Transit Association) industry test standards. Perform the test items in the following table using the test conditions highlighted in the “Test Item Description” column.

**Table 3-13. L10 Packaging Test Summary**

| ID   | Subcategory                                                       | Test Item Description                                                                                                                              | Test Procedure                        |
|------|-------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------|
| PK.1 | Package/Product Visual Inspection and Product Functional PRE-Test | Run functional test on the product using the latest manufacturing diagnostic. Perform visual inspection and record all noted cosmetic observations | Visual inspection and MFG Diagnostics |
| PK.2      | Atmospheric Pre-Conditioning                                       | Packaged product shall be preconditioned to laboratory ambient temperature and humidity for a minimum of 12 hours                                                                                                                                      | ISTA 3B, Block #1                     |
| PK.3      | Atmospheric Conditioning (Required only for non-rigid containers)  | [Hot, Humid] then [Extreme Heat, Moderate RH]<br>[38°C, 85% RH for 72 hours] then [60°C, 30% RH for 6 hours]                                                                                                                                           | ISTA 3B, Block #1                     |
| PK.4      | Tip Test                                                           | Tilt up to a 22-degree angle without letting the package fall. Report the center-of-gravity tilt angle.<br>If the packaged product fails this test, DO NOT let it fall to the floor.<br>Gently return the product to the upright position and note the failure. Continue testing. | ISTA 3B, Block #2                     |
| PK.5      | Shock – Rotational Drop                                            | Drop height 6 in (150 mm) for packaged products weighing 500 lb (230 kg) or more                                                                                                                                                                       | ISTA 3B, Block #5                     |
| PK.6      | Shock – Incline or Horizontal Impact                               | Impact velocity 48 in/sec (1.2 m/sec) minimum or 3 in (76 mm) drops                                                                                                                                                                                    | ISTA 3B, Block #6                     |
| PK.7      | Truck Vibration - Random with top load                             | Overall 0.54 Grms, 120 minutes                                                                                                                                                                                                                         | ISTA 3B, Block #9                     |
| PK.8      | Air Vibration - Random                                             | Combination of High (0.29 Grms), Medium (0.22 Grms) and Low (0.16 Grms) air profile; 120 minutes                                                                                                                                                       | ASTM D4169-23, section 12.4.2.2       |
| PK.9      | Concentrated Impact (Required only for non-rigid containers)       | Free-fall, or guided fall drop                                                                                                                                                                                                                         | ISTA 3B, Block #10                    |
| PK.10     | Fork Lift Handling (Required only for non-rigid containers)        | Flat Push and Rotate Tests, Elevated Push and Pull Tests, Elevated Rotate Tests, and Load Stability Test over a handling course.                                                                                                                       | ISTA 3B, Block #15                    |
| PK.11     | Shock – Rotational Drop                                            | Drop height 6 in (150 mm) for packaged products weighing 500 lb (230 kg) or more                                                                                                                                                                       | ISTA 3B, Block #13                    |
| PK.12     | Shock – Incline or Horizontal Impact                               | Impact velocity 48 in/sec (1.2 m/sec) minimum or 3 in (76 mm) drops                                                                                                                                                                                    | ISTA 3B, Block #14                    |
| PK.13     | Package/Product Visual Inspection and Product Functional POST-Test | Run functional test on the product using the latest manufacturing diagnostic. Perform visual inspection and record all noted cosmetic observations                                                                                                     | Visual inspection and MFG Diagnostics |

#### Pass or Fail Criteria

All test items in Table 3-13 are completed and passed, and the product sustains no damage:

- > Product is damage-free, with no structural damage, including no detached, loose, fractured, or deformed material parts.
- > Product cosmetic areas are not degraded beyond manufacturing or final acceptance criteria. Cosmetic damage is any abnormality that makes the product unacceptable to the customer.
- > No conductive parts (wires, connectors, and so on) are exposed as a result of the testing.
- > The product meets all manufacturing specifications and tolerances after testing.
- > All product electrical and software functions perform to specification.

### 3.10.2 Shock and Vibration Test

#### Purpose

The purpose of shock and vibration testing is to ensure that the compute tray can withstand physical stress during transportation, handling, and operation. The testing simulates the effects of sudden impacts and continuous or repetitive movements to verify the compute tray's durability and functionality.

#### Test Procedure

Perform the test items in the following table using the test conditions highlighted in the "Test Item Description" column.

**Table 3-14. Shock and Vibration Test Summary**

| ID   | Subcategory                      | Test Item Description                                                                                                                                           | Expected Results                                                                               |
|------|----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| SV.1 | Vibration Random (Operating)     | Freq (Hz) / PSD level (G²/Hz) / Slope (dB/Oct)<br>5-10 / - / +12<br>10-50 / 0.0002 / -<br>50-100 / - / -12<br>nominal of 0.1 Grms (X, Y, Z)<br>30 min/axis / OF | No permanent system damage, no parts or connectors dislodged and the system stays operational. |
| SV.2 | Vibration Sinusoidal (Operating) | 5-500-5 Hz, 1.0 oct/min<br>0.1 G, one sweep/axis, (Z) only<br>1 sweep / OF                                                                                      | No permanent system damage, no parts or connectors dislodged and the system stays operational. |
| SV.3 | Vibration Random (Non-Operating) | Freq (Hz) / PSD level (G²/Hz) / Slope (dB/Oct)<br>0.0002 G²/Hz at 2 Hz<br>0.01 G²/Hz at 12 Hz<br>0.01 G²/Hz at 100 Hz<br>0.00001 G²/Hz at 500 Hz<br>nominal of 1.1 Grms<br>15 min, Z axis<br>Bottom, Top, Left, Right, Front, Back / OF | No permanent system damage and no parts or connectors dislodged.                               |
| SV.4 | Vibration Sinusoidal (Non-Operating) | 5-500-5 Hz, 0.25 oct/min<br>0.5 G, one sweep/axis, (Z) only<br>0.25 G, one sweep/axis, (X, Y)<br>1 sweep per axis / OF                                        | No permanent system damage and no parts or connectors dislodged.                               |
| SV.5 | Mechanical Shock (Operating)         | See Table 3-15 for items marked with "x" for required test G level and test setup.<br>Total 18x shocks (half-sine)                                            | No permanent system damage, no parts or connectors dislodged and the system stays operational. |
| SV.6 | Mechanical Shock (Non-Operating)     | See Table 3-16 for items marked with "x" for required test G level and test setup.<br>Total 6x shocks (trapezoidal)                                           | No permanent system damage and no parts or connectors dislodged.                               |
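The nominal Grms values in Table 3-14 follow from the PSD profiles: the overall Grms is the square root of the area under the PSD curve, with each segment between breakpoints treated as a straight line on log-log axes (a constant dB/octave slope). The POSIX shell and awk sketch below, with a hypothetical function name, can be used to cross-check a profile. With the SV.3 breakpoints (2, 12, 100, and 500 Hz) it prints 1.104, consistent with the nominal 1.1 Grms.

```shell
#!/bin/sh
# Compute the overall Grms of a random-vibration profile from PSD breakpoints.
# Input (stdin): one "frequency_Hz PSD_G2_per_Hz" pair per line, frequencies
# strictly ascending, PSD values > 0. Between breakpoints the PSD is treated
# as a straight line on log-log axes (constant dB/octave slope).
grms_from_breakpoints() {
  awk '
    NF == 2 {
      f = $1 + 0; p = $2 + 0
      if (have_prev) {
        # Slope exponent n such that PSD(x) = p0 * (x / f0)^n on [f0, f].
        n = log(p / p0) / log(f / f0)
        if (n + 1 > -1e-9 && n + 1 < 1e-9) {
          area += p0 * f0 * log(f / f0)        # special case n = -1
        } else {
          area += (p * f - p0 * f0) / (n + 1)  # closed-form area of the segment
        }
      }
      f0 = f; p0 = p; have_prev = 1
    }
    END { printf "%.3f\n", sqrt(area) }
  '
}
```

The SV.1 profile is specified with slopes instead of end-point levels. A 12 dB/octave slope changes the PSD by a factor of 10^(12/10) ≈ 15.85 per octave, so the 5 Hz and 100 Hz end points, each one octave from the 0.0002 G²/Hz plateau, are about 1.26e-5 G²/Hz. With those four breakpoints the sketch prints 0.106, in line with the nominal 0.1 Grms.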

**Table 3-15. Mechanical Shock – Half Sine (Operating)**

| Form Factor | System Weight (lbs) | 3G/11 ms | 6G/11 ms | Test Setup                                    |
|-------------|---------------------|----------|----------|-----------------------------------------------|
| 1U/2U       | < 70 lbs            | --       | x        | 3 axes: ±X, ±Y, ±Z<br>3 shocks each direction |
|             | > 70 lbs            | x        | Desired  |                                               |
| 4U          | < 200 lbs           | x        | Desired  |                                               |
|             | > 200 lbs           | x        | --       |                                               |
| 6U          | > 200 lbs           | x        | --       |                                               |

**Table 3-16. Mechanical Shock – Trapezoidal (Non-operating)**

| Form Factor | System Weight (lbs) | 15G/11 ms | 20G/11 ms | 30G/11 ms | 40G/11 ms | Test Setup                                     |
|-------------|---------------------|-----------|-----------|-----------|-----------|------------------------------------------------|
| 1U/2U       | < 70 lbs            | --        | --        | x         | Desired   | 6 sides: Bottom, Top, Left, Right, Front, Back |
|             | > 70 lbs            | --        | x         | Desired   | --        | 6 sides: Bottom, Top, Left, Right, Front, Back |
| 4U          | < 200 lbs           | --        | x         | --        | --        | 6 sides: Bottom, Top, Left, Right, Front, Back |
|             | > 200 lbs           | x         | --        | --        | --        | 3 axes: X, Y, Z<br>Bottom side down            |
| 6U          | > 200 lbs           | x         | --        | --        | --        | 3 axes: X, Y, Z<br>Bottom side down            |
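The selection logic in Tables 3-15 and 3-16 can be captured as a small lookup, which is convenient when scripting a test plan. This is a sketch under two assumptions the tables do not settle: a weight exactly at a threshold (70 lbs or 200 lbs) is treated as the heavier class, and the shock totals are taken from SV.5 (3 axes × 2 directions × 3 shocks = 18) and SV.6 (6 shocks). The table keys and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# (form factor, weight class) -> (level marked "x", level marked "Desired" or None)
HALF_SINE_OPERATING = {  # Table 3-15
    ("1U/2U", "light"): ("6G/11 ms", None),
    ("1U/2U", "heavy"): ("3G/11 ms", "6G/11 ms"),
    ("4U", "light"): ("3G/11 ms", "6G/11 ms"),
    ("4U", "heavy"): ("3G/11 ms", None),
    ("6U", "heavy"): ("3G/11 ms", None),
}
TRAPEZOIDAL_NON_OPERATING = {  # Table 3-16
    ("1U/2U", "light"): ("30G/11 ms", "40G/11 ms"),
    ("1U/2U", "heavy"): ("20G/11 ms", "30G/11 ms"),
    ("4U", "light"): ("20G/11 ms", None),
    ("4U", "heavy"): ("15G/11 ms", None),
    ("6U", "heavy"): ("15G/11 ms", None),
}


@dataclass(frozen=True)
class ShockSpec:
    required: str
    desired: Optional[str]
    total_shocks: int


def weight_class(form_factor: str, weight_lbs: float) -> str:
    # Assumption: a weight equal to the threshold counts as the heavier class.
    threshold = 70.0 if form_factor == "1U/2U" else 200.0
    return "light" if weight_lbs < threshold else "heavy"


def shock_spec(form_factor: str, weight_lbs: float, operating: bool) -> ShockSpec:
    table = HALF_SINE_OPERATING if operating else TRAPEZOIDAL_NON_OPERATING
    key = (form_factor, weight_class(form_factor, weight_lbs))
    if key not in table:
        raise ValueError(f"no table entry for {form_factor} at {weight_lbs} lbs")
    required, desired = table[key]
    # SV.5: 3 axes x 2 directions x 3 shocks; SV.6: 6 shocks in total
    total = 3 * 2 * 3 if operating else 6
    return ShockSpec(required, desired, total)
```

A 6U system under 200 lbs has no entry in either table, so the sketch raises an error instead of guessing a level.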

### Pass or Fail Criteria

Test items in Table 3-14 are completed and passed.

## 3.10.3 Environmental Reliability Test

### Purpose

The purpose of the environmental reliability test is to evaluate the hardware's ability to operate effectively and reliably under various environmental conditions, such as temperature extremes, humidity, and altitude.

### Test Procedure

Perform the test items in the following table using the test conditions outlined in the "Test Item Description" column.

**Table 3-17. Reliability Test Summary**

| <b>ID</b> | <b>Subcategory</b>                                            | <b>Test Item Description</b> | <b>Expected Results</b> |
|-----------|---------------------------------------------------------------|------------------------------|-------------------------|
| R.1       | Tray Reliability Pre-Baseline                                 | No device is missing and Partner_MFG_Diag must pass. | The system is stable and passes all tests. |
| R.2       | Tray Reliability Temperature and Humidity (Non-Operating)     | 1. 25°C, 50% RH for 1 hour<br>2. -40°C, no humidity control for 16 hours<br>3. 70°C, 10% RH for 16 hours<br>4. 70°C, 90% RH for 16 hours<br>5. 40°C, 93% RH for 96 hours<br>6. 25°C, 50% RH for 4 hours<br>➢ Max ramp rate: 20°C per hour; 20% RH/hr<br>➢ Wet vs. dry, based on product specification | No permanent system damage or degradation |
| R.3       | Tray Reliability Temperature and Humidity (Operating)         | 1. 25°C, 50% RH for 1 hour<br>2. 0°C, no humidity control for 24 hours<br>3. Max Temp<sup>1</sup> °C, 50% RH for 24 hours<br>4. 31°C, 85% RH for 24 hours<br>5. 25°C, 10% RH for 24 hours<br>➢ Max ramp rate: 20°C per hour; 20% RH/hr | No permanent system damage or degradation |
| R.4       | Tray Reliability Thermal Cycling (Non-Operating)              | Ta: 0°C to 100°C<br>15 min dwell, ramp rate 10°C/min<br>100 cyc / 0F | No permanent system damage |
| R.5       | Tray Reliability Thermal Shock (Non-Operating)                | Ta: -20°C to 85°C<br>30°C/min, 1 hr dwell<br>127 min/cyc<br>100 cyc / 0F | No permanent system damage |
| R.6       | Tray Reliability Connector Durability                         | 50 cycles / 0F | No permanent system damage or degradation |
| R.7       | Tray Reliability Hard Boot (AC Power ON/OFF) low temperature  | 1. 0°C for 500 boot cycles<br>2. 500 cyc / 2F<br>3. 50% R.H. | No boot failures and no permanent system damage or degradation |
| R.8       | Tray Reliability Hard Boot (AC Power ON/OFF) high temperature | 1. Max Temp<sup>1</sup> °C for 500 boot cycles<br>2. 500 cyc / 2F<br>3. 50% R.H. | No boot failures and no permanent system damage or degradation |
| R.9       | Tray Reliability Soft Boot (OS Power ON/OFF) low temperature  | 1. 0°C for 500 boot cycles<br>2. 500 cyc / 3F<br>3. 50% R.H. | No boot failures and no permanent system damage or degradation |
| R.10      | Tray Reliability Soft Boot (OS Power ON/OFF) high temperature | 1. Max Temp<sup>1</sup> °C for 500 boot cycles<br>2. 500 cyc / 3F<br>3. 50% R.H. | No boot failures and no permanent system damage or degradation |
| R.12      | Tray Reliability Four Corner High Temp and High Voltage       | Max Temp<sup>1</sup> °C, +5% voltage, run diagnostics x1 loop | The system functions normally with no failures or damage |
| R.13      | Tray Reliability Four Corner High Temp and Low Voltage        | Max Temp<sup>1</sup> °C, -5% voltage, run diagnostics x1 loop | The system functions normally with no failures or damage |
| R.14      | Tray Reliability Four Corner Low Temp and High Voltage        | 0°C, +5% voltage for 0.5 hours | The system functions normally with no failures or damage |
| R.15      | Tray Reliability Four Corner Low Temp and Low Voltage         | 0°C, -5% voltage for 0.5 hours | The system functions normally with no failures or damage |
| R.16      | Tray Reliability Temperature and Altitude (Non-Operating)     | 1. 25°C for 1 hour<br>2. -40°C, 40,000 ft for 10 hours<br>3. 70°C, 40,000 ft for 10 hours<br>➢ Ramp rate: not to exceed 1°C/min, 50 mbar/min<br>➢ Wet vs. dry, based on product specification | No permanent system damage or degradation |
| R.17      | Tray Reliability Temperature and Altitude (Operating)         | 1. 5°C, 1,000 ft<br>2. 25°C, 10,000 ft<br>3. 5°C, 10,000 ft<br>4. 5°C, 5,000 ft<br>5. 25°C, 5,000 ft<br>6. 30°C, 5,000 ft<br>7. 30°C, 1,000 ft<br>8. Max Temp<sup>1</sup> °C, 1,000 ft<br>➢ 8-hr dwell at each point<br>➢ Ramp rate: not to exceed 1°C/min, 50 mbar/min | No permanent system damage or degradation |
| R.19      | Tray Reliability CMTBF Analysis                               | 1. Telcordia SR-332 Issue 4<br>2. Method I, Case 1<br>3. 25°C and Max Temp<sup>1</sup> °C<br>4. Ground benign operating environment | System CMTBF meets expectations, and any preventable failures are highlighted |
| R.20      | Tray Reliability DMT                                          | Functional test at 35°C for ~60 days.<br>Tested on 4x trays. | No permanent system damage or degradation |
| R.25      | Tray Reliability Post Baseline                                | No device is missing and diag must pass. | The system is stable, passes all tests, and performance is equivalent to the pretest baseline. |
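Chamber time for the cycling tests follows from the ramp rates and dwells in R.4 and R.5. The sketch assumes each cycle is a ramp up, a dwell at the high temperature, a ramp down, and a dwell at the low temperature, with the stated dwell applied at both extremes; under that reading R.5 works out to the 127 min/cyc given in the table. Chamber stabilization and transfer overhead are ignored.

```python
def cycle_minutes(t_low_c, t_high_c, ramp_c_per_min, dwell_min):
    """Length of one thermal cycle: two ramps plus a dwell at each extreme."""
    ramp_min = (t_high_c - t_low_c) / ramp_c_per_min
    return 2 * ramp_min + 2 * dwell_min


def total_hours(minutes_per_cycle, cycles):
    """Total test time in hours for a number of identical cycles."""
    return minutes_per_cycle * cycles / 60.0


if __name__ == "__main__":
    r4 = cycle_minutes(0, 100, 10, 15)   # R.4 thermal cycling
    r5 = cycle_minutes(-20, 85, 30, 60)  # R.5 thermal shock
    print(f"R.4: {r4:.0f} min/cycle, {total_hours(r4, 100):.1f} h for 100 cycles")
    print(f"R.5: {r5:.0f} min/cycle, {total_hours(r5, 100):.1f} h for 100 cycles")
```

R.4 comes out at 50 min per cycle (about 83 hours for 100 cycles) and R.5 at 127 min per cycle (about 212 hours for 100 cycles).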

**Note:**

<sup>1</sup>Max Temp is the maximum operating or non-operating temperature, as applicable, per the product specification.
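To plan the altitude steps in R.16 and R.17, it helps to convert altitude to chamber pressure and check it against the 50 mbar/min limit. The sketch below uses the International Standard Atmosphere (ISA): the linear-lapse troposphere up to 11 km and the isothermal layer above it, which matters because 40,000 ft (about 12.2 km) lies above 11 km. Treating the listed altitudes as ISA pressure altitudes and starting the ramp from sea level are assumptions.

```python
import math

P0_MBAR = 1013.25        # ISA sea-level pressure (1 mbar = 1 hPa)
T0_K = 288.15            # ISA sea-level temperature
LAPSE_K_PER_M = 0.0065   # ISA troposphere lapse rate
G0 = 9.80665             # standard gravity, m/s^2
R_AIR = 287.053          # specific gas constant of dry air, J/(kg*K)
FT_TO_M = 0.3048
TROPOPAUSE_M = 11_000.0


def isa_pressure_mbar(altitude_ft):
    """ISA static pressure at a geopotential altitude below 20 km."""
    h = altitude_ft * FT_TO_M
    exponent = G0 / (R_AIR * LAPSE_K_PER_M)  # about 5.256
    if h <= TROPOPAUSE_M:
        return P0_MBAR * (1.0 - LAPSE_K_PER_M * h / T0_K) ** exponent
    # Between 11 km and 20 km the ISA temperature is constant at 216.65 K
    t11 = T0_K - LAPSE_K_PER_M * TROPOPAUSE_M
    p11 = P0_MBAR * (t11 / T0_K) ** exponent
    return p11 * math.exp(-G0 * (h - TROPOPAUSE_M) / (R_AIR * t11))


def min_ramp_minutes(from_ft, to_ft, limit_mbar_per_min=50.0):
    """Shortest pressure ramp between two altitudes at the given rate limit."""
    dp = abs(isa_pressure_mbar(to_ft) - isa_pressure_mbar(from_ft))
    return dp / limit_mbar_per_min


if __name__ == "__main__":
    for ft in (1_000, 5_000, 10_000, 40_000):
        print(f"{ft:>6} ft: {isa_pressure_mbar(ft):7.1f} mbar")
    print(f"Sea level to 40,000 ft: {min_ramp_minutes(0, 40_000):.1f} min minimum")
```

40,000 ft comes out near 188 mbar, so the pressure ramp from sea level needs at least about 16.5 minutes. In R.16 the temperature ramp from 25°C to -40°C at 1°C/min (65 minutes) is the longer constraint.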

### Pass or Fail Criteria

The test items in Table 3-17 are completed and passed.
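Two standard reliability relations are useful when reviewing R.19 and R.20; the sketch below shows them and does not replace the Telcordia SR-332 procedure. For R.19, a parts-count style prediction treats the tray as a series system, so part failure rates in FIT (failures per 10⁹ hours) add and MTBF = 10⁹ / total FIT; the environment and stress factors that SR-332 applies are omitted here. For R.20, if the DMT ends with zero failures, the one-sided lower confidence bound on MTBF under an exponential (constant failure rate) assumption is T / (-ln(1 - CL)), where T is the accumulated unit-hours. The FIT values in the example are placeholders, not data.

```python
import math


def series_mtbf_hours(parts):
    """MTBF of a series system from (quantity, FIT per part) pairs."""
    total_fit = sum(qty * fit for qty, fit in parts)
    return 1e9 / total_fit


def zero_failure_mtbf_bound(unit_hours, confidence):
    """Lower confidence bound on MTBF when a test ends with no failures.

    This is the chi-square bound 2T / chi2(confidence, 2 dof) with r = 0,
    which simplifies to T / -ln(1 - confidence).
    """
    return unit_hours / -math.log(1.0 - confidence)


if __name__ == "__main__":
    # Placeholder FIT values for illustration only
    example_parts = [(2, 150.0), (1, 300.0)]
    print(f"Predicted MTBF: {series_mtbf_hours(example_parts):,.0f} h")
    # R.20: 4 trays for about 60 days at 24 h/day
    t = 4 * 60 * 24
    print(f"Demonstrated MTBF (90% CL, 0 failures): {zero_failure_mtbf_bound(t, 0.90):,.0f} h")
```

With 4 trays for 60 days (5,760 unit-hours) and no failures, the bound at 90% confidence is about 2,500 hours.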

## 3.10.4 Electromagnetic Compatibility

### Purpose

The purpose of electromagnetic compatibility (EMC) testing for server products is to ensure that the servers operate reliably without causing or being affected by electromagnetic interference in their intended environment.

Partners must perform EMC testing according to their intended regulatory standards for electromagnetic emissions and immunity. The following tests can be used for partner reference.

### Test Procedure

Perform the test items in the following table according to the standards listed in the “Reference Standards” column.

**Table 3-18. EMC Tests Summary**

| <b>ID</b> | <b>Subcategory</b>                                       | <b>Test Item Description</b> | <b>Reference Standards</b> |
|-----------|----------------------------------------------------------|------------------------------|----------------------------|
| EM.1      | Radiated emissions per standard EN55032                  | 6 dB under the Class A limit line | EN55032 / FCC CFR 47 Part 15 Subpart B |
| EM.2      | Conducted emissions                                      | 6 dB under the Class A limit line | EN55032 / FCC CFR 47 Part 15 Subpart B |
| EM.3      | Harmonic current emission                                | Limits for harmonic current emissions (equipment input current $\leq 16$ A per phase) | EN 61000-3-2 or<br>EN 61000-3-12 ( $> 16$ A and $\leq 75$ A per phase) |
| EM.4      | Voltage fluctuations and flicker                         | Limitation of voltage fluctuations and flicker in low-voltage supply systems for equipment with rated current $\leq 16$ A | EN 61000-3-3 or<br>EN 61000-3-11 ( $> 16$ A and $\leq 75$ A per phase) |
| EM.5      | Electrostatic Discharge                                  | $\pm 8$ kV contact discharge, $\pm 15$ kV air discharge<br><br>Acceptance: the device may malfunction but must recover automatically (without user interaction) and then continue to operate normally. | EN61000-4-2 |
| EM.6      | Radiated Radio Frequency, Electric Field Immunity        | 1 kHz 80% AM modulation, radiated electromagnetic field of 3 V/m<br><br>➢ 80 MHz to 1 GHz<br>➢ 1800 MHz ( $\pm 1$ %)<br>➢ 2600 MHz ( $\pm 1$ %)<br>➢ 3500 MHz ( $\pm 1$ %)<br>➢ 5000 MHz ( $\pm 1$ %) | EN61000-4-3 |
| EM.7      | Electrical Fast Transient Burst                          | $\pm 2$ kV mains<br>$\pm 1$ kV signal ports with cables $> 3$ m | EN 61000-4-4 |
| EM.8      | Surge                                                    | 1.2/50 (8/20) $\mu$s waveform<br><br>➢ $\pm 1.2$ kV differential<br>➢ $\pm 2.2$ kV common | EN 61000-4-5 |
| EM.9      | Conducted Disturbances Induced by Radio-Frequency Fields | 1 kHz 80% AM modulation<br><br>➢ 150 kHz to 10 MHz injected level of 3 Vrms<br>➢ 10 MHz to 30 MHz injected level of 3 to 1 Vrms<br>➢ 30 MHz to 80 MHz injected level of 1 Vrms | EN61000-4-6 |
| EM.10     | Magnetic Fields                                          | Magnetic field of 3 A/m at 50 Hz | EN61000-4-8 |
| EM.12     | Tray Safety Test                                         | CB and UL/cTUVus | IEC 62368-1 ED2 +ED3,<br>UL62368-1 ED3 |
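EM.1 and EM.2 require a 6 dB margin below the Class A limit rather than compliance with the limit itself. A minimal sketch of that margin check over a list of measured points follows. The limit function encodes the EN 55032 / CISPR 32 Class A radiated quasi-peak limits at 10 m (40 dBµV/m from 30 MHz to 230 MHz, 47 dBµV/m from 230 MHz to 1 GHz) as an assumption to confirm against the edition in use; the measured points are the partner's own scan results.

```python
def class_a_radiated_qp_10m(freq_mhz):
    """Assumed EN 55032 Class A quasi-peak limit at 10 m, in dBuV/m."""
    if 30.0 <= freq_mhz <= 230.0:
        return 40.0
    if 230.0 < freq_mhz <= 1000.0:
        return 47.0
    raise ValueError("this sketch only covers 30 MHz to 1 GHz")


def margin_violations(points, limit_fn, required_margin_db=6.0):
    """Return (freq, level, margin) for points closer to the limit than required.

    points: iterable of (frequency in MHz, measured level in dBuV/m).
    limit_fn: maps a frequency in MHz to the applicable limit, same units.
    """
    violations = []
    for freq_mhz, level_db in points:
        margin_db = limit_fn(freq_mhz) - level_db
        if margin_db < required_margin_db:
            violations.append((freq_mhz, level_db, margin_db))
    return violations
```

Passing the limit as a function keeps the margin logic unchanged when a conducted-emissions limit (EM.2) or an above-1 GHz limit is swapped in.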

### Pass or Fail Criteria

The test items in Table 3-18, or those required by the partner-selected regulatory standards, are completed and passed.

---

# Chapter 4. L10 Compute Tray System Software

Partners should review the software collateral listed in Table 4-1 and implement the necessary features before starting compute tray software validation.

Refer to the latest Grace Software Partner Enablement deck (NVOnline: 1093429) for a full list of relevant software collateral.

**Table 4-1. Reference Documentation and Collateral**

| NVOnline ID | Title                                                         | Focus Area           |
|-------------|---------------------------------------------------------------|----------------------|
| 1093429     | Grace Software Partner Enablement                             | All                  |
| 1099753     | Grace Firmware Reference Guide                                | Boot Firmware        |
| 1097474     | Grace UEFI External Architecture Specification                | Boot Firmware        |
| 1110059     | SBIOS Requirements for NVIDIA Grace Server Platforms          | Boot Firmware        |
| 1105630     | Grace Platform SBIOS Build Tools and UEFI Reference Code      | Boot Firmware        |
| 1086146     | Grace System PLDM Product Descriptor Record                   | Boot Firmware        |
| 1094659     | Grace OpenBMC External Architecture Specification             | Baseboard Management |
| 1116996     | GB200 NVL Firmware External Architecture Specification        | Baseboard Management |
| 1110060     | BMC Requirements for NVIDIA Grace Server Platforms            | Baseboard Management |
| 1096064     | NVIDIA Grace Baseboard Redfish Model and Schema Guide         | Baseboard Management |
| 1092300     | NVIDIA Data Center Products Telemetry Catalog                 | Baseboard Management |
| 1030060     | SMBUS Post Box Interface (SMBPBI) for NVIDIA Baseboards       | Baseboard Management |
| 1099558     | Grace Baseboard Firmware Update and Security Guide            | Baseboard Management |
| 1090539     | Grace OpenBMC Reference Code                                  | Baseboard Management |
| 1098624     | Device Ownership Transfer External Architecture Specification | ERoT, DOT, Security  |
| 1105290     | NVIDIA Device Ownership Transfer User Guide                   | ERoT, DOT, Security  |
| 1109481     | NVIDIA SPDM Measurement Block Definition                      | ERoT, DOT, Security  |
| 1109972     | MCTP Vendor Defined Messages for CEC1736                      | ERoT, DOT, Security  |
| 1100855     | NVIDIA Grace RAS Overview                                     | RAS                  |
| 1116117     | NVIDIA Server RAS Catalog                                     | RAS                  |
| 1115302     | NVIDIA CPER Catalog                                 | RAS        |
| 1109712     | Debug and RAS Guide for NVIDIA Data Center Products | RAS        |
| 1116450     | NVIDIA Defined CPER Extension to Arm LibCper        | RAS        |

The definitive deliverable for L10 software validation is the successful completion of the NVSSVT and NVRAS tool runs and the submission of their results. This provides a standardized measure of software quality across partner platforms.

Refer to Section 1.2.2 “NVSST” for more information on running the NVSSVT for software validation.

## 4.1 System Software Validation Guidelines

Partners are strongly encouraged to review the software requirements documentation to formulate their software test plan. The software areas of interest in this section can serve as a basis for partners to formulate their software test plans.

Partners should modify these tests or conduct additional testing as needed for their specific use case. For example, partners can develop test cases to validate areas such as the ones in Table 4-2.

**Table 4-2. System Software Validation Example**

| Category               | Feature to Check |
|------------------------|------------------|
| BMC                    | Check BMC firmware features against NVIDIA BMC architecture specifications<br>Verify CPER log support<br>Verify POST code capture<br>Verify time management, such as RTC and NTP<br>Check user management: LDAP and AD users<br>Verify Device Ownership Transfer (DOT) for Grace CPUs |
| IPMI                   | IPMI 2.0 standard compliance<br>IPMI network management: static IP, DHCP<br>IPMI SDR (Sensor Data Record), System Event Log (SEL)<br>IPMI user management: IPMI users and role-based access control<br>IPMI: RMCP+ support only<br>IPMI SOL (Serial over LAN)<br>IPMI chassis control: power management, PXE boot |
| Inventory check        | Check server hardware, FRU inventory, and firmware and software inventory |
| Firmware security      | TPM function check<br>Verify secure boot support<br>Verify write-protect feature for Host BMC control of the motherboard |
| Firmware update        | Verify motherboard APs and ERoTs are capable of firmware upgrade<br>OOB Host BMC firmware update for all APs (CPU, GPU, FPGA, and so on) and ERoTs<br>Firmware update rollback: retain the previous good images when a firmware update fails |
| SBIOS                  | Check SBIOS features against NVIDIA SBIOS architecture specifications<br>OS installation via USB, PXE, HTTP boot, virtual media |
| Redfish Implementation | Verify Redfish API implementation, such as:<br><ul style="list-style-type: none"> <li>➢ Redfish Root, Redfish Managers, Redfish LogServices</li> <li>➢ UEFI settings control by Redfish</li> <li>➢ System boot by Redfish</li> <li>➢ Remote chassis power management by Redfish</li> </ul> |
| Telemetry              | Check in-band and OOB telemetry implementation against the NVIDIA Telemetry Catalog |
| Server RAS             | Verify features against the NVIDIA Server RAS Catalog<br>Verify features against the NVIDIA CPER Catalog<br>Verify EINJ (error injection) response against the error catalog |
| Benchmarks             | Test the system against performance benchmark suites:<br><ul style="list-style-type: none"> <li>➢ nvbandwidth</li> <li>➢ NCCL</li> <li>➢ Fused Multiply-Add (FMA) for Grace CPU</li> <li>➢ DGEMM</li> <li>➢ Multichase</li> <li>➢ OpenSSL</li> <li>➢ cublasMatmulBench</li> <li>➢ IdleLatency</li> <li>➢ L1L3Latency, and so on</li> </ul> |
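For the Redfish items in Table 4-2, a first automated check is that the service root at `/redfish/v1/` (defined by the DMTF Redfish specification) links to the collections the test plan exercises. The sketch uses only the Python standard library with HTTP basic authentication; the chosen collection names, host, and credentials are assumptions to adapt to the platform, and the check does not replace the NVSSVT.

```python
import base64
import json
import ssl
import urllib.request

# Assumed set of top-level links to require; extend to match the test plan
REQUIRED_COLLECTIONS = ("Systems", "Chassis", "Managers", "UpdateService")


def missing_collections(service_root, required=REQUIRED_COLLECTIONS):
    """Return required top-level links that are absent from a Redfish service root."""
    return [name for name in required
            if not isinstance(service_root.get(name), dict)
            or "@odata.id" not in service_root[name]]


def fetch_redfish(bmc_host, path, user, password, verify_tls=True):
    """GET a Redfish resource over HTTPS with basic auth and return parsed JSON."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"https://{bmc_host}{path}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Lab BMCs often use self-signed certificates; keep verification on elsewhere.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx, timeout=30) as resp:
        return json.load(resp)
```

For example, `missing_collections(fetch_redfish("bmc.example.com", "/redfish/v1/", "admin", "password", verify_tls=False))` returns an empty list when every required link is present; the host name and credentials are placeholders. Keeping the check as a pure function over the parsed JSON lets the same logic run against saved responses.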

# Chapter 5. L11 Server Rack Hardware Validation

This chapter describes the hardware validation items of the L11 server rack.

## 5.1 Rack Power

This section details the rack power testing and procedures for the GB200 and GB300 NVL systems.

Partners must work with their facility and lab safety personnel when performing rack-level power validation and review the following points before conducting measurements.



**Caution:** Working with 3-phase AC power can be extremely dangerous. To ensure safety, adhere to the following guidelines:

- **Qualified Personnel Only:** Ensure that all test setups are verified by a qualified electrician before proceeding.
- **Personal Protective Equipment (PPE):** When necessary, wear appropriate PPE, including insulated gloves, safety glasses, and protective clothing.
- **Clear Area:** Keep the test area clear of unnecessary personnel and equipment. Ensure that all non-essential personnel are at a safe distance.
- **Emergency Procedures:** Familiarize yourself with emergency procedures and ensure that emergency shutdown mechanisms are easily accessible.
- **Proper Grounding:** Verify that all equipment is properly grounded to prevent electrical shock.
- **Signage and Barriers:** Use clear signage and physical barriers to mark high-voltage areas and restrict access.
- **Double-Check Connections:** Before powering up, double-check all connections and ensure that they are secure and correctly configured.
- **Stay Alert:** Always stay alert and focused when working with high-voltage equipment. Avoid distractions and never work alone.

## 5.1.1 Rack Power Startup and Shutdown

### Purpose

Verify that the server rack power startup and shutdown behavior is within the bus bar power specification.

### Prerequisite

Partners can use test load sleds or electronic loads to draw power from the power shelf, instead of assembling the full server rack system.

### Test conditions

Test under the following input voltage conditions: 200, 208, 220, 240, 277, and 305 VAC.

Test with the following redundancy configurations:

- All PSUs operational (N+1 per shelf, N+1 per rack)
- PSUs in non-redundant configuration depending on the rack redundancy model (N+0)

### Test Procedure

For each test condition, follow these steps:

1. Turn on all PSU AC inputs at the same time.
2. If the rack uses an N+N redundancy model, stagger the A-feed and B-feed group power-on with a 1-second, 10-second, 30-second, or 60-second delay between feed groups.
3. Turn on the whips one by one, with at least a 10-second delay between whips.
4. Measure and record the following parameters for each test condition:
  - a. Startup and shutdown sequence
  - b. DC bus bar voltage
  - c. Vshare bus for power shelves
  - d. AC input current, AC input voltage
  - e. PSC (power shelf controller) event logs
  - f. Power shelf LED states
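The test conditions and steps above multiply into a sizable run list. A sketch that enumerates it, assuming every input voltage is combined with both redundancy configurations and, for N+N racks, with each of the four feed-group delays; the dictionary field names are illustrative.

```python
from itertools import product

INPUT_VAC = (200, 208, 220, 240, 277, 305)
REDUNDANCY = ("N+1 per shelf, N+1 per rack", "N+0")
FEED_DELAYS_S = (1, 10, 30, 60)  # A-feed/B-feed stagger, N+N racks only

# Parameters to capture for every run (step 4)
RECORDED = (
    "startup/shutdown sequence",
    "DC bus bar voltage",
    "Vshare bus",
    "AC input current and voltage",
    "PSC event logs",
    "power shelf LED states",
)


def startup_runs(n_plus_n: bool = True):
    """List every (voltage, redundancy, feed delay) combination to execute."""
    delays = FEED_DELAYS_S if n_plus_n else (None,)
    return [
        {"input_vac": vac, "redundancy": red, "feed_delay_s": delay,
         "record": RECORDED}
        for vac, red, delay in product(INPUT_VAC, REDUNDANCY, delays)
    ]


if __name__ == "__main__":
    runs = startup_runs()
    print(f"{len(runs)} runs for an N+N rack")
```

For an N+N rack this gives 6 × 2 × 4 = 48 runs, and 12 runs when there is no feed-group stagger to vary.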

### Pass or Fail Criteria

- Bus bar voltage is always within specification after the first power shelf initiates start-up.
- Ramp-up time (10% to 90%) is 60 ms ±10% (54 ms to 66 ms), with no overshoot or ringing.
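The ramp criterion can be evaluated directly from a captured bus bar waveform. The sketch finds the 10% and 90% crossings of the settled voltage by linear interpolation and checks the 54 ms to 66 ms window. It assumes clean, time-ordered samples (filter scope noise first) and treats "no overshoot" as a configurable tolerance above the settled value, defaulting to zero. The 50 V settled voltage and the linear ramp in the example are synthetic placeholders.

```python
def crossing_time(times_s, volts, level):
    """First time the waveform rises through `level`, by linear interpolation."""
    for i in range(1, len(volts)):
        v0, v1 = volts[i - 1], volts[i]
        if v0 < level <= v1:
            t0, t1 = times_s[i - 1], times_s[i]
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError(f"waveform never rises through {level:.3f} V")


def check_ramp(times_s, volts, v_settled, nominal_s=0.060, tol=0.10,
               overshoot_frac=0.0):
    """Return (passed, rise_time_s, overshoot_v) for the 10%-90% ramp criterion."""
    rise_s = (crossing_time(times_s, volts, 0.9 * v_settled)
              - crossing_time(times_s, volts, 0.1 * v_settled))
    overshoot_v = max(volts) - v_settled
    passed = (abs(rise_s - nominal_s) <= tol * nominal_s
              and overshoot_v <= overshoot_frac * v_settled)
    return passed, rise_s, overshoot_v


if __name__ == "__main__":
    # Synthetic linear ramp to 50 V over 75 ms, sampled every 0.1 ms
    t = [k * 1e-4 for k in range(1001)]
    v = [min(50.0, 50.0 * tk / 0.075) for tk in t]
    ok, rise, over = check_ramp(t, v, 50.0)
    print(f"pass={ok}, rise={rise * 1e3:.1f} ms, overshoot={over:.2f} V")
```

A linear ramp spends 80% of its duration between the 10% and 90% points, so the 75 ms synthetic ramp yields a 60 ms rise time and passes.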

## 5.1.2 Rack Power Startup and Shutdown without PSC

### Purpose

Verify that the server rack's power startup and shutdown behavior is within the bus bar power specification when one of the power shelves does not have a PSC.

### Prerequisite

Partners can use test load sleds or electronic loads to draw power from the power shelf, instead of assembling the full server rack system.

### Test conditions

Test under the following input voltage conditions: 200, 208, 220, 240, 277, and 305 VAC.

Test with the following redundancy configurations:

- All PSUs operational (N+1 per shelf, N+1 per rack)
- PSUs in non-redundant configuration depending on the rack redundancy model (N+0)

### Test Procedure

For each test condition, follow these steps:

1. Turn on all PSU AC inputs at the same time.
2. If the rack uses an N+N redundancy model, stagger the A-feed and B-feed group power-on, with no specific delay between feed groups.
3. Turn on the whips one by one, with at least a 10-second delay between whips.
4. Measure and record the following parameters for each test condition:
  - a. Startup and shutdown sequence
  - b. DC bus bar voltage
  - c. Vshare bus for power shelves
  - d. AC input current, AC input voltage
  - e. PSC (power shelf controller) event logs
  - f. Power shelf LED states

### Pass or Fail Criteria

- Bus bar voltage is always within specification after the first power shelf initiates start-up.
- Ramp-up time (10% to 90%) is 60 ms ±10% (54 ms to 66 ms), with no overshoot or ringing.

## 5.1.3 Rack Power Load Voltage Regulators under Static Load

### Purpose

Verify that bus bar power delivery and power balancing remain stable across static load levels applied with test load sleds.

### Prerequisite

Partners can use test load sleds or electronic loads to draw power from the power shelf, instead of assembling the full server rack system.

### Test Conditions

Test under the following input voltage conditions: 200, 208, 220, 240, 277, and 305 VAC.

Test with the following static load conditions: 10%, 25%, 50%, 75%, and 100%.

### Test Procedure

For each test condition, follow these steps:

1. Turn on all PSU AC inputs at the same time.
2. Using the load sleds or electronic loads, apply each static load condition in turn.
3. Measure and record the following parameters for each test condition:
  - a. DC bus bar voltage
  - b. Vshare bus for power shelves
  - c. AC input current, AC input voltage
  - d. PSC (power shelf controller) event logs
  - e. Power shelf LED status

### Pass or Fail Criteria

Sharing tolerance for PSUs within a common power shelf:

- > +/-5% (tentative) at 40%-100% load
- > +/-7% (tentative) at 20%-30% load
- > +/-10% (tentative) at 10% load

Sharing tolerance for PSUs within a common rack:

- > +/-7% (tentative) at 40%-100% load
- > +/-9% (tentative) at 20%-30% load
- > +/-12% (tentative) at 10% load
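The tolerance bands above can be applied mechanically to per-PSU output current readings. This Python sketch assumes that sharing error is the worst-case deviation of any PSU's current from the group average, expressed as a percentage of that average; this guide does not define the formula, so confirm it against the power shelf specification. Load levels outside the listed bands (for example 35%) are rejected rather than guessed.

```python
from statistics import mean
from typing import List, Sequence, Tuple

# (low %, high %, tolerance %) bands transcribed from the tentative criteria above.
SHELF_BANDS: List[Tuple[float, float, float]] = [(40, 100, 5.0), (20, 30, 7.0), (10, 10, 10.0)]
RACK_BANDS: List[Tuple[float, float, float]] = [(40, 100, 7.0), (20, 30, 9.0), (10, 10, 12.0)]


def tolerance_for(load_pct: float, bands: Sequence[Tuple[float, float, float]]) -> float:
    """Return the sharing tolerance (%) for a load level, or raise if no band covers it."""
    for low, high, tol in bands:
        if low <= load_pct <= high:
            return tol
    raise ValueError(f"no sharing tolerance defined for {load_pct}% load")


def sharing_deviation_pct(currents: Sequence[float]) -> float:
    """Worst-case deviation of any PSU current from the group mean, in % of the mean.

    Assumption: sharing error is referenced to the average current of the group.
    """
    avg = mean(currents)
    return max(abs(i - avg) for i in currents) / avg * 100.0


def sharing_passes(currents: Sequence[float], load_pct: float,
                   bands: Sequence[Tuple[float, float, float]]) -> bool:
    """True when the worst-case sharing deviation is within the band's tolerance."""
    return sharing_deviation_pct(currents) <= tolerance_for(load_pct, bands)
```

At the 50% test point, for example, shelf currents of 100, 106, 94, and 100 A give a 6% worst-case deviation: this fails the +/-5% shelf limit but would pass the +/-7% rack limit.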

## 5.1.4 Rack Power Load Voltage Regulators under Dynamic Load

### Purpose

Verify that bus bar power delivery and power balancing remain stable across load levels under dynamic loads applied with test load sleds.

### Prerequisite

Partners can use test load sleds or electronic loads to draw power from the power shelf, instead of assembling the full server rack system.

### Test Conditions

Test under the following input voltage conditions: 200, 208, 220, 240, 277, and 305 VAC.

Test with the following load conditions:

- > Sweep from 10% to 90% load, with 50 ms ON and 50 ms OFF
- > Sweep from 30% to 100% load, with 50 ms ON and 50 ms OFF
- > Test up to the power shelf peak power rating, with 50 ms ON and 50 ms OFF

Test with the following slew rates:

- > 1 A/µs
- > 6 A/µs
- > 12 A/µs
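With 50 ms ON and 50 ms OFF, each load pulse repeats every 100 ms (10 Hz), and the programmed slew rate sets how long each current edge takes: edge time = ΔI / slew rate. The Python sketch below computes trapezoidal breakpoints for one period. The low and high currents are hypothetical inputs (derive them from the shelf rating and the load percentages above), and the sketch assumes each edge is counted inside its ON or OFF interval; check how your electronic load defines pulse width.

```python
from typing import List, Tuple


def edge_time_us(i_low_a: float, i_high_a: float, slew_a_per_us: float) -> float:
    """Duration of one current edge: delta-I divided by slew rate."""
    return (i_high_a - i_low_a) / slew_a_per_us


def pulse_breakpoints(
    i_low_a: float,
    i_high_a: float,
    slew_a_per_us: float,
    on_ms: float = 50.0,
    off_ms: float = 50.0,
) -> List[Tuple[float, float]]:
    """(time_ms, current_a) corners of one trapezoidal ON/OFF load period.

    Assumption: the rising edge starts at t = 0 and the falling edge starts at
    t = on_ms, so each edge is counted inside its own interval.
    """
    ramp_ms = edge_time_us(i_low_a, i_high_a, slew_a_per_us) / 1000.0
    if ramp_ms >= min(on_ms, off_ms):
        raise ValueError("slew rate too low for the requested pulse width")
    return [
        (0.0, i_low_a),
        (ramp_ms, i_high_a),
        (on_ms, i_high_a),
        (on_ms + ramp_ms, i_low_a),
        (on_ms + off_ms, i_low_a),
    ]


if __name__ == "__main__":
    # Hypothetical 10%-90% step on a shelf rated for 1000 A.
    for slew in (1.0, 6.0, 12.0):
        print(f"{slew:>4} A/us -> edge {edge_time_us(100.0, 900.0, slew):.1f} us")
```

For the hypothetical 100 A to 900 A step, the edges last 800 µs, about 133 µs, and about 67 µs at 1, 6, and 12 A/µs, all well inside the 50 ms pulse width.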

### Test Procedure

1. Apply the input voltage for the test condition.
2. Program the electronic load to the designated slew rate (1 A/µs, 6 A/µs, or 12 A/µs).
3. Program the electronic load to sweep through each load condition with a 50 ms ON and 50 ms OFF pulse.
4. Measure and record the following parameters for each test condition:
  - a. DC bus bar voltage
  - b. Vshare bus for power shelves
  - c. AC input current, AC input voltage
  - d. PSC (power shelf controller) event logs
  - e. Power shelf LED status

### Pass or Fail Criteria

Sharing tolerance for PSUs within a common power shelf:

- > +/-5% (tentative) at 40%-100% load
- > +/-7% (tentative) at 20%-30% load
- > +/-10% (tentative) at 10% load

Sharing tolerance for PSUs within a common rack:

- > +/-7% (tentative) at 40%-100% load
- > +/-9% (tentative) at 20%-30% load
- > +/-12% (tentative) at 10% load

## 5.1.5 Rack Power Load Hot Swap PSU

### Purpose

Verify that PSU modules can successfully be hot swapped in a power shelf without interrupting the server rack system.

### Prerequisite

Partners can use test load sleds or electronic loads to draw power from the power shelf, instead of assembling the full server rack system.

### Test Conditions

Test under the following input voltage conditions: 240, 277, and 305 VAC.

For racks with N+N power redundancy, test with only N feeds.

### Test Procedure

1. Turn on the electronic load and load the rack power system to 100%.
2. Hot swap PSU modules one at a time inside a power shelf while the rack is operational.
3. Monitor and record PSU LED status and PSC status telemetry after each hot swap.

### Pass or Fail Criteria

- > PSU modules can successfully be hot swapped in a power shelf without interrupting the system.
- > The PSUs and PSC continue to regulate the bus bar voltage within range with real-time current sharing, and no OVP, UVP, OCP, or other fault is triggered by an interruption or hiccup on the bus bar voltage.
- > No fault is registered in the PSU and PSC telemetry status.
- > All LEDs are showing the correct status.

## 5.1.6 Rack Power Load Hot Swap PSC

### Purpose

Verify that PSC modules can successfully be hot swapped in a power shelf without interrupting the server rack system.

### Prerequisite

Partners can use test load sleds or electronic loads to draw power from the power shelf, instead of assembling the full server rack system.

### Test conditions

Test under the following input voltage conditions: 240 VAC.

### Test Procedure

1. Turn on the electronic load and load the rack power system to 100%.
2. Hot swap the PSC module inside a power shelf while the rack is operational.
3. Monitor and record the neighboring PSU LED status.

### Pass or Fail Criteria

- > The PSUs and PSC continue to regulate the bus bar voltage within range with real-time current sharing, and no OVP, UVP, OCP, or other fault is triggered by an interruption or hiccup on the bus bar voltage.
- > No fault is registered in the PSU and PSC telemetry status.
- > All power shelf LEDs are showing the correct status.

## 5.1.7 Rack Power Fault Recovery

### Purpose

Verify that power shelf protection and fault recovery features (OCP, OTP, brownout, and OVP) work as expected.

### Prerequisite

Partners can use test load sleds or electronic loads to draw power from the power shelf, instead of assembling the full server rack system.

### Test Conditions

Test under the following input voltage conditions: 240 VAC.

Test with the following load conditions:

- > 10%-90% dynamic load
- > 100% static load

### Test Procedure

1. OCP: While the rack is running at 100% static load, remove PSUs one by one until OCP is triggered.
2. OTP: Using a thermal chamber, apply a slow thermal ramp up to the power shelf maximum rating. If a thermal chamber is not available, block airflow to the power shelf until OTP is triggered.
3. Brownout and brown-in: Ramp the input voltage down until the brownout UVLO threshold is reached and the PSUs shut down, then ramp the voltage back up until brown-in recovery occurs. Verify that the recovery feature works for both a slow ramp (lasting several minutes) and a fast ramp (lasting several seconds).
4. OVP: Ramp the voltage up to the power shelf OVP voltage; after the fault is triggered, reduce the voltage. Verify that the protection feature works for both a slow ramp (lasting several minutes) and a fast ramp (lasting several seconds).
5. For each power fault event, check for PSC event records. Verify that the fault records are reported correctly.
6. Verify that the system operates normally after the faults are cleared and the system is reset.
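For the brownout and brown-in step, the thresholds can be read off a log of input voltage and PSU output state captured during one down-then-up ramp. This Python sketch assumes a hypothetical time-aligned log (`vin` in volts, `output_on` as booleans); it is not a PSC log parser. The brown-in voltage is expected to sit above the brownout voltage, and the difference is the hysteresis.

```python
from typing import Optional, Sequence, Tuple


def brown_thresholds(
    vin: Sequence[float], output_on: Sequence[bool]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (brownout_v, brown_in_v) from one down-then-up input voltage ramp.

    brownout_v: input voltage at the first sample where the output turned off.
    brown_in_v: input voltage at the first later sample where it turned back on.
    Either value is None if that transition never occurred in the log.
    """
    if len(vin) != len(output_on):
        raise ValueError("vin and output_on must be the same length")
    brownout: Optional[float] = None
    brown_in: Optional[float] = None
    for k in range(1, len(vin)):
        turned_off = output_on[k - 1] and not output_on[k]
        turned_on = not output_on[k - 1] and output_on[k]
        if turned_off and brownout is None:
            brownout = vin[k]
        elif turned_on and brownout is not None and brown_in is None:
            brown_in = vin[k]
    return brownout, brown_in
```

Running the same analysis on both the slow-ramp and fast-ramp logs makes it easy to confirm that the thresholds do not shift with ramp speed.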

### Pass or Fail Criteria

- > Bus bar power delivery should stay within its acceptable specification range until it turns off.
- > PSC should report faults as expected.
- > System operates normally after the faults are cleared and the system is reset.

## 5.1.8 Rack Power AC Apparent Power

### Purpose

The Electrical Design Peak Point (EDPp) test measures the peak input power on a near-instantaneous timescale, then calculates AC apparent power from the moving RMS voltage and current (computed over one 60 Hz AC cycle).

Peak input power occurs with high transient load, such as the pulse power workload in the Partner Diagnostics.

The L11 rack-level EDPp test measures input AC power and, optionally, DC power and the iShare signal. Two 8-channel oscilloscopes might be required to probe AC voltage and current plus the optional iShare signal or the AC input power of a second power shelf. The following figure is a block diagram of the test setup.

**Figure 5-1. L11 Power Measurement Block Diagram**



### Assumptions and Estimates

- > The efficiency, power factor, and power sharing among all power shelves are roughly the same.
- > The total L11 rack-level AC input power can be estimated as 4x (or 6x) that of a single power shelf.

### Required Test Equipment

- > 1x 8-channel oscilloscope (for example, Tektronix 5 or 6 Series MSO B). A second 8-channel oscilloscope is optional for measuring the iShare signal and the second power shelf input power.
- > Optional 1x coaxial cable (for synchronizing the oscilloscopes).
- > 4x current probes (for example, Tektronix TCP0150 current probe); 7x current probes if also measuring the second power shelf input power.
- > 3x high-voltage differential probes (for example, Tektronix TMDP0200 High-Voltage Differential Probe); 6x high-voltage differential probes if also measuring the second power shelf input power.
- > 1x low-voltage differential probe (for example, Tektronix TDP1500 Differential Probe).
- > 1x passive voltage probe (for example, Tektronix TP1000 Passive Voltage Probe).
- > 1x sacrificial AC power whip; 2x sacrificial AC power whips if measuring input power on two power shelves.



### Caution:

- > DO NOT use probe ground on the AC input unless approved by a specialist familiar with the facility's AC setup at the partner site.
- > Use high-voltage differential probes, such as the Tektronix TMDP0200, to probe the live AC input voltage. Differential voltage probes are used to cancel out noise on the AC network.
- > If passive probes are used to measure the hot wire relative to ground, the oscilloscope must be isolated from the AC network by powering the scope from a battery or an isolation transformer.

### Modify AC Power Whip for AC Power Probing

On the AC power cord, perform the following modifications:

1. Remove approximately 15 inches of the outer rubber insulator to reveal individual insulated conductors.
2. On the black, red, orange (hot), and white (neutral) wires, remove about 1/8-inch of insulation to expose the conductors.
3. Attach the positive lead (red) of a high-voltage differential probe to the black (hot) wire, and the negative lead (black) to the neutral (white) wire.
4. Apply Kapton tape around the probes and exposed wires.
5. Place the 3x current probes and apply a second layer of Kapton tape around the probes to secure all the current probes and voltage probing leads.

Figure 5-2. Power Whip Wiring Diagram



Figure 5-3. Current Probes and Voltage Probes Setup and Power Whip



### Prepare Compute Tray PDB for DC Power Probing (Optional)

1. (Optional) Open the compute tray cover and add voltage probing leads and twisted-pair wires to the PDB.
2. (Optional) Place the current probe around the wires from the bus bar connectors to the PDB.



3. (Optional) Cut a piece of white corrugated twinwall plastic to cover the top of the compute tray, leaving an opening for the current probe.



4. (Optional) Reinstall the reworked compute tray in the top tray position.

### Connect AC Probes

Do the following per power shelf:

1. Verify the power whip top box is off.
2. Attach a modified AC power whip to the power shelf.
3. Reconnect the differential voltage probe heads to the main part of the probe.
4. Attach a current probe around the exposed black (hot) wire (not around the isolated bundle), with the arrow on the probe pointing towards the system.
5. Connect the current probe to the scope.
6. With no voltage or current on the DUT whip, perform the current probe degauss and auto-zero and, optionally, the scope signal path compensation (SPC).

**Figure 5-4. Connected Power Cable and Current Probe Attachments Example**



### Set Up the Oscilloscope (Optional for the Second Oscilloscope)

1. (Optional) Place the two scopes on carts adjacent to the system.
2. (Optional) Connect the coaxial cable from the “Aux Out” port on the primary scope to the “Ref In” port on the secondary scope.
3. (Optional) On the primary scope’s toolbar, select Utility → I/O and use the “Reference Clock” option under “AUX Out Signal.”

4. (Optional) On the secondary scope, open the acquisition menu and select “External (10 MHz)” under “Timebase Reference Source.”



**Note:** If using 2x 4-channel scopes, two additional requirements must be met:

- > The scopes’ sampling clocks need to be synchronized through wiring and configured according to each scope’s manual.
- > Each scope should dedicate one channel to measure DC voltage, allowing for the calculation of the time offset between the two scopes.

5. Connect voltage and current probes to the scopes.
6. On the scopes, configure the following sample rate and record length. The horizontal scale can be set to 200 s/div to obtain the maximum sampling duration of 2000 seconds.



7. Channel setup:
  - a. 6x channels for 3-phase AC input voltage and current.
  - b. 2x channels for DC bus bar voltage and current.
  - c. All 8x channels must be synchronized for transient power analysis, so an 8-channel scope should be acquired or leased.
8. On the oscilloscope, add labels for each channel and set up the scale and position. See **Section 6.2 “Input EDPp Post-Processing Guideline”** for additional information.
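The horizontal-scale arithmetic in step 6 can be checked before starting a long capture. This Python sketch assumes 10 horizontal divisions (typical of Tektronix MSO displays; confirm for your model) and treats the sample rate and maximum record length as hypothetical inputs, since the configured values come from the scope setup rather than from this sketch.

```python
HORIZONTAL_DIVISIONS = 10  # assumption: 10 horizontal divisions on the scope display


def capture_duration_s(seconds_per_div: float) -> float:
    """Total acquisition window for a given horizontal scale."""
    return seconds_per_div * HORIZONTAL_DIVISIONS


def record_length_points(sample_rate_sps: float, duration_s: float) -> int:
    """Number of samples per channel needed to cover duration_s at sample_rate_sps."""
    return round(sample_rate_sps * duration_s)


def fits_record(sample_rate_sps: float, seconds_per_div: float, max_points: int) -> bool:
    """True if the capture fits in a (hypothetical) maximum record length per channel."""
    needed = record_length_points(sample_rate_sps, capture_duration_s(seconds_per_div))
    return needed <= max_points
```

At 200 s/div the window is 2000 s; at a hypothetical 10 kS/s that is 20 million points per channel, so the scope's record length option must be at least that large.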



### Setting up the Power Pulse Diag Workload

Use a remote server to run the L11 Partner Diagnostics power pulse diagnostics. Run the CPU and GPU power pulse stress workload for 15 minutes.

### Data Processing and Interpreting Results

1. Monitor and capture the waveforms for AC voltage and current, optional DC voltage, and system power.
2. Perform power analysis on the captured waveform for the moving average and apparent power. See **Section 6.2 “Input EDPp Post-Processing Guideline”** for additional information.
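Section 6.2 remains the reference post-processing procedure. As an illustration of the idea only, the Python sketch below computes per-phase moving RMS voltage and current over a window of one 60 Hz cycle, sums the per-phase Vrms × Irms products into a three-phase apparent power trace, and reports the peak. It assumes time-aligned per-phase samples at a known sample rate; the optional `shelves` multiplier is a hypothetical parameter that applies the 4x (or 6x) single-shelf estimate from the assumptions above.

```python
import math
from typing import List, Optional, Sequence, Tuple


def moving_rms(x: Sequence[float], window: int) -> List[float]:
    """Sliding-window RMS; element k covers samples k .. k + window - 1."""
    if window <= 0 or window > len(x):
        raise ValueError("window must be between 1 and len(x)")
    acc = sum(s * s for s in x[:window])  # running sum of squares
    out = [math.sqrt(acc / window)]
    for k in range(window, len(x)):
        acc += x[k] * x[k] - x[k - window] * x[k - window]
        out.append(math.sqrt(max(acc, 0.0) / window))  # guard tiny negative drift
    return out


def peak_apparent_power_va(
    phases: Sequence[Tuple[Sequence[float], Sequence[float]]],
    sample_rate_sps: float,
    line_hz: float = 60.0,
    shelves: int = 1,
) -> float:
    """Peak of the summed per-phase moving Vrms * Irms, one AC cycle per window.

    phases: time-aligned (voltage, current) sample lists, one pair per phase.
    shelves: hypothetical multiplier for the equal-sharing rack estimate.
    """
    window = round(sample_rate_sps / line_hz)
    total: Optional[List[float]] = None
    for v, i in phases:
        s = [vr * ir for vr, ir in zip(moving_rms(v, window), moving_rms(i, window))]
        total = s if total is None else [a + b for a, b in zip(total, s)]
    if total is None:
        raise ValueError("at least one phase is required")
    return max(total) * shelves
```

Choosing a sample rate that is an integer multiple of 60 Hz keeps the one-cycle window exact; otherwise the window is rounded to the nearest sample.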

### Pass or Fail Criteria

Verify the maximum apparent power is within the specification listed in the *NVIDIA GB200 and GB300 NVL72 Rack System Specifications* (NVOnline: 1117886 and 1126548).

## 5.1.9 Rack Power Noise

### Purpose

Verify that the power shelf output voltage ripple and noise stay within the maximum limit: voltage noise must be less than 5.5 V peak-to-peak at the applied load frequencies.

### Prerequisite

Partners must use a fully populated rack system for this test.

### Test Conditions

Test under the following input voltage conditions: 208, 240, and 277 VAC.

### Test Procedure

1. Run CPU and GPU power pulse stress workload from the GB200 and GB300 NVL72 L11 Partner Manufacturing Diagnostics.
2. Monitor and capture the waveforms for the power shelf output to the entire rack using an oscilloscope with the highest sampling rate allowed.
3. Measure the DC bus bar voltage ( $V_{in}$ ) peak-to-peak value.
4. Monitor PSC event logs for any faults or warnings.

### Pass or Fail Criteria

Verify that the maximum DC bus bar voltage ( $V_{in}$ ) peak-to-peak value is less than 5.5V.
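The peak-to-peak measurement can also be repeated on exported samples, both over the whole capture and in shorter windows to locate the worst event during the power pulse workload. A minimal Python sketch, assuming a list of bus bar voltage samples in volts; the window length is a hypothetical analysis choice, not a requirement of this guide.

```python
from typing import Sequence, Tuple

LIMIT_VPP = 5.5  # pass/fail limit from this section, in volts


def peak_to_peak(v: Sequence[float]) -> float:
    """Difference between the highest and lowest sample."""
    return max(v) - min(v)


def worst_window(v: Sequence[float], window: int) -> Tuple[int, float]:
    """(start_index, pk_pk) of the non-overlapping window with the largest swing."""
    if window <= 0:
        raise ValueError("window must be positive")
    best_start, best_pp = 0, -1.0
    for start in range(0, len(v), window):
        pp = peak_to_peak(v[start:start + window])
        if pp > best_pp:
            best_start, best_pp = start, pp
    return best_start, best_pp


def noise_passes(v: Sequence[float]) -> bool:
    """True when the overall peak-to-peak value is strictly below the limit."""
    return peak_to_peak(v) < LIMIT_VPP
```

The worst-window index can be mapped back to a timestamp on the scope to correlate the event with PSC event log entries.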

## 5.1.10 Rack Power Firmware Update

### Purpose

Verify that the power shelf PSUs and PSC can successfully perform firmware updates and firmware downgrades.

### Prerequisite

Partners can test without any compute tray or switch tray.

### Test Conditions

- > Test under the following input voltage conditions: 240 VAC.
- > Loading: 100% of N+0 rating. For an N+N system, operate with a single feed during testing.

### Test Procedure

1. Upgrade or downgrade a single PSU at a time. Verify that each PSU firmware can be updated successfully.
2. Upgrade or downgrade 2x PSUs at a time. Verify that both PSUs have the correct firmware flashed.
3. Upgrade or downgrade one PSC at a time. Verify that both the primary and secondary firmware images are flashed correctly.

### Pass or Fail Criteria

PSUs and PSCs should upgrade and downgrade successfully without failures.

## 5.1.11 PSC Reboot Cycles

### Purpose

Verify that the PSC (power shelf controller) reboot cycle does not affect power delivery.

### Prerequisite

Partners can use test load sleds or electronic loads to draw power from the power shelf, instead of assembling the full server rack system.

### Test Conditions

- > Test under the following input voltage conditions: 240 VAC.
- > Loading: 100% of N+0 rating. For an N+N system, operate with a single feed during testing.

### Test Procedure

1. Reboot each PSC, one at a time.
2. During each reboot, monitor the DC bus bar voltage and verify that the bus bar power remains stable during and after the reboot.
3. Read PSC status telemetry after each reboot. Verify there are no errors.

### Pass or Fail Criteria

Bus bar power delivery remains stable during and after each PSC reboot.

## 5.1.12 Rack Power Factor

### Purpose

Power factor is the ratio of the real power drawn by the load to the apparent power (RMS voltage multiplied by RMS current); a value closer to 1 means the rack draws less current from the AC source for the same real power. This test measures the power factor of the rack power supply under various load conditions.

### Prerequisite

Partners must use a fully populated rack system for this test.

### Test Conditions

Test with the following load conditions:

- > Static light load
- > Static heavy load
- > Dynamic load at low frequency (CPU and GPU power pulse stress at low frequency)

### Test Procedure

1. Run the server rack under each load condition: static light load, static heavy load, and dynamic load at low frequency.
2. Measure the 3-phase AC voltage and current with an oscilloscope at a sample interval of 100 µs/pt (10 kS/s).
3. Calculate the power factor over windows of 1, 2, 3, 5, 10, or 60 AC cycles (one 60 Hz cycle is about 16.7 ms). Compute power factor as the real power drawn by the server divided by the apparent power, that is, the product of the 3-phase RMS voltage and current.
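The calculation in step 3 can be written as PF = P / S, with real power P taken as the mean of the instantaneous v × i product and apparent power S as the product of RMS voltage and RMS current, each summed over the three phases. This Python sketch assumes time-aligned per-phase samples and evaluates the first `cycles` cycles of the capture. At 100 µs/pt (10 kS/s), a 60 Hz cycle spans about 166.7 samples, so only some windows (3 cycles = 500 samples, 60 cycles = 10,000 samples) map to whole sample counts; the sketch rounds the others, which adds a small error.

```python
import math
from typing import Sequence, Tuple


def power_factor(
    phases: Sequence[Tuple[Sequence[float], Sequence[float]]],
    sample_rate_sps: float,
    cycles: int,
    line_hz: float = 60.0,
) -> float:
    """Three-phase power factor over the first `cycles` AC cycles of the capture."""
    n = round(sample_rate_sps * cycles / line_hz)
    real_w = 0.0
    apparent_va = 0.0
    for v, i in phases:
        if len(v) < n or len(i) < n:
            raise ValueError("capture is shorter than the requested window")
        v_w, i_w = v[:n], i[:n]
        real_w += sum(a * b for a, b in zip(v_w, i_w)) / n  # mean instantaneous power
        v_rms = math.sqrt(sum(a * a for a in v_w) / n)
        i_rms = math.sqrt(sum(b * b for b in i_w) / n)
        apparent_va += v_rms * i_rms
    return real_w / apparent_va
```

For sinusoidal voltage and current with the current lagging by 30°, the result is cos 30° ≈ 0.866; real rack input current also carries harmonics, which this definition captures because it works on the raw samples.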

### Pass or Fail Criteria

- > Power factor for static light load should be greater than 0.90.
- > Power factor for static heavy load should be greater than 0.95.
- > Power factor for dynamic load at low frequency should be greater than 0.85.

## 5.2 L11 Environmental, Reliability

NVIDIA recognizes the diversity of potential use cases that system partners may encounter. The following test lists provide NVIDIA guidance for the areas of environmental factors, reliability, and robustness.

Partners are responsible for validating the system packaging, reliability, and environmental testing to meet their target deployment.

Partners are encouraged to modify these tests or conduct additional testing as necessary for their specific use cases, to ensure that all potential scenarios are addressed appropriately.

## 5.2.1 L11 Packaging

### Purpose

The purpose of performing package testing is to ensure that the packaged rack system can withstand the physical stress it may encounter during transportation, storage, and handling.

### Test Procedure

Partners should test to the industry test standards of ISTA (International Safe Transit Association). Perform the test items in the following table using the standards listed in the “Test Procedure” column.

**Table 5-1. L11 Packaging Test Summary**

| ID    | Subcategory                                                       | Test Item Description                                                                                                                              | Test Procedure                        |
|-------|-------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------|
| RPK.1 | Package/Product Visual Inspection and Product Functional PRE-Test | Run functional test on the product using the latest manufacturing diagnostic. Perform visual inspection and record all noted cosmetic observations | Visual inspection and MFG Diagnostics |
| RPK.2 | Atmospheric Pre-Conditioning | Packaged product shall be preconditioned to laboratory ambient temperature and humidity for a minimum of 12 hours | ISTA 3B, Block #1 |
| RPK.3 | Atmospheric Conditioning (Required only for non-rigid containers) | [Hot, Humid] then [Extreme Heat, Moderate RH]<br>[38°C, 85% RH for 72 hours] then [60°C, 30% RH for 6 hours] | ISTA 3B, Block #1 |
| RPK.4 | Tip Test | Test up to a 22-degree angle without letting the product fall. Report the center of gravity tilt angle.<br>If the packaged product fails this test, DO NOT let it fall to the floor.<br>Gently return the product to the upright position and note the failure. Continue testing. | ISTA 3B, Block #2 |
| RPK.5 | Shock – Rotational Drop | Drop height 6 in (150 mm) for packaged products weighing 500 lb (230 kg) or more | ISTA 3B, Block #5 |
| RPK.6     | Shock – Incline or Horizontal Impact                              | Impact velocity 48 in/sec (1.2 m/sec) minimum or 3 in (76 mm) drops                                                                                                                                                                                     | ISTA 3B, Block #6               |
| RPK.7     | Truck Vibration - Random with top load                            | Overall 0.54 Grms, 120 minutes                                                                                                                                                                                                                          | ISTA 3B, Block #9               |
| RPK.8     | Air Vibration - Random                                            | Combination of High (0.29 Grms), Medium (0.22 Grms) and Low (0.16 Grms) air profile; 120 minutes                                                                                                                                                        | ASTM D4169-23, section 12.4.2.2 |
| RPK.9 | Concentrated Impact (Required only for non-rigid containers) | Free-fall or guided-fall drop | ISTA 3B, Block #10 |
| RPK.10 | Fork Lift Handling (Required only for non-rigid containers) | Flat Push and Rotate Tests, Elevated Push and Pull Tests, Elevated Rotate Tests, Load stability test over a handling course. | ISTA 3B, Block #15 |
| RPK.11 | Shock – Rotational Drop | Drop height 6 in (150 mm) for packaged products weighing 500 lb (230 kg) or more | ISTA 3B, Block #13 |

### Pass or Fail Criteria

All test items in Table 5-1 are completed, the product sustains no damage, and the ISTA requirements are met:

- > Product is damage-free; no structural damage, including no detached, loose, fractured, or deformed material parts.
- > Product cosmetic areas are not degraded beyond manufacturing or final acceptance criteria. Cosmetic damage is any abnormality that makes the product unacceptable to the customer.
- > No conductive particles (wire, connectors, and so on) are exposed because of the testing.
- > The product meets all manufacturing specifications and tolerances after testing.
- > All product electrical and software functions perform to specification.

## 5.2.2 L11 Shock and Vibration Test

### Purpose

The purpose of shock and vibration testing is to ensure that the rack system can withstand physical stress during transportation, handling, and operation. The testing simulates the effects of sudden impacts and continuous or repetitive movements to verify the rack system's durability and functionality.

### Test Procedure

Perform the test items in the following table and verify the expected results for each.

**Table 5-2. L11 Shock and Vibration Test Summary**

| ID     | Subcategory                        | Test Item Description                                                                                                                                                                                           | Expected Results                                                                                  |
|--------|------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| RSV.1  | Rack SV Vibration Random Operating | Frequency (Hz) / PSD level (G²/Hz) / Slope (dB/oct)<br>5 - 10 / - / +12<br>10 - 50 / 0.0002 / -<br>50 - 100 / - / +12<br>30 min per axis; 3 axes; 0.108 Grms | No permanent system damage, no parts or connectors dislodged, and system stays operational |
| RSV.6  | Rack SV Handling Obstacle          | Roll over a 4mm vertical step with each caster independently at 0.8m/s, 5 times each                                                                                                                            | This can be done successfully without damage or needing special equipment and meets specification |
| RSV.7  | Rack SV Handling Gap               | Traverse 25.4mm wide gap in the floor at 0.5m/s, 5 times                                                                                                                                                        | This can be done successfully without damage or needing special equipment and meets specification |
| RSV.8  | Rack SV Handling Ramp              | Transition a 5-degree ramp both up and down. From a flat surface, traverse a 5 degree ramp up to an elevated flat surface. From an elevated flat surface, traverse down a 5-degree ramp down to a flat surface. | This can be done successfully without damage or needing special equipment and meets specification |
| RSV.9  | Rack SV Handling Roll-off          | 19mm, 5 times per direction (with front leading and then with rear leading)                                                                                                                                     | This can be done successfully without damage or needing special equipment and meets specification |
| RSV.10 | Rack SV Handling Distance Roll     | Roll 800m on a concrete floor at 0.8m/s, without stopping for the prescribed distance. After completing the four tests above, the force (kgw) required to push the rack from a non-moving position along a smooth, flat cement floor SHALL be less than 5% of the total combined weight (kgw) of the rack and IT Gear. | This can be done successfully without damage or needing special equipment |
| RSV.11 | Rack SV Handling Tilt                      | Tilted to an angle of 10 degrees from its normal upright position and held in this position for 1 minute. Repeated for all four sides (front, back, left, right).<br><br>(1) Test with maximum payload, evenly distributed in rack. Add-on stabilizers MAY be used to meet this requirement for rack at max payload.<br><br>(2) base rack with no payload SHALL also pass with no stabilizer                | This can be done successfully without damage and meets specification. |
| RSV.12 | Rack SV Handling Leveling                  | Leveling feet SHALL be raised and lowered individually to raise the rack 15 mm off the floor until all 4 are raised. Start with front left, then rear right, rear left, and finally front right. Lower in reverse order. Repeat for 3 cycles. | This can be done successfully without damage and meets specification. |
| RSV.13 | Rack SV Handling Lateral Load              | While in its normal position on a flat surface, with casters rotated towards the surface, a force equal to 20 percent of the weight of the fully loaded enclosure system, but not more than 250 N (56.2 lbf), is applied in any direction except upwards, at a height not exceeding 2 m (78.74 in) from the floor. The force is applied to the front, back, and each side of the system, each for 1 minute. | This can be done successfully without damage or falling over.         |
| RSV.14 | Rack NVLink Traffic + Stress Test after SV | Run the L11 Diag traffic test on the rack                                                                                                                                                                                                                                                                                                                                                                   | Relevant test workload is passing.                                    |
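Two of the handling limits in Table 5-2 (RSV.10 and RSV.13) are simple arithmetic on the rack mass. In this Python sketch the mass values are hypothetical inputs; kgf is treated as numerically equal to kg of mass, and standard gravity (9.80665 m/s²) converts mass to newtons for RSV.13.

```python
STANDARD_GRAVITY = 9.80665  # m/s^2
LATERAL_CAP_N = 250.0       # RSV.13 cap (56.2 lbf)


def max_push_force_kgf(rack_and_gear_mass_kg: float) -> float:
    """RSV.10: start-up push force must stay below 5% of the combined weight (kgf)."""
    return 0.05 * rack_and_gear_mass_kg


def lateral_test_force_n(enclosure_mass_kg: float) -> float:
    """RSV.13: 20% of the fully loaded weight, but not more than 250 N."""
    return min(0.20 * enclosure_mass_kg * STANDARD_GRAVITY, LATERAL_CAP_N)
```

Because 20% of the weight reaches 250 N at about 127.5 kg (250 / (0.2 × 9.80665)), the 250 N cap normally governs for a fully loaded rack. For a hypothetical 1500 kg combined mass, the RSV.10 push-force limit is 75 kgf.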

### Pass or Fail Criteria

Test items in Table 5-2 are completed and passed.

## 5.2.3 L11 Environmental Reliability Test

### Purpose

The purpose of the environmental reliability test is to evaluate the hardware's ability to operate effectively and reliably under various environmental conditions, such as temperature extremes, humidity, and altitude.

### Test Procedure

Perform the test items in the following table using the test conditions outlined in the table.

**Table 5-3. L11 Reliability Test Summary**

| ID   | Subcategory                                               | Test Item Description                                                                                                                                                                                                                                | Expected Results                                                                                                               |
|------|-----------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| RL.1 | Rack Reliability Pre Baseline                             | No device missing and diag must pass.                                                                                                                                                                                                                | The system is stable and passes all tests.                                                                                     |
| RL.2 | Rack Reliability Temperature and Humidity (Non-operating) | 1. 25°C, 50% RH for 1 hour<br>2. -40°C, No Humidity control for 16 hours<br>3. 70°C, 5% RH for 16 hours<br>4. 70°C, 95% RH for 16 hours<br>5. 40°C, 93% RH for 96 hours<br>6. 25°C, 50% RH for 4 hours<br>7. Max ramp rate: 20°C per hour; 20% RH per hour | No permanent system damage or degradation. |
| RL.3 | Rack Reliability Temperature and Humidity (Operating) | 1. 25°C, 50% RH for 1 hour<br>2. 0°C, No Humidity control for 24 hours<br>3. Max Temp °C, 50% RH for 24 hours<br>4. 31°C, 85% RH for 24 hours<br>5. 25°C, 10% RH for 24 hours<br>6. Max ramp rate: 10°C per hour; 10% RH per hour | No permanent system damage or degradation.<br>Liquid cooled: max temperature per design spec (condensation must be prevented). |
| RL.4 | Rack Reliability Connector Durability | 50 cycles / 0F<br>Test system connectors: cable cartridge, PSUs, and network ports | No permanent system damage or degradation. |
| RL.5 | Rack Reliability Manifold Mate Durability | 50 cycles / 0F<br>Test trays and manifold connection | No connector damage or leakage. |
| RL.6 | Rack Reliability Hard Boot low temperature                | 1. 0°C for 500 power shelf AC cycles<br>2. 500 cyc / 2F<br>(Condensation must be prevented)                                                                                                                                                          | No boot failures and no permanent system damage or degradation.                                                                |
| RL.7 | Rack Reliability Hard Boot high temperature | 1. Max Temp °C, 50% RH for 500 power shelf AC cycles<br>2. 500 cyc / 2F | No boot failures and no permanent system damage or degradation. |
| RL.13     | Rack Reliability Soft Boot (OS reboot) low temp         | 1. 0°C for 500 boot cycles<br>2. 500 cyc / 3F<br>(Condensation must be prevented)                           |                                                                                                       |
| RL.14     | Rack Reliability Soft Boot (OS reboot) high temperature | 1. Max Temp °C, 50% RH for 500 boot cycles<br>2. 500 cyc / 3F                                               |                                                                                                       |
| RL.8      | Rack Reliability Soak and Boot low temp                 | 1. 25°C for 3000 power shelf boot cycles<br>2. 3000 cyc / 6F                                                | No boot failures and no permanent system damage or degradation.                                       |
| RL.10     | Rack Reliability CMTBF Analysis                         | Telcordia SR-332 Issue 4<br>Method I, Case 1<br>25°C and max temp °C<br>Ground Benign operating environment | System CMTBF meets expectations, and any preventable failures are highlighted. |
| RL.12     | Rack Reliability Post Baseline                          | No device missing and diag must pass.                                                                       | The system is stable and passes all tests, and performance is equivalent to the pretest baseline.     |
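Chamber profiles such as RL.2 spend time ramping between steps as well as dwelling at them. The sketch below estimates the minimum ramp time implied by the maximum ramp rates. It assumes (an assumption, not stated in the table) that each transition is a linear ramp limited by whichever of the temperature or humidity rates is binding, and that RH is ignored whenever either step has no humidity control. The `Step` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    temp_c: float
    rh_pct: Optional[float]  # None means "no humidity control"
    dwell_h: float


def ramp_hours(a: Step, b: Step, max_c_per_h: float, max_rh_per_h: float) -> float:
    """Minimum time to move from step a to step b at the maximum ramp rates."""
    hours = abs(b.temp_c - a.temp_c) / max_c_per_h
    # Assumption: RH only constrains the ramp when both steps control humidity
    if a.rh_pct is not None and b.rh_pct is not None:
        hours = max(hours, abs(b.rh_pct - a.rh_pct) / max_rh_per_h)
    return hours


def profile_hours(steps, max_c_per_h, max_rh_per_h):
    """Return (total dwell hours, minimum total ramp hours) for a profile."""
    dwell = sum(s.dwell_h for s in steps)
    ramps = sum(ramp_hours(a, b, max_c_per_h, max_rh_per_h)
                for a, b in zip(steps, steps[1:]))
    return dwell, ramps


# RL.2 (non-operating) steps 1-6 from Table 5-3; max ramp 20°C/h and 20% RH/h
rl2 = [
    Step(25, 50, 1),
    Step(-40, None, 16),
    Step(70, 5, 16),
    Step(70, 95, 16),
    Step(40, 93, 96),
    Step(25, 50, 4),
]
dwell, ramps = profile_hours(rl2, 20.0, 20.0)
print(f"dwell {dwell} h, minimum ramp time {ramps:.2f} h")
```

Under these assumptions RL.2 needs 149 hours of dwell plus at least 16.9 hours of ramping. The final transition (93% to 50% RH) is humidity-limited at 2.15 hours rather than temperature-limited at 0.75 hours.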

### Pass or Fail Criteria

The test items in Table 5-3 are completed and passed.
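RL.10 calls for a calculated MTBF (CMTBF) per Telcordia SR-332 Method I. As a rough illustration of the parts-count arithmetic only, the sketch below assumes a device steady-state failure rate of the form λ_G · π_Q · π_S · π_T in FITs (failures per 10^9 hours), a unit failure rate equal to π_E times the quantity-weighted sum of device rates, and MTBF = 10^9 / FITs. Every part name and numeric value is a hypothetical placeholder, and π_E defaults to 1.0 only as a placeholder; the actual factors for the Ground Benign environment and for 25°C versus the maximum temperature must come from the SR-332 tables.

```python
from dataclasses import dataclass


@dataclass
class PartClass:
    name: str
    quantity: int
    lambda_g_fit: float  # generic failure rate, FITs (failures per 1e9 hours)
    pi_q: float = 1.0    # quality factor (placeholder)
    pi_s: float = 1.0    # electrical stress factor (placeholder)
    pi_t: float = 1.0    # temperature factor; differs at 25°C vs max temp


def unit_fit(parts, pi_e: float = 1.0) -> float:
    """Quantity-weighted sum of device failure rates, scaled by pi_E."""
    return pi_e * sum(p.quantity * p.lambda_g_fit * p.pi_q * p.pi_s * p.pi_t
                      for p in parts)


def mtbf_hours(fit: float) -> float:
    """Convert a failure rate in FITs to MTBF in hours."""
    return 1e9 / fit


# Hypothetical bill of materials, for illustration only
parts = [
    PartClass("capacitor", quantity=2, lambda_g_fit=100.0),
    PartClass("regulator", quantity=1, lambda_g_fit=300.0),
]
fit = unit_fit(parts)
print(f"{fit:.0f} FIT -> MTBF {mtbf_hours(fit):,.0f} h")
```

Evaluating the same bill of materials twice, once with π_T values for 25°C and once for the maximum temperature, gives the two CMTBF figures the test item asks for.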

---

# Chapter 6. Appendix 1: Oscilloscope Acquisition Modes for EDPP Measurements

## 6.1 Moving Average Measurement using Oscilloscope HiRes Mode

When using “High Res” acquisition for input EDPP measurement, set the oscilloscope acquisition mode to “High Res” (labeled “Hi Res” on some oscilloscopes).

In “High Res” mode, most oscilloscopes oversample and report the average of all samples taken within each acquisition interval. Because this mode performs boxcar averaging on the samples within each acquisition interval, input EDPP current and power measurements can be performed directly in “HiRes” mode.

**Figure 6-1. Acquisition Mode**

For example, a 5 kS/s sample rate equates to a 200 µs/pt moving average timescale (1 / 5,000 samples per second = 200 µs), and a 2.5 kS/s sample rate equates to a 400 µs/pt moving average timescale.
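The relationship between HiRes sample rate and averaging timescale can be checked offline. The sketch below emulates HiRes acquisition on an oversampled record by boxcar-averaging each disjoint block of raw samples into one output point. The function names and the raw sample rate are hypothetical, and a real oscilloscope's HiRes implementation may differ in detail. Note that this produces one averaged point per interval (a decimating boxcar), which is not the same as the sliding-window average used in Section 6.2.

```python
import numpy as np


def hires_emulate(raw, raw_rate_hz: float, hires_rate_hz: float) -> np.ndarray:
    """Boxcar-average disjoint blocks of raw samples, one block per HiRes point."""
    factor = int(round(raw_rate_hz / hires_rate_hz))  # raw samples per output point
    if factor < 1:
        raise ValueError("HiRes rate must not exceed the raw sample rate")
    raw = np.asarray(raw, dtype=float)
    n_points = len(raw) // factor
    # Drop the incomplete trailing block, then average each block of `factor` samples
    return raw[: n_points * factor].reshape(n_points, factor).mean(axis=1)


def averaging_timescale_us(hires_rate_hz: float) -> float:
    """Each HiRes point spans one output sample interval, in microseconds."""
    return 1e6 / hires_rate_hz


# A 25 kS/s raw record averaged down to 5 kS/s: each output point spans 200 µs
points = hires_emulate(np.arange(10), 25_000, 5_000)  # [2.0, 7.0]
print(points, averaging_timescale_us(5_000))
```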

Oscilloscope settings for a 200 µs timescale moving average measurement:

- **X-Axis (suggested):** 20 ms/division, or a custom setting that captures the transient
- **Y-Axis (suggested):** 5 A/division or 10 A/division, based on the maximum expected current transient level
- **Trigger:** Edge trigger. First, determine the “idle” current level using AUTO trigger. A good edge trigger level lies between the idle level and the expected average power capping limit, so that the initial spike is captured.
- **Persistence mode:** Off
- **Vertical setting:** Channel 1/2/3/4
  - Set termination to 50 ohm and bandwidth to 20 MHz to reduce noise.
- **Horizontal/Acquisition tab:** Acquisition
  - Set acquisition mode to “High Res,” Fast Acquisition mode to Off, Roll Mode to Off, and Sampling Mode to Real-Time.
- **“HiRes” Averaging:** 200 µs/pt
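The suggested settings can be sanity-checked with simple arithmetic. The sketch below assumes a 10-division horizontal graticule (common, but check the instrument) and places the edge trigger at the midpoint between the idle current and the current at the expected power capping limit, which is just one choice within the range recommended above. The function name and the current values are hypothetical.

```python
def capture_plan(time_per_div_s, hires_interval_s, idle_a, cap_a, divisions=10):
    """Return (record length in s, HiRes points per record, trigger level in A)."""
    record_s = time_per_div_s * divisions
    points = round(record_s / hires_interval_s)
    # Any level strictly between idle and cap works; the midpoint is one choice
    trigger_a = (idle_a + cap_a) / 2
    return record_s, points, trigger_a


# 20 ms/div at 200 µs/pt, with hypothetical 20 A idle and 60 A cap-level currents
record_s, points, trigger_a = capture_plan(20e-3, 200e-6, idle_a=20.0, cap_a=60.0)
print(f"record {record_s * 1e3:.0f} ms, {points} points, trigger {trigger_a:.1f} A")
```

With these inputs a 200 ms record holds 1,000 HiRes points and the trigger sits at 40 A.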

The following image shows an example oscilloscope setting for a 200 µs moving average timescale measurement.

**Figure 6-2. Oscilloscope Setting 200 µs Moving Average Example**

## 6.2 Input EDPP Post-Processing Guideline

When using the post-processing technique for input EDPP calculation, set the oscilloscope acquisition mode to sample mode. When performing input EDPP moving average analysis on captured oscilloscope waveform files (\*.wfm), the following procedure can be implemented programmatically.

### 1. Waveform Data Loading and Processing:

- Load waveform data from multiple .wfm files, representing current and voltage waveforms for both DC and AC.
- Calculate the start and stop times for data extraction based on user-defined stop/start parameters. For example,
- Set up scope sampling rates, horizontal/vertical scales, and other parameters required for processing the waveform data. For example, a good sampling rate for input EDPP analysis is 100 µs/pt with a total record length of 1500 seconds (150 seconds per horizontal division). A sampling rate of 100 µs/pt corresponds to a sampling frequency of 10,000 samples per second. The sampling frequency and the number of samples per moving average window are then determined by the following formulas:

$$\text{Scope sampling frequency} = \frac{1}{\text{Recommended Scope Sampling Rate}}$$

$$\text{Number of samples per window} = \text{Moving average window (s)} \times \text{Scope sampling frequency (samples/s)}$$

**Table 6-1. Input EDPP and Samples per Window in Sample Mode**

| Timescale               | Moving Average Window | Scope Sampling Rate | Scope Sampling Frequency (Hz) | Number of Samples per Window |
|-------------------------|-----------------------|---------------------|-------------------------------|------------------------------|
| Input EDPC              | 1 second              | 100 µs/pt           | 10,000 Hz                     | 10,000                       |
| Input EDPP <sub>1</sub> | 50 ms                 | 100 µs/pt           | 10,000 Hz                     | 500                          |
| Input EDPP <sub>1</sub> | 50 ms                 | 40 µs/pt            | 25,000 Hz                     | 1250                         |
| Input EDPP <sub>2</sub> | 400 µs                | 100 µs/pt           | 10,000 Hz                     | 4                            |
| Other                   | 20 ms                 | 100 µs/pt           | 10,000 Hz                     | 200                          |
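The formulas above can be checked against Table 6-1 with a few lines of Python. This is a minimal sketch; the function names are illustrative, and rounding to the nearest whole sample is an assumption for windows that are not exact multiples of the sample interval.

```python
def sampling_frequency_hz(sample_interval_s: float) -> float:
    """Scope sampling frequency = 1 / scope sampling rate (seconds per point)."""
    return 1.0 / sample_interval_s


def samples_per_window(window_s: float, sample_interval_s: float) -> int:
    """Number of samples per moving average window, rounded to a whole sample."""
    return round(window_s * sampling_frequency_hz(sample_interval_s))


# (timescale, moving average window in s, scope sampling rate in s/pt) from Table 6-1
rows = [
    ("Input EDPC", 1.0, 100e-6),
    ("Input EDPP1", 50e-3, 100e-6),
    ("Input EDPP1", 50e-3, 40e-6),
    ("Input EDPP2", 400e-6, 100e-6),
    ("Other", 20e-3, 100e-6),
]
for name, window_s, interval_s in rows:
    print(name, samples_per_window(window_s, interval_s))

# A 1500 s record at 100 µs/pt holds this many samples per channel
print(samples_per_window(1500.0, 100e-6))
```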

### 2. Data Extraction and Scaling:

- a. Extracts specific data segments based on calculated start times, scales, and offsets.
- b. Processes each loaded waveform for each channel (current\_ch1, current\_ch2, voltage\_ch1, and so on) to fit within the desired time span and resolution.
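Step 2 can be sketched as follows. The linear scaling convention `value = raw * y_scale + y_offset`, the argument names, and the assumption that each waveform file has already been parsed into raw sample codes plus a start time and sample interval are all hypothetical; the actual scaling fields depend on the oscilloscope vendor's .wfm format and the reader used.

```python
import numpy as np


def extract_segment(raw, sample_interval_s, t0_s, start_s, stop_s, y_scale, y_offset):
    """Return (times, scaled values) for samples between start_s and stop_s."""
    raw = np.asarray(raw)
    # Convert the requested time span into sample indices, clamped to the record
    i0 = max(0, int(round((start_s - t0_s) / sample_interval_s)))
    i1 = min(len(raw), int(round((stop_s - t0_s) / sample_interval_s)))
    # Hypothetical linear scaling from raw codes to volts or amps
    values = raw[i0:i1].astype(float) * y_scale + y_offset
    times = t0_s + np.arange(i0, i1) * sample_interval_s
    return times, values


# Hypothetical record: 100 raw codes at 10 ms/pt; keep 0.1 s to 0.2 s
times, values = extract_segment(np.arange(100), 0.01, 0.0, 0.1, 0.2, 2.0, 1.0)
```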

### 3. Instantaneous Power Calculation:

- a. Computes instantaneous AC power for each waveform by multiplying voltage and current measured on input AC lines (for example, Instantaneous AC power = AC Voltage \* AC Current).
- b. Calculates the total instantaneous AC power by summing all individual power values from each phase.
- c. Computes instantaneous DC power from DC voltage and current data.
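Step 3 reduces to element-wise multiplication followed by a sum across phases. The sketch below assumes the voltage and current records are already time-aligned and scaled (as in step 2), with one row per AC phase or per DC channel; the function name and sample values are hypothetical. The same function covers DC power by passing a single row.

```python
import numpy as np


def instantaneous_power(voltages, currents):
    """Return (per-channel instantaneous power, total across channels).

    voltages, currents: arrays shaped (n_channels, n_samples), time-aligned.
    """
    v = np.asarray(voltages, dtype=float)
    i = np.asarray(currents, dtype=float)
    per_channel = v * i              # p(t) = v(t) * i(t) for each channel
    total = per_channel.sum(axis=0)  # sum across phases/channels at each sample
    return per_channel, total


# Two hypothetical phases, three samples each
per_phase, total_ac = instantaneous_power([[100, 0, -100], [50, 50, 50]],
                                          [[2, 1, -2], [1, 1, 1]])
```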

### 4. AC Apparent Power Analysis for AC Power and Peak Calculations:

- a. Calculates the apparent power (RMS voltage multiplied by RMS current) for each AC voltage/current channel and stores its value.
- b. Determines the maximum apparent power, computed over windows of one AC cycle (1/60 Hz, about 16.67 ms), for peak apparent power analysis.
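Step 4 can be sketched by splitting each AC channel into one-cycle windows and taking S = V_rms × I_rms per window. Assumptions: a 60 Hz line, a sampling frequency close to an integer multiple of the line frequency (at 10 kHz one cycle is about 166.7 samples, so rounding to 167 samples introduces a small error), and hypothetical function and variable names.

```python
import numpy as np


def per_cycle_apparent_power(v, i, fs_hz, line_hz=60.0):
    """Apparent power S = Vrms * Irms for each complete AC cycle, and its maximum."""
    n = int(round(fs_hz / line_hz))           # samples per AC cycle
    cycles = len(v) // n
    v = np.asarray(v, dtype=float)[: cycles * n].reshape(cycles, n)
    i = np.asarray(i, dtype=float)[: cycles * n].reshape(cycles, n)
    v_rms = np.sqrt(np.mean(v ** 2, axis=1))  # RMS voltage per cycle
    i_rms = np.sqrt(np.mean(i ** 2, axis=1))  # RMS current per cycle
    s = v_rms * i_rms
    return s, s.max()


# Synthetic check: 230 V rms and 10 A rms sines at 6 kHz (100 samples per cycle)
fs = 6_000.0
t = np.arange(600) / fs
v = 230 * np.sqrt(2) * np.sin(2 * np.pi * 60 * t)
i = 10 * np.sqrt(2) * np.sin(2 * np.pi * 60 * t)
s, s_max = per_cycle_apparent_power(v, i, fs)
```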

### 5. DC Power Analysis with Input EDPp Moving Average Windows:

- a. For DC power analysis, performs a moving average input EDPp calculation on the DC voltage and currents over each EDP time interval (400 µs, 50 ms, and 1 second).
- b. Slides a window of samples across the captured waveform values and averages the values within each window position. Refer to Table 6-1 for the number of samples per window. A simple moving average algorithm can be implemented as follows:

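```python
import numpy as np


def simple_moving_average(data, window_size=500):
    """Average each sliding window of `window_size` consecutive samples.

    window_size = 500 is the 50 ms Input EDPP1 window at 100 µs/pt (Table 6-1).
    Returns len(data) - window_size + 1 averages (empty if data is shorter).
    """
    data = np.asarray(data, dtype=float)

    # Initialize an empty list to store the moving averages
    averages = []

    # Calculate the moving average manually, one window position at a time
    for i in range(len(data) - window_size + 1):
        window = data[i:i + window_size]
        averages.append(np.mean(window))

    # Convert the list to a numpy array
    return np.array(averages)
```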

- c. DC input EDPp values should be less than the value provided in the product specification.
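For long records (a 1500 s capture at 100 µs/pt holds 15 million samples), a loop that re-averages every window is slow. An equivalent O(N) formulation uses a cumulative sum: each window mean is the difference of two running totals divided by the window size. The sketch below, with hypothetical names, applies it for each EDP window length and compares each peak against a limit; the limits themselves must come from the product specification. Cumulative sums over very long float64 records can accumulate small rounding error, so spot-check against a direct average if exactness matters.

```python
import numpy as np


def moving_average_cumsum(data, window_size: int) -> np.ndarray:
    """Sliding-window mean via cumulative sums (one value per window position)."""
    x = np.asarray(data, dtype=float)
    if window_size < 1 or window_size > len(x):
        return np.empty(0)
    c = np.concatenate(([0.0], np.cumsum(x)))
    return (c[window_size:] - c[:-window_size]) / window_size


def edpp_peaks(dc_power, fs_hz: float, windows_s) -> dict:
    """Peak moving average DC power for each EDP window length (in seconds)."""
    peaks = {}
    for w in windows_s:
        avg = moving_average_cumsum(dc_power, round(w * fs_hz))
        peaks[w] = float(avg.max()) if avg.size else float("nan")
    return peaks


def check_edpp(peaks: dict, limits: dict) -> dict:
    """True where the peak is below the corresponding specification limit."""
    return {w: peaks[w] < limits[w] for w in peaks}


power = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(moving_average_cumsum(power, 2))  # [1.5 2.5 3.5 4.5]
```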

### 6. Plotting and Data Visualization:

- a. Creates a figure with multiple subplots, each showing data such as apparent power and DC power over time.
- b. Uses the duty cycle and frequency parameters to title the plots, showing system information.
- c. Provides a visualization of apparent power across multiple power supply units, a DC power plot, and related data.

### 7. Summary Calculations:

- a. Finds the peak-to-peak voltage and current for DC power and the maximum values for apparent power and moving averages of DC power.
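The step 7 summary values map directly onto NumPy reductions. The sketch below assumes the arrays produced in the earlier steps (DC voltage and current, per-cycle apparent power, and a dictionary of DC moving averages keyed by window length); the function name, dictionary keys, and input values are hypothetical placeholders.

```python
import numpy as np


def summarize(dc_voltage, dc_current, apparent_power, dc_moving_averages):
    """Collect the step 7 summary values into a dictionary."""
    return {
        "dc_voltage_pk_pk": float(np.ptp(dc_voltage)),  # max - min
        "dc_current_pk_pk": float(np.ptp(dc_current)),
        "max_apparent_power": float(np.max(apparent_power)),
        "max_dc_moving_average": {
            window: float(np.max(avg)) for window, avg in dc_moving_averages.items()
        },
    }


# Placeholder inputs for illustration only
summary = summarize(
    dc_voltage=[53.8, 54.1, 53.5],
    dc_current=[10.0, 42.0, 25.0],
    apparent_power=[2300.0, 2450.0],
    dc_moving_averages={"50 ms": [900.0, 1200.0], "1 s": [850.0, 870.0]},
)
```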


---

# Chapter 7. Appendix 2: Validation Leverage Guidance for GB300 NVL72 Systems

This chapter outlines the test items that can be leveraged from the GB200 NVL72 PVP test plan as part of the overall GB300 NVL72 qualification. If partners have performed the full PVP validation on the GB200 NVL72 system, some of these test items can be leveraged from the GB200 results and deemed “Optional” for the GB300 NVL72 qualification.

As part of the PVP leverage guidance, a spreadsheet with a list of PVP test items that can be leveraged from the GB200 NVL72 PVP is attached to this validation guide.

The attached file name is “GB\_NVL72\_PVP\_Validation\_Leverage\_Guidance.nvzip.” Download the .nvzip file to a local directory, change the file extension from .nvzip to .zip, and unzip the file to extract the contents. See Section 1.4 for download details.
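The rename-and-extract steps can also be scripted. The sketch below uses only the Python standard library; the function name and destination directory are hypothetical. Python's `zipfile` identifies an archive by its contents rather than its extension, so the rename here only mirrors the documented procedure for tools that depend on the .zip extension.

```python
import zipfile
from pathlib import Path


def extract_nvzip(nvzip_path, dest_dir):
    """Rename a downloaded .nvzip file to .zip, extract it, and list its members."""
    src = Path(nvzip_path)
    zip_path = src.with_suffix(".zip")  # ...Guidance.nvzip -> ...Guidance.zip
    src.rename(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)         # only extract archives from trusted sources
        return zf.namelist()


# Example (run in the download directory):
# extract_nvzip("GB_NVL72_PVP_Validation_Leverage_Guidance.nvzip", "leverage_guidance")
```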

## **Notice**

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation ("NVIDIA") makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete. NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer ("Terms of Sale"). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer's own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer's sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer's product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA's aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

## **Trademarks**

NVIDIA, the NVIDIA logo, BlueField, ConnectX, NVIDIA Grace, NVIDIA MGX, and NVLink are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

## **Arm**

Arm, AMBA, and ARM Powered are registered trademarks of Arm Limited. Cortex, MPCore, and Mali are trademarks of Arm Limited. All other brands or product names are the property of their respective holders. "Arm" is used to represent ARM Holdings plc; its operating company Arm Limited; and the regional subsidiaries Arm Inc.; Arm KK; Arm Korea Limited.; Arm Taiwan Limited; Arm France SAS; Arm Consulting (Shanghai) Co. Ltd.; Arm Germany GmbH; Arm Embedded Technologies Pvt. Ltd.; Arm Norway, AS, and Arm Sweden AB.

## **Copyright**

© 2024, 2025 NVIDIA Corporation. All rights reserved.