Florian Forster edited this page Nov 26, 2023 · 1 revision

Name:

PCIe_Errors plugin

Type:

read

Callbacks:

init, config, read, shutdown

Status:

supported

FirstVersion:

5.10

Copyright:

2019 Intel Corporation

License:

Manpage:

See also:

List of Plugins

The purpose of this plugin is to monitor and report PCI Express errors. PCIe defines two mechanisms for error handling. The first is the baseline capability, which is mandatory for every PCIe device but provides only limited information because it distinguishes just four error types. It resides in the Device Status register of the PCI Express capability structure. The second is the Advanced Error Reporting (AER) extended capability, which can provide detailed information about the errors raised on a device. AER is optional, so not every device provides this extended information.
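
To make the baseline mechanism concrete, the following is a minimal sketch of decoding the four error flags from a Device Status register value. The bit positions follow the PCI Express capability layout; the function and constant names are hypothetical and are not taken from the plugin source.

```python
from typing import List

# Bit positions of the four baseline error flags in the 16-bit
# Device Status register of the PCI Express capability structure.
DEVICE_STATUS_FLAGS = {
    0: "Correctable Error Detected",
    1: "Non-Fatal Error Detected",
    2: "Fatal Error Detected",
    3: "Unsupported Request Detected",
}


def decode_device_status(value: int) -> List[str]:
    """Return the names of the baseline error flags set in `value`."""
    return [
        name
        for bit, name in DEVICE_STATUS_FLAGS.items()
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # 0x0009 has bits 0 and 3 set.
    print(decode_device_status(0x0009))
```

The baseline flags only say that an error of a given class was detected; identifying the specific error requires the AER registers, when the device implements them.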

Synopsis

See PCIe Errors High Level Design.

 <Plugin pcie_errors>
        Source "sysfs"
        AccessDir "/sys/bus/pci"
        ReportMasked false
        PersistentNotifications false
        FirstFullRead false
        LogFile "/var/log/syslog"
        <MsgPattern "AER">
                <Match>
                        Name "aer error"
                        Regex "AER:.*error received"
                        SubmatchIdx -1
                </Match>
                <Match>
                        Name "incident time"
                        Regex "(... .. ..:..:..) .* pcieport.*AER"
                        IsMandatory false
                </Match>
                <Match>
                        Name "root port"
                        Regex "pcieport (.*): AER:"
                </Match>
                <Match>
                        Name "device"
                        Regex " ([0-9a-fA-F:\\.]*): PCIe Bus Error"
                </Match>
                <Match>
                        Name "severity"
                        Regex "severity=([^,]*)"
                </Match>
                <Match>
                        Name "error type"
                        Regex "type=(.*),"
                        IsMandatory false
                </Match>
                <Match>
                        Name "id"
                        Regex ", id=(.*)"
                </Match>
        </MsgPattern>
 </Plugin>
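
The Match patterns above can be tried outside collectd. The sketch below applies the same regular expressions to a log excerpt with Python's `re` module. The sample log text is invented for illustration, and the way the fields are collected here is an assumption: it does not reproduce how the plugin's log parser handles `SubmatchIdx` or `IsMandatory`.

```python
import re

# Patterns copied from the <Match> blocks above. The "\\." written in the
# collectd config corresponds to "\." in a Python raw string.
PATTERNS = {
    "aer error": r"AER:.*error received",
    "incident time": r"(... .. ..:..:..) .* pcieport.*AER",
    "root port": r"pcieport (.*): AER:",
    "device": r" ([0-9a-fA-F:\.]*): PCIe Bus Error",
    "severity": r"severity=([^,]*)",
    "error type": r"type=(.*),",
    "id": r", id=(.*)",
}

# Hypothetical syslog excerpt, made up for this example.
SAMPLE_LOG = "\n".join([
    "Nov 26 10:15:01 host kernel: pcieport 0000:00:1c.5: AER: "
    "Corrected error received: id=00e5",
    "Nov 26 10:15:01 host kernel: pcieport 0000:00:1c.5: PCIe Bus Error: "
    "severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)",
])


def extract_fields(text: str) -> dict:
    """Run every pattern against `text` and collect the first match of each."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            continue
        # Use the first capture group when the pattern has one,
        # otherwise the whole matched text.
        fields[name] = match.group(1) if match.groups() else match.group(0)
    return fields


if __name__ == "__main__":
    for key, value in extract_fields(SAMPLE_LOG).items():
        print(f"{key}: {value}")
```

Testing patterns this way before editing the configuration helps catch greedy matches, for example `type=(.*),` capturing more than intended when a line contains several commas.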

Parameters

needs to be added

Metrics


Metric/Feature/Input


Name


Data Type


Format Example


Description


Dependencies


Limitations


Comments



PCIe AER Plugin


-


-


Plugin to provide PCIe AER metrics, errors, notifications & device information


Depends on sysfs and proc file systems


To be used on little endian systems.



Feature


Device Domain


Hex


10


The PCI address domain consisting of three distinct address spaces: configuration, memory, and I/O space.


None




Feature


Device Bus


Hex


10


PCIe Bus number


None




Feature


Device ID


Hex


3597


PCIe Device ID of the device


None




Feature


Device Function


Hex


10


Bus:Device.Function notation used to succinctly describe PCI and PCIe devices


None




Feature


Instance Type


Text


correctable/uncorrectable


PCIe instance type


None




Feature


Severity


Text


Fatal/Non-fatal


Severity flag indicating whether an uncorrectable error is fatal or non-fatal


None




Feature


Persistent Notification


Text


True/False


If an uncorrectable error has already been reported once, the persistent flag is set in the plugin and the error is not reported again


None




Metric


Uncorrectable Error


Text


uncorrectable


Errors that do not affect the integrity of the PCI Express fabric but cause data or information to be lost. Non-fatal errors are corrupted transactions that cannot be corrected by PCIe hardware.

However, the PCI Express fabric continues to function correctly and other transactions are unaffected; only the particular transaction is affected. Recovery from a non-fatal error may or may not be possible, depending on the device-specific software associated with the requester that initiated the transaction.


None




Metric


Correctable Error


Text


correctable


Errors that may affect performance (for example latency or bandwidth) but cause no loss of data or information; the PCIe fabric remains reliable. Such errors are corrected by hardware and no software intervention is required.


None




Metric


Severity Non-Fatal Error


Text


non_fatal


Error severity indicating that no reboot is necessary


None




Metric


Severity Fatal Error


Text


fatal


Error severity indicating that a reboot is necessary


None




Metric


Unsupported Request


Text


unsupported


This error occurs when an endpoint or a root port receives any of a set of transactions defined by the PCIe specification in [1]. In all cases the TLP is deleted in the Hard IP block and not presented to the Application Layer. If the TLP is a non-posted request, the Hard IP block generates a completion with Unsupported Request status.


Depends on what's exposed in sysfs and proc file systems




Metric


Data Link Protocol Uncorrected Error


Text


Data Link Protocol


This error occurs when a sequence number specified by the Ack/Nak block in the Data Link Layer (AckNak_Seq_Num) does not correspond to an unacknowledged TLP.


Depends on what's exposed in sysfs and proc file systems




Metric


Surprise Down Uncorrected Error


Text


Surprise Down


This error occurs when the PCIe device goes down without notice.


Depends on what's exposed in sysfs and proc file systems




Metric


Poisoned TLP Uncorrected Error


Text


Poisoned TLP


Whenever a poisoned TLP is destined for a PCIe device, the IIO module drops the poisoned data packet, contains the error in the domain in which it was detected, brings down the link, and signals a fatal error to SW/FW.


Depends on what's exposed in sysfs and proc file systems




Metric


Flow Control Protocol Uncorrected Error


Text


Flow Control Protocol


An uncorrected error in the flow control protocol, found in the Transaction Layer, that prevents flow control credit transactions from being sent. This error occurs when a component does not receive update flow control credits within the 200 µs limit.


Depends on what's exposed in sysfs and proc file systems




Metric


Completion Timeout Uncorrected Error


Text


Completion Timeout


This error occurs when a request originating from the Application Layer does not generate a corresponding completion TLP within the established time. It is the responsibility of the Application Layer logic to provide the completion timeout mechanism. The completion timeout should be reported from the Transaction Layer using the cpl_err[0] signal.


Depends on what's exposed in sysfs and proc file systems




Metric


Completer Abort Uncorrected Error


Text


Completer Abort


The Application Layer reports this error using the cpl_err[2] signal when it aborts receipt of a TLP.


Depends on what's exposed in sysfs and proc file systems




Metric


Unexpected Completion Uncorrected Error


Text


Unexpected Completion


This error is caused by an unexpected completion transaction as listed in [1]. The TLP is not presented to the Application Layer; the Hard IP block deletes it.


Depends on what's exposed in sysfs and proc file systems




Metric


Receiver Overflow Uncorrected Error


Text


Receiver Overflow


This error occurs when a component receives a TLP that violates the FC credits allocated for this type of TLP. In all cases the hard IP block deletes the TLP and it is not presented to the Application Layer.


Depends on what's exposed in sysfs and proc file systems




Metric


Malformed TLP Uncorrected Error


Text


Malformed TLP


This error occurs when a received TLP violates the packet formation rules of the PCIe specification, as listed in [1]. The TLP is not presented to the Application Layer; the Hard IP block deletes it.


Depends on what's exposed in sysfs and proc file systems




Metric


ECRC Uncorrected Error Status


Text


ECRC


ECRC ensures end-to-end data integrity for systems that require high reliability. When the ECRC generation option is turned on, errors are detected when receiving TLPs with a bad ECRC. More details in [2]


Depends on what's exposed in sysfs and proc file systems




Metric


Unsupported Request Uncorrected Error


Text


Unsupported


This error occurs when an endpoint or a root port receives any of a set of transactions defined by the PCIe specification in [1]. The TLP is not presented to the Application Layer; the Hard IP block deletes it.


Depends on what's exposed in sysfs and proc file systems




Metric


ACS Violation Uncorrected Error


Text


ACS Violation


Violation in Access Control Services. More details in [3]


Depends on what's exposed in sysfs and proc file systems




Metric


Internal Uncorrected Error


Text


Internal Uncorrected


An error associated with a PCI Express interface that occurs within a component and which may not be attributable to a packet or event on the PCI Express interface itself or on behalf of transactions initiated on PCI Express. More details in [4]


Depends on what's exposed in sysfs and proc file systems




Metric


MC Blocked TLP Uncorrected Error


Text


MC Blocked TLP


An error with Multicast TLP processing. More details in [5]


Depends on what's exposed in sysfs and proc file systems




Metric


Atomic Egress Blocked Uncorrected Error


Text


Atomic Egress Blocked


Error with setting AtomicOp Egress Blocking bit. More details in [6]


Depends on what's exposed in sysfs and proc file systems




Metric


TLP Prefix Blocked Uncorrected Error


Text


TLP Prefix Blocked


The TLP Prefix mechanism extends the header size by adding DWORDS to the front of headers that carry additional information. The uncorrected error reflects failure in the process. More details in [7]


Depends on what's exposed in sysfs and proc file systems




Metric


Receiver Error Status Corrected Error


Text


Receiver Error Status


Receiver error at PCIe physical layer


Depends on what's exposed in sysfs and proc file systems




Metric


Bad TLP Status Corrected Error


Text


Bad TLP Status


This error occurs when a LCRC verification fails or when a sequence number error occurs.


Depends on what's exposed in sysfs and proc file systems




Metric


Bad DLLP Status Corrected Error


Text


Bad DLLP Status


This error occurs when a CRC verification fails.


Depends on what's exposed in sysfs and proc file systems




Metric


Replay NUM Rollover Corrected Error


Text


Replay NUM Rollover


This error occurs when the replay number rolls over.


Depends on what's exposed in sysfs and proc file systems




Metric


Replay Timer Timeout Corrected Error


Text


Replay Timer Timeout


This error occurs when the replay timer times out


Depends on what's exposed in sysfs and proc file systems




Metric


Advisory Non-Fatal Corrected Error


Text


Advisory Non-Fatal


These errors are reported and signaled as ERR_COR, ERR_NONFATAL, or ERR_FATAL, or are not signaled at all, depending on the role of the agent that detects the error and whether the agent implements AER in an advisory capacity to the application. More details in [8]


Depends on what's exposed in sysfs and proc file systems




Metric


Corrected Internal Corrected Error


Text


Corrected Internal


An error associated with a PCI Express interface that occurs within a component and which may not be attributable to a packet or event on the PCI Express interface itself or on behalf of transactions initiated on PCI Express. More details in [4]


Depends on what's exposed in sysfs and proc file systems




Metric


Header Log Overflow Corrected Error


Text


Header Log Overflow


When a header is logged, the header is that of the first TLP that was lost or corrupted by the Uncorrectable Internal Error. More details in [9]


Depends on what's exposed in sysfs and proc file systems
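
The error types listed in the table correspond to individual bits of the AER Uncorrectable Error Status and Correctable Error Status registers. The sketch below decodes raw register values into the names used in the Format Example column. The bit positions follow the AER register layout of the PCI Express Base Specification; the function and table names are hypothetical, and whether the plugin uses an equivalent lookup internally is an assumption.

```python
from typing import Dict, List

# Bit positions in the AER Correctable Error Status register.
CORRECTABLE_BITS: Dict[int, str] = {
    0: "Receiver Error Status",
    6: "Bad TLP Status",
    7: "Bad DLLP Status",
    8: "Replay NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal",
    14: "Corrected Internal",
    15: "Header Log Overflow",
}

# Bit positions in the AER Uncorrectable Error Status register.
UNCORRECTABLE_BITS: Dict[int, str] = {
    4: "Data Link Protocol",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC",
    20: "Unsupported",
    21: "ACS Violation",
    22: "Internal Uncorrected",
    23: "MC Blocked TLP",
    24: "Atomic Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_status(value: int, table: Dict[int, str]) -> List[str]:
    """Return the names of all error bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Bits 0 and 6 set: a receiver error and a bad TLP.
    print(decode_status(0x41, CORRECTABLE_BITS))
    # Bits 14 and 20 set: completion timeout and unsupported request.
    print(decode_status((1 << 14) | (1 << 20), UNCORRECTABLE_BITS))
```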



Example Graph

None yet. Add one now!

Dependencies

Depends on sysfs and proc file systems
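
On Linux, devices under the sysfs PCI tree are named in domain:bus:device.function notation, for example `0000:00:1c.5`. As a rough illustration of how the domain, bus, device, and function values reported by the plugin relate to such a name, the sketch below splits an address string into its numeric parts. The class and function names are hypothetical and not part of the plugin.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


# domain:bus:device.function, all fields hexadecimal.
_ADDRESS_RE = re.compile(
    r"^([0-9a-fA-F]{4,}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_pci_address(name: str) -> PciAddress:
    """Split a 'dddd:bb:dd.f' string into integer fields."""
    match = _ADDRESS_RE.match(name)
    if match is None:
        raise ValueError(f"not a PCI address: {name!r}")
    domain, bus, device, function = (int(part, 16) for part in match.groups())
    return PciAddress(domain, bus, device, function)


if __name__ == "__main__":
    print(parse_pci_address("0000:00:1c.5"))
```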

Caveats

History

See also
