Florian Forster edited this page Nov 26, 2023 · 1 revision

Name: Mcelog plugin
Type: read
Callbacks: config, init, read, shutdown
Status: supported
FirstVersion: 5.8
Copyright: 2016–2017 Intel Corporation
License:
Manpage: collectd.conf(5)
See also: List of Plugins

The mcelog plugin sends notifications and statistics about Machine Check Exceptions (MCEs) when they occur. It relies on the mcelog Linux utility to detect that an exception has occurred. mcelog supports a client-server model and handles the logging and accounting of exceptions; the plugin uses mcelog's client protocol to learn about them. The goal of this feature is to expose the Reliability, Availability and Serviceability (RAS) metrics and events provided by the platform to higher-level fault management applications. The plugin does the following:

  • Checks mcelog server liveness and reports a failure if the server is not running or if it fails.
  • Retrieves the aggregated corrected and uncorrected memory error counts through the client protocol and submits them as events and stats.
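
The liveness check in the first item can be approximated outside collectd by trying to connect to the mcelog client socket. The following is a minimal sketch, not the plugin's actual implementation: the function name `mcelog_is_alive` and the one-second timeout are assumptions made for illustration, and the default path is the one shown in the Synopsis.

```python
import socket

# Default path taken from the Synopsis; adjust it if mcelog is configured differently.
DEFAULT_SOCKET = "/var/run/mcelog-client"


def mcelog_is_alive(path: str = DEFAULT_SOCKET, timeout: float = 1.0) -> bool:
    """Return True if something accepts connections on the Unix socket at `path`.

    Hypothetical helper: it only tests reachability and does not speak the
    mcelog client protocol.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
        except OSError:
            # Covers a missing socket file, a refused connection,
            # a timeout, and permission errors.
            return False
    return True


if __name__ == "__main__":
    state = "reachable" if mcelog_is_alive() else "not reachable"
    print(f"mcelog client socket is {state}")
```

A successful connection only shows that a process is listening on the socket; it does not prove that the listener is a healthy mcelog daemon.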

Mcelog must be configured to run on the platform in daemon mode and logging capabilities must be enabled. For a full description of available options please refer to the collectd.conf(5) manual page.

Synopsis

  <Plugin mcelog>
    <Memory>
      McelogClientSocket "/var/run/mcelog-client"
      PersistentNotification false
    </Memory>
  </Plugin>

The synopsis will change as follows once the branch "feat_mcelog_mem_notification_level" is merged (for now, if everything is commented out, the plugin defaults to the socket):

 # <Plugin mcelog>
 #   <Memory>
 #     McelogClientSocket "/var/run/mcelog-client"
 #     PersistentNotification false
 #   </Memory>
 #   McelogLogfile "/var/log/mcelog"
 # </Plugin>
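
The effect of PersistentNotification can be pictured with a small sketch. It assumes the semantics described in collectd.conf(5) as understood here: with the default of false, a notification is sent only when the number of memory errors changes, and with true, one is sent on every read cycle. The names `ErrorCounts` and `should_notify` are hypothetical and do not appear in the plugin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCounts:
    """Error totals seen in one read cycle (hypothetical container)."""
    corrected: int
    uncorrected: int


def should_notify(previous: ErrorCounts, current: ErrorCounts,
                  persistent: bool = False) -> bool:
    """Decide whether a read cycle should produce a notification.

    Assumed behaviour: persistent=True notifies on every cycle,
    persistent=False notifies only when a counter has changed.
    """
    return persistent or current != previous


if __name__ == "__main__":
    before = ErrorCounts(corrected=3, uncorrected=0)
    after = ErrorCounts(corrected=4, uncorrected=0)
    print(should_notify(before, before))                   # unchanged counters
    print(should_notify(before, after))                    # a counter changed
    print(should_notify(before, before, persistent=True))  # persistent mode
```

In this model the flag affects notifications only; the counters themselves would still be dispatched as stats on every read.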

Parameters

None yet

Metrics


| Metric/Feature Name | Data Type | Format Example | Internal Collectd Version | Description | Dependencies | Limitations | Comments |
|---|---|---|---|---|---|---|---|
| Memory corrected errors | Int | 51522 | None | Number of corrected memory errors since system boot | | | Gets metrics from the mcelog daemon. |
| Memory corrected errors in 24 Hours | Int | 51522 | None | Number of corrected memory errors in the previous 24 hours | | | Gets metrics from the mcelog daemon. |
| Memory Uncorrected errors | Int | 51522 | None | Number of uncorrected memory errors since system boot | | | Gets metrics from the mcelog daemon. |
| Memory Uncorrected errors in 24 Hours | Int | 51522 | None | Number of uncorrected memory errors in the previous 24 hours | | | Gets metrics from the mcelog daemon. |
| Socket | Int | 0 | None | Socket number the error occurred on | | | Gets metrics from the mcelog daemon. |
| Channel | Char | 0 | None | Memory channel; each channel represents a DIMM module | | | Gets metrics from the mcelog daemon. |
| Memory DIMM | Char | B1 | None | Memory DIMM corresponding to the memory used by the cores on which errors occurred | | | Gets metrics from the mcelog daemon. |
| Memory Slot | Char | 1 | None | Memory slot corresponding to the memory used by the cores on which errors occurred | | | Gets metrics from the mcelog daemon. |
| CPU ID | Int | 0 | Future | CPU ID of the cores on which errors occurred. Will be added to new EDAC plugin | | | |
| Memory Page | Hex | 0x12345 | Future | Memory page corresponding to the memory used by the cores on which errors occurred. Will be added to new EDAC plugin | | | Not part of Collectd. Currently available with kernel EDAC logs |
| Memory Offset | Hex | 0x0 | Future | Memory offset in the page. Will be added to new EDAC plugin | | | Not part of Collectd. Currently available with kernel EDAC logs |
| Memory Row | Hex | 0x12345 | | | | | Not part of Collectd. Currently available with kernel EDAC logs |
| Memory Grain | Int | 8 | Future | The byte granularity or the error grain. Will be added to new EDAC plugin | | | Not part of Collectd. Currently available with kernel EDAC logs |
| Error Syndrome | Hex | 0x6ce3 | Future | Memory syndrome corresponding to the memory used by the cores on which errors occurred. Will be added to new EDAC plugin | | | Not part of Collectd. Currently available with kernel EDAC logs |
| Error Type | Text | | Future | Error type. Will be added to new EDAC plugin | | | Not part of Collectd. Currently available with kernel EDAC logs |
| Error code | Integer | 0101:0090 | Future | Error code put out by EDAC. Will be added to new EDAC plugin | | | Not part of Collectd. Currently available with kernel EDAC logs |
| Logging | Log path | | | Configurable logging path | | | Not part of Collectd. Currently available with kernel EDAC logs |
| dimmX or rankX directory info | Varying | | Future | Expose interface files provided by sysfs through mcX/dimmX or rankX directories | | | Not part of Collectd. Currently available with kernel EDAC logs |
| csrowX directory info | Varying | | Future | Expose interface files provided by sysfs through mcX/csrowX directories | | | Not part of Collectd. Currently available with kernel EDAC logs |
| RAS interrupts | Count on each core | [CoreID]:[InterruptCount] | Future | Expose the RAS-related interrupts on cores of interest via Collectd | | | Discussion open to see if this info can be exposed through the plugin. |
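
A consumer of these metrics might group the first eight rows of the table (those whose comment says the values come from the mcelog daemon) into one record per memory location. The sketch below is illustrative only: the class `MemoryErrorRecord` and its field names are assumptions and not identifiers used by collectd, and the sample values are copied from the Format Example column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryErrorRecord:
    """One memory location and its error counters (hypothetical layout)."""
    socket: int                  # Socket: Int
    channel: str                 # Channel: Char
    dimm: str                    # Memory DIMM: Char
    slot: str                    # Memory Slot: Char
    corrected_errors: int        # since system boot
    corrected_errors_24h: int    # previous 24 hours
    uncorrected_errors: int      # since system boot
    uncorrected_errors_24h: int  # previous 24 hours

    def location(self) -> str:
        """Human-readable location of the affected memory."""
        return (f"socket {self.socket}, channel {self.channel}, "
                f"DIMM {self.dimm}, slot {self.slot}")

    def total_errors(self) -> int:
        """Corrected plus uncorrected errors since system boot."""
        return self.corrected_errors + self.uncorrected_errors


# Sample built from the Format Example column of the table.
sample = MemoryErrorRecord(
    socket=0, channel="0", dimm="B1", slot="1",
    corrected_errors=51522, corrected_errors_24h=51522,
    uncorrected_errors=51522, uncorrected_errors_24h=51522,
)

if __name__ == "__main__":
    print(sample.location())
    print(sample.total_errors())
```

Keeping the location fields as strings where the table says Char avoids assuming that DIMM and slot labels are always numeric.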

Example Graph

None yet. Add one now!

Dependencies

Also See

RAS/mcelog Plugin High Level Design Tests Executed
