Disk and filesystem error metrics #3005

Open
Sandelinos opened this issue Apr 26, 2024 · 4 comments · May be fixed by #3047
Labels
enhancement platform/Linux Linux specific issue

Comments

@Sandelinos

I recently had a disk fail on a system, and I only found out about it from errors in dmesg (blk_update_request: critical medium error).

I wanted to set up some alerts in Prometheus so I would be notified the next time this happens, but I couldn't find any node exporter metric on the machine that indicated anything was wrong. The only disk-error-related metric I found is node_filesystem_device_error, which only reports whether the statfs syscall returned an error.

I went digging around in sysfs on the machine and found data about ext4 filesystem errors in these files:

  • /sys/fs/ext4/<partition>/errors_count: number of ext4 errors (commit)
  • /sys/fs/ext4/<partition>/warning_count: number of ext4 warning log messages (commit)
  • /sys/fs/ext4/<partition>/msg_count: number of other ext4 log messages
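For illustration, here is a minimal Python sketch of how these three files could be read. It is not node exporter code. The function name `read_ext4_counters` and the `sysfs_root` parameter are hypothetical, and the sketch assumes each file holds a single decimal integer.

```python
from pathlib import Path

# Counter files named in the list above.
EXT4_COUNTER_FILES = ("errors_count", "warning_count", "msg_count")


def read_ext4_counters(sysfs_root="/sys/fs/ext4"):
    """Return {partition: {counter_file: value}} for each ext4 partition found.

    Assumption: every counter file contains one decimal integer.
    """
    results = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return results
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue
        counters = {}
        for name in EXT4_COUNTER_FILES:
            try:
                counters[name] = int((entry / name).read_text().strip())
            except (OSError, ValueError):
                # Missing or unreadable file: skip this counter.
                continue
        # Directories without any counter files are ignored.
        if counters:
            results[entry.name] = counters
    return results


if __name__ == "__main__":
    for partition, counters in read_ext4_counters().items():
        print(partition, counters)
```

Passing the sysfs root as a parameter keeps the function easy to exercise against a temporary directory instead of the real /sys.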

...and SCSI disk error counters in these files (the values are hexadecimal):

  • /sys/block/<disk>/device/ioerr_cnt: number of SCSI commands that completed with an error
  • /sys/block/<disk>/device/iodone_cnt: number of completed or rejected SCSI commands
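Because these values are hexadecimal, they need converting before they can be used as counters. Below is a rough Python sketch of that step, not an actual collector. The helper names `parse_hex_counter` and `read_scsi_counters` and the `sysfs_root` parameter are hypothetical, and it assumes each file holds one hex number, with or without a `0x` prefix.

```python
from pathlib import Path


def parse_hex_counter(text):
    """Parse a hexadecimal counter such as '0x3e8'; the '0x' prefix is optional."""
    return int(text.strip(), 16)


def read_scsi_counters(disk, sysfs_root="/sys/block"):
    """Return {'ioerr_cnt': int, 'iodone_cnt': int} for one disk.

    Counters that are missing or unparsable are left out of the result.
    """
    device_dir = Path(sysfs_root) / disk / "device"
    counters = {}
    for name in ("ioerr_cnt", "iodone_cnt"):
        try:
            counters[name] = parse_hex_counter((device_dir / name).read_text())
        except (OSError, ValueError):
            continue
    return counters


if __name__ == "__main__":
    print(read_scsi_counters("sda"))
```

Non-SCSI disks would simply yield an empty result here, since the files do not exist for them.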

I think node exporter should export these metrics. Maybe somewhat like this:

# HELP node_ext4_errors Number of ext4 filesystem errors.
# TYPE node_ext4_errors counter
node_ext4_errors{device="/dev/sda1"} 123
# HELP node_ext4_warnings Number of ext4 filesystem warning messages.
# TYPE node_ext4_warnings counter
node_ext4_warnings{device="/dev/sda1"} 456
# HELP node_ext4_messages Number of ext4 filesystem messages.
# TYPE node_ext4_messages counter
node_ext4_messages{device="/dev/sda1"} 78
# HELP node_disk_ioerr_total Number of SCSI commands that completed with an error.
# TYPE node_disk_ioerr_total counter
node_disk_ioerr_total{device="/dev/sda"} 1000
# HELP node_disk_iodone_total Number of completed or rejected SCSI commands.
# TYPE node_disk_iodone_total counter
node_disk_iodone_total{device="/dev/sda"} 9999
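
As a rough illustration of the output proposed above, the following Python sketch renders one counter family in the Prometheus text exposition format. The function names `escape_label_value` and `render_counter` are hypothetical, and the metric names are only the ones suggested in this issue, not existing node exporter metrics.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline for use inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family; samples maps device name to counter value."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for device, value in sorted(samples.items()):
        lines.append(f'{name}{{device="{escape_label_value(device)}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_counter(
            "node_ext4_errors",
            "Number of ext4 filesystem errors.",
            {"/dev/sda1": 123},
        ),
        end="",
    )
```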
@SuperQ added the enhancement and platform/Linux (Linux specific issue) labels on Apr 26, 2024
@sasa-tomic

This would be very useful for us as well. Any update on this?
FWIW, we are primarily interested in XFS.

@mshahzeb

mshahzeb commented Jun 3, 2024

Hi @sasa-tomic, I am currently working on a PR for this.

@Alex-wwei

Wow, that will be very cool.

@mshahzeb linked a pull request on Jun 10, 2024 that will close this issue
@mshahzeb

mshahzeb commented Jun 10, 2024

PR: #3047 - first draft

5 participants