
[WIP] What is a SEV

Suraj Subramanian edited this page Jul 24, 2023 · 1 revision

OSS CI SEV

"OSS CI SEV" is the incident response process for PyTorch OSS CI. It covers incidents that break the HUD status, trunk health, PR health, and CI infrastructure stability. The goal of the ci: sev process is to keep trunk healthy for a better developer experience.

Detecting CI SEV

Reporting CI SEV

Create an issue that clearly states the scope and the impact area of the incident. Tag the issue with the ci: sev label so that it appears on the HUD: https://hud.pytorch.org/build2/pytorch-master
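The reporting step can also be scripted. The following is a minimal sketch, assuming the issue is filed through the GitHub REST API "create an issue" endpoint. The function names, the Scope/Impact body layout, and the target repository are hypothetical and not part of the documented process; only the ci: sev label comes from this page.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"
SEV_LABEL = "ci: sev"  # label named on this page; makes the issue show up on the HUD


def build_sev_issue(title, scope, impact):
    """Build the JSON payload for a new SEV issue.

    The Scope/Impact layout is an assumption made for this sketch,
    not an official template.
    """
    body = f"## Scope\n{scope}\n\n## Impact\n{impact}\n"
    return {"title": title, "body": body, "labels": [SEV_LABEL]}


def create_sev_issue(repo, token, payload):
    """POST the payload to GitHub's 'create an issue' endpoint.

    `repo` is an "owner/name" string (assumed to be the repository that
    hosts CI issues); `token` is a personal access token that is allowed
    to create issues there.
    """
    request = urllib.request.Request(
        f"{GITHUB_API}/repos/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To file an issue, build the payload with `build_sev_issue(...)` and pass it to `create_sev_issue(...)` together with the repository name and a token. Keeping payload construction separate from the network call makes it easy to review the scope and impact text before anything is posted.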

Mitigating CI SEV (Runbook)

  • Raise awareness. The visibility of SEV events on the HUD should help tree-hugger oncalls determine whether some "test failures" are SEV or flaky-infra issues.
  • Notify the team that owns the affected tests.
  • Escalate the issue with the high priority label if necessary.
  • After the issue is resolved, simply close the issue (but don't remove the ci: sev label).
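The last two runbook steps map onto two GitHub REST API calls. The sketch below assumes the label is named exactly "high priority" and that issues are managed through the REST API; the helper names are hypothetical. Adding a label with POST keeps the labels already on the issue, and closing with PATCH sends only the new state, so the ci: sev label stays in place.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"
HIGH_PRIORITY_LABEL = "high priority"  # assumed exact label name


def escalate_request(repo, issue_number):
    """Describe the call that adds the high priority label.

    POST on the labels endpoint appends to the existing labels,
    so ci: sev is not removed.
    """
    url = f"{GITHUB_API}/repos/{repo}/issues/{issue_number}/labels"
    return "POST", url, {"labels": [HIGH_PRIORITY_LABEL]}


def close_request(repo, issue_number):
    """Describe the call that closes the issue.

    Only the state is sent; labels are omitted on purpose so that
    they are left untouched.
    """
    url = f"{GITHUB_API}/repos/{repo}/issues/{issue_number}"
    return "PATCH", url, {"state": "closed"}


def send(method, url, payload, token):
    """Send one of the requests above with a personal access token."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `send(*close_request(repo, number), token)` closes the issue once the incident is resolved, where `repo`, `number`, and `token` are supplied by the caller.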

Review Meeting
