Site Reliability Engineering (SRE) Support

This repository contains the Site Reliability Engineering (SRE) principles and guidelines for managing the Operate First services.

What is SRE?

SRE is a software engineering approach to managing operations for systems, applications, and services. We use software as a tool to manage systems, solve problems, and automate operations tasks.

Get started

If you'd like to learn SRE practices and get hands-on experience with them, but aren't sure where or how to start, let us help!

  1. Follow this link to find beginner-friendly issues.
  2. Tag yourself in the issue.
  3. Join the Slack and let us know that you're interested in helping: post a short introduction of yourself in the #support channel, along with a link to the issue you'd like to complete.

To learn more, check out the incident management procedures and the GitHub receiver setup, learn to configure Prometheus alerts, or browse the GitHub repo.