A curated list of Site Reliability and Production Engineering resources.
Updated Dec 3, 2023
Hands-on labs and code to help you learn, measure, and build using architectural best practices.
Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
A checklist for anyone practicing Site Reliability Engineering
The k6 documentation website.
Chaos Engineering Toolkit & Orchestration for Developers
A curated list of Site Reliability and Production Engineering Tools
This repository provides a design methodology and approach to building highly-reliable applications on Microsoft Azure for mission-critical workloads.
No longer maintained: Puppet module for aptly
Reliability engineering toolkit for Python - https://reliability.readthedocs.io/en/latest/
Probabilistic Risk Analysis Tool (fault tree analysis, event tree analysis, etc.)
The Chaos Toolkit core library
OpenShift Guide. Learn about the Red Hat OpenShift Container Platform, Data Science, Code Ready Containers, Podman, Buildah, and Kubernetes.
GOV.UK PaaS - Cloud Foundry
A Terraform provider for Concourse
Serverless chaos monkey for AWS (runs on AWS Lambda) ☁️ 💥
Technical documentation for GOV.UK PaaS
Terraform configuration to manage a Prometheus server running on AWS.
Cloud Foundry Prometheus metrics exporter.