Our take
Defines service level objectives, manages error budgets, and runs incident reviews to balance reliability with velocity.
Strengths
- Quantifies reliability with measurable SLOs.
- Provides a clear code-freeze mechanism through the error budget.
- Encourages blameless post-mortems to learn from incidents.
- Automates runbooks and cuts down on unnecessary alerts.
Limitations
- Requires accurate, reliable monitoring data.
- Can be hard to implement without an existing DevOps culture.
- Overly aggressive SLOs lead to tension between teams.
Use this skill to define objective stability criteria and manage the trade-offs between new features and reliability.
Avoid it if the infrastructure is still being built or if teams lack the necessary monitoring tools.
Security analysis
Safe
This skill provides guidance on SRE practices such as defining SLIs, setting SLOs, and incident management. It does not involve any executable instructions or use of tools that could compromise security. Therefore, it poses no meaningful execution risk.
No issues detected
Examples
- Help me define SLOs for our e-commerce API. We want 99.9% availability over a 30-day rolling window and latency below 200ms for 95% of requests.
- We had a production outage last night due to a database migration failure. Can you guide me through a blameless post-mortem using the 5 Whys technique and create an action plan?
- Our error budget for the month is nearly exhausted (only 0.02% remaining). Should we do a code freeze? What processes should we put in place?
name: Site Reliability Engineering
description: Define Service Level Objectives (SLOs), manage Error Budgets, and conduct Incident Reviews to balance reliability with velocity.
Site Reliability Engineering (SRE)
Goal
Treat operations as a software problem. Quantify reliability so we know exactly when to freeze deployments (reliability at risk) and when to push fast (error budget available).
When to Use
- When defining "Is it stable enough?" criteria.
- After a production outage (Post-Mortem).
- When planning on-call rotations.
Instructions
1. Define SLIs (Service Level Indicators)
What is "good"?
- Availability: Successful requests / Total requests.
- Latency: Requests faster than 200ms / Total requests.
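The two ratios above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Request` record, the function names, and the choice to count any status below 500 as "successful" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int         # HTTP status code
    latency_ms: float   # observed latency in milliseconds


def availability_sli(requests):
    """Successful requests / total requests (success assumed to mean status < 500)."""
    if not requests:
        return 1.0  # no traffic: treat as fully available (an assumption)
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Requests faster than the threshold / total requests."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Made-up sample traffic, for illustration only.
sample = [
    Request(200, 120.0),
    Request(200, 250.0),   # successful but slow
    Request(503, 90.0),    # fast but failed
    Request(200, 180.0),
]
availability = availability_sli(sample)
latency = latency_sli(sample)
```

Both functions return a fraction between 0 and 1, so they can be compared directly against an SLO target such as 0.999.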
2. Set SLOs (Service Level Objectives)
What is the target? (100% is impossible).
- Target: "99.9% of requests in 30 days are successful."
- Window: Rolling 28 or 30 days.
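To make the rolling window concrete, here is a hedged sketch that checks a target against only the events inside the window. The `(timestamp, ok)` event shape and both function names are hypothetical; a real setup would usually query a monitoring backend instead of an in-memory list.

```python
from datetime import datetime, timedelta


def success_ratio_in_window(events, now, window_days=30):
    """Success ratio over the rolling window ending at `now`.

    `events` is an assumed list of (timestamp, ok) pairs.
    Returns None when the window contains no events.
    """
    window_start = now - timedelta(days=window_days)
    outcomes = [ok for ts, ok in events if window_start < ts <= now]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


def slo_met(events, now, target=0.999, window_days=30):
    """True when the windowed success ratio reaches the target."""
    ratio = success_ratio_in_window(events, now, window_days)
    return ratio is None or ratio >= target


# Made-up data: old failures fall outside a 30-day window but inside a 60-day one.
now = datetime(2024, 3, 31)
events = (
    [(now - timedelta(days=45), False)] * 50
    + [(now - timedelta(days=10), True)] * 9995
    + [(now - timedelta(days=1), False)] * 5
)
met_30_days = slo_met(events, now, target=0.999, window_days=30)
met_60_days = slo_met(events, now, target=0.999, window_days=60)
```

The same data passes over 30 days and fails over 60, which is why the window length belongs in the SLO definition alongside the target.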
3. Manage Error Budgets
(100% - SLO) = Error Budget.
- With a 0.1% budget (a 99.9% SLO), you can be fully down for about 43 minutes in a 30-day window (0.001 × 43,200 minutes ≈ 43.2).
- Rule: If budget is exhausted -> Code Freeze. Only reliability fixes allowed.
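The budget arithmetic and the freeze rule can be sketched as follows. The function names are hypothetical, and the request-count view of the budget (allowed failures = (1 - SLO) × total requests) is one common interpretation assumed here, next to the time-based one. `Fraction` is used so that the "exactly exhausted" boundary is not blurred by floating-point rounding.

```python
from fractions import Fraction


def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime allowed by the SLO over the window."""
    budget = 1 - Fraction(str(slo))          # e.g. 1 - 0.999 = 1/1000
    return float(budget * window_days * 24 * 60)


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed = (1 - Fraction(str(slo))) * total_requests
    if allowed == 0:
        return 0.0  # a 100% SLO or no traffic leaves nothing to spend
    return float(1 - Fraction(failed_requests) / allowed)


def should_freeze(slo, total_requests, failed_requests):
    """Rule from the text: an exhausted budget means code freeze."""
    allowed = (1 - Fraction(str(slo))) * total_requests
    return failed_requests >= allowed


minutes_30d = error_budget_minutes(0.999)              # roughly 43.2
minutes_28d = error_budget_minutes(0.999, 28)          # roughly 40.3
remaining = budget_remaining(0.999, 1_000_000, 800)    # 800 of 1000 allowed failures spent
freeze = should_freeze(0.999, 1_000_000, 1000)         # budget exactly used up
```

A team might call `should_freeze` from a deploy gate so that only reliability fixes go out once it returns `True`; how that gate is wired in is outside this sketch.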
4. Incident Management
When things break:
- Detect: Alert fires.
- Respond: Acknowledge, triage, stabilize (mitigate impact).
- Analyze: Root cause analysis (5 Whys).
- Learn: Create action items to prevent recurrence.
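As an illustration of how the four stages could map onto an incident review record, the sketch below builds a file path in the post-mortems/YYYY-MM-DD-incident.md form and a skeleton to fill in. The section prompts, function names, and sample "whys" are assumptions for the example, not a required template.

```python
from datetime import date


def postmortem_path(incident_date):
    """Build the record path in the post-mortems/YYYY-MM-DD-incident.md form."""
    return f"post-mortems/{incident_date.isoformat()}-incident.md"


def postmortem_skeleton(title, incident_date, whys):
    """Return a blameless review skeleton with one section per stage."""
    lines = [
        f"# Post-mortem: {title}",
        f"Date: {incident_date.isoformat()}",
        "",
        "## Detect",
        "Which alert fired, and when?",
        "",
        "## Respond",
        "Who acknowledged, how was it triaged, what mitigated the impact?",
        "",
        "## Analyze (5 Whys)",
    ]
    # Each answer should point at a process gap, not at a person.
    lines += [f"{i}. Why? {answer}" for i, answer in enumerate(whys, start=1)]
    lines += ["", "## Learn", "- [ ] Action item (owner, due date)"]
    return "\n".join(lines) + "\n"


# Made-up incident, for illustration only.
incident_day = date(2024, 1, 15)
path = postmortem_path(incident_day)
body = postmortem_skeleton(
    "Database migration failure",
    incident_day,
    ["The migration locked a hot table.", "It was never run against production-sized data."],
)
```

Keeping the "why" answers phrased as process gaps is what keeps the record blameless.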
Constraints
✅ Do
- DO: Blameless Post-Mortems. Focus on process failure, not human error.
- DO: Automate runbooks. If you run a command twice, script it.
- DO: Measure what matters to the user (Client-side latency), not just the server.
❌ Don't
- DON'T: Alert on things you can't fix immediately.
- DON'T: Page the whole team. Page the on-call engineer.
- DON'T: Optimize reliability past the SLO (diminishing returns).
Output Format
- SLOs.md: Definitions of SLIs and targets.
- post-mortems/YYYY-MM-DD-incident.md: Incident review records.
Dependencies
devops/implementing-observability/SKILL.md