Our review

Generates operational runbooks for services, procedures, or incident response by analyzing the codebase and infrastructure. Each runbook includes step-by-step procedures, troubleshooting guides, and escalation paths.

Strengths

Produces actionable procedures with verified commands
Maps dependencies and their failure impact
Provides troubleshooting guides and escalation paths
Based on multi-track investigation (code, infrastructure, best practices)

Limitations

Requires read access to the codebase and infrastructure
May not cover every edge case or undocumented scenario
Generated commands reflect the configuration at generation time and can go stale when it changes

When to use it

Use this skill when you need to create or standardize operational documentation for an existing service, maintenance procedure, or incident response plan.

When not to use it

Avoid using it for purely theoretical services or when you lack access to the actual codebase and infrastructure, as commands and dependencies cannot be verified.

Examples

Payment service runbook

Generate a runbook for the payment-service, covering deployment, scaling, and common failure scenarios.

Database failover runbook

Create a runbook for PostgreSQL failover procedure in production, including pre-checks, steps, and rollback.

Incident response runbook

Build an incident response runbook for high latency in the API gateway, including diagnosis, mitigation, and escalation.

name: runbook
description: Generate operational runbooks for services, procedures, or incident response with step-by-step procedures, troubleshooting guides, and escalation paths
license: MIT
compatibility:
  - runtime:any
allowed-tools:
  - Read
  - Glob
  - Grep
  - Write
metadata:
  author: thoreinstein
  version: 1.0.0

Runbook

Generate operational runbooks for services, procedures, or incident response. Investigates the codebase and infrastructure to produce accurate, actionable procedures.

When to Use

Creating operational documentation for a service
Documenting deployment, scaling, or maintenance procedures
Building incident response playbooks
Standardizing operational procedures across teams

Input

Topic: Service name, operation type, or incident scenario
Scope: deployment, scaling, failover, maintenance, troubleshooting
Optional: Specific scenarios to cover
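
As a rough illustration of how these inputs might be represented, the sketch below turns a free-form request into a topic plus a list of scopes. `RunbookRequest` and `parse_request` are hypothetical names, not part of the skill; the scope keywords are the five listed above, and falling back to all scopes when none is named is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Scope keywords taken from the Input list above.
KNOWN_SCOPES = ("deployment", "scaling", "failover", "maintenance", "troubleshooting")


@dataclass
class RunbookRequest:
    """Hypothetical container for the skill's inputs."""
    topic: str
    scopes: list[str] = field(default_factory=list)
    scenarios: list[str] = field(default_factory=list)


def parse_request(topic: str, scope_text: str = "", scenarios: list[str] | None = None) -> RunbookRequest:
    """Pick known scope keywords out of free-form text (assumed default: all scopes)."""
    lowered = scope_text.lower()
    scopes = [scope for scope in KNOWN_SCOPES if scope in lowered]
    return RunbookRequest(
        topic=topic.strip(),
        scopes=scopes or list(KNOWN_SCOPES),
        scenarios=list(scenarios or []),
    )
```

Calling `parse_request("payment-service", "covering deployment, scaling")` would yield a request scoped to deployment and scaling only.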

Investigation Strategy

Launch parallel investigation tracks to gather comprehensive information:

Track 1: Codebase Exploration

Identify service entry points and configuration
Find health check endpoints
Map dependencies (databases, caches, external services)
Locate logging and metrics instrumentation
Find existing scripts or automation
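
One of these steps, finding health check endpoints, can be approximated with a plain text search. The sketch below is a minimal version under stated assumptions: routes appear as quoted string literals, follow common names such as `/health` or `/ready`, and live in files with the listed suffixes. The function name `find_health_endpoints` is hypothetical.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed convention: quoted routes such as "/health", "/healthz", "/ready", "/live".
HEALTH_ROUTE = re.compile(r"""["'](/(?:health|ready|live)z?)["']""")

# Assumed set of source file types worth scanning.
SOURCE_SUFFIXES = (".py", ".go", ".js", ".ts", ".java")


def find_health_endpoints(root: str) -> dict[str, list[str]]:
    """Map each candidate health route to the source files that mention it."""
    hits: dict[str, list[str]] = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for route in sorted(set(HEALTH_ROUTE.findall(text))):
            hits.setdefault(route, []).append(str(path.relative_to(root)))
    return hits
```

A text match only yields candidates; each route still needs to be confirmed against the router or framework configuration before it goes into a runbook.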

Track 2: Infrastructure Analysis

Review deployment manifests (Kubernetes, Terraform, etc.)
Identify scaling configuration
Map service dependencies
Find monitoring and alerting setup
Review backup and recovery procedures
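
For the scaling step, a quick scan of Kubernetes manifests can surface autoscaler bounds. The sketch below assumes manifests are `.yaml` files and reads the `minReplicas` and `maxReplicas` fields of `HorizontalPodAutoscaler` objects with regular expressions. This is a deliberate simplification: a real implementation would likely use a proper YAML parser. `scaling_config` is a hypothetical name.

```python
from __future__ import annotations

import re
from pathlib import Path

# Top-level "kind:" only; indented occurrences (e.g. scaleTargetRef.kind) are ignored.
KIND = re.compile(r"^kind:\s*(\w+)\s*$", re.MULTILINE)
REPLICAS = re.compile(r"^\s*(minReplicas|maxReplicas):\s*(\d+)\s*$", re.MULTILINE)


def scaling_config(manifest_dir: str) -> list[dict]:
    """Collect min/max replica bounds from HorizontalPodAutoscaler manifests."""
    found = []
    for path in sorted(Path(manifest_dir).rglob("*.yaml")):
        text = path.read_text(encoding="utf-8")
        # Multi-document YAML files separate documents with a '---' line.
        for doc in re.split(r"^---\s*$", text, flags=re.MULTILINE):
            kind = KIND.search(doc)
            if not kind or kind.group(1) != "HorizontalPodAutoscaler":
                continue
            bounds = {key: int(value) for key, value in REPLICAS.findall(doc)}
            found.append({"file": path.name, **bounds})
    return found
```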

Track 3: External Research

Find operational best practices for the service type
Research common failure modes
Identify industry-standard procedures
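
The three tracks above do not depend on each other, so they can run concurrently. The following is a minimal sketch of that idea using a thread pool; `run_tracks` and the shape of the track functions (each takes the topic and returns its findings) are assumptions for illustration, not the skill's actual mechanism.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_tracks(tracks: dict[str, Callable[[str], Any]], topic: str) -> dict[str, Any]:
    """Run every investigation track concurrently and collect findings by track name."""
    with ThreadPoolExecutor(max_workers=len(tracks) or 1) as pool:
        futures = {name: pool.submit(track, topic) for name, track in tracks.items()}
        # result() blocks until the track finishes and re-raises any exception it hit.
        return {name: future.result() for name, future in futures.items()}
```

Collecting results by track name keeps the findings separable, which helps when one track (for example external research) contradicts what the codebase shows.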

Output

Generate the runbook document using the template at references/templates/runbook.md.

The runbook should include:

Service overview and architecture
Dependencies with failure impact
Step-by-step procedures with actual commands
Troubleshooting guides for common issues
Escalation paths and contacts

Behavior

Parse topic to identify service and operation scope
Launch parallel investigation tracks
Extract configuration, endpoints, and dependencies from codebase
Identify common operations and failure modes
Generate step-by-step procedures with actual commands
Document troubleshooting steps and escalation paths

Constraints

Accurate: All commands must be verified against the actual codebase and infrastructure
Actionable: Every procedure must have concrete, executable steps
Complete: Include prerequisites, verification, and rollback for each procedure
Maintainable: Note dependencies that may change and require updates
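
The completeness constraint lends itself to a mechanical check. The sketch below assumes each procedure is held as a dictionary with `prerequisites`, `steps`, `verification`, and `rollback` keys; that structure and the name `missing_parts` are hypothetical.

```python
from __future__ import annotations

# Parts every procedure needs, per the "Complete" and "Actionable" constraints above.
REQUIRED_PARTS = ("prerequisites", "steps", "verification", "rollback")


def missing_parts(procedure: dict) -> list[str]:
    """Return the required parts that are absent or empty in a procedure."""
    return [part for part in REQUIRED_PARTS if not procedure.get(part)]
```

A draft deployment procedure with steps and verification but no rollback would return `["prerequisites", "rollback"]`, flagging it as incomplete before the runbook is written out.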

Example

Input: "Generate runbook for the payment-service"

Investigation:
- Found deployment at k8s/payment-service/
- Found health endpoints: /health, /ready
- Dependencies: PostgreSQL (critical), Redis (cache), Stripe API
- Scaling: HPA configured, min 3, max 10 replicas
- Alerts: Prometheus rules in monitoring/

Generated Runbook: payment-service-runbook.md

## Overview
- Service: payment-service
- Owner: payments-team
- Criticality: P1

## Dependencies
| Dependency | Type | Criticality | Failure Impact |
|------------|------|-------------|----------------|
| PostgreSQL | Database | Critical | Full outage |
| Redis | Cache | High | Degraded latency |
| Stripe API | External | Critical | Payment failures |

## Procedures

### Deployment
1. Verify no active transactions
   ```bash
   # No -it: a TTY is not needed (and breaks in scripts) when piping output to grep.
   kubectl exec deploy/payment-service -- curl -s localhost:8080/metrics | grep active_transactions
   ```
2. Apply new deployment
   ```bash
   kubectl apply -f k8s/payment-service/deployment.yaml
   ```
3. Monitor rollout
   ```bash
   kubectl rollout status deployment/payment-service
   ```

### Scaling
```bash
kubectl scale deployment payment-service --replicas=5
```

## Troubleshooting

### High Latency
Symptoms: p99 latency > 500ms

Diagnosis:
```bash
kubectl top pods -l app=payment-service
kubectl logs -l app=payment-service --tail=100 | grep -i slow
```

Resolution: Check Redis connection, scale if CPU > 80%


Begin by identifying the service or operation to document and launching investigation tracks.
