Our review

This skill helps set up monitoring, structured logging, distributed tracing, and alerting for applications and infrastructure.

Strengths

Covers the three pillars of observability (logs, metrics, traces)
Provides concrete Python code examples
Includes Golden Signals configuration for reliability

Limitations

Examples are limited to Python, no other languages
Does not cover integration with specific cloud providers
Assumes basic knowledge of observability concepts

When to use it

Use when starting observability in a new application or improving an existing stack.

When not to use it

Avoid if you already have a full observability solution and need advanced fine-tuning.

name: observability-specialist description: Use this for setting up monitoring, logging, distributed tracing, and alerting for applications and infrastructure.

Observability Specialist

You ensure systems are observable, debuggable, and reliable through metrics, logs, and traces.

When to use

"Set up monitoring for this app."
"Create an alert for high latency."
"Debug this production issue using logs."
"Implement distributed tracing."

Instructions

Structured Logging:
- Use JSON format for logs.
- Include essential fields: timestamp, level, service, trace_id, message.
- Log at appropriate levels (ERROR for faults, INFO for state changes, DEBUG for details).
Metrics:
- Track the "Golden Signals": Latency, Traffic, Errors, and Saturation.
- Use Prometheus-style metrics (Counters, Gauges, Histograms).
Tracing:
- Implement OpenTelemetry or similar for distributed tracing.
- Ensure trace context propagates across service boundaries.
Dashboards & Alerts:
- Create dashboards to visualize system health.
- Define alerts on meaningful symptoms (user error rate) rather than just internal causes (CPU high).

Examples

1. Structured Logging with JSON

import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "service": "my-service",
            "message": record.getMessage(),
            "trace_id": getattr(record, 'trace_id', None)
        }
        return json.dumps(log_data)

logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("User logged in", extra={"trace_id": "abc123"})

2. Prometheus Metrics

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_latency = Histogram('http_request_duration_seconds', 'HTTP request latency')
active_users = Gauge('active_users', 'Number of active users')

# Track metrics
@request_latency.time()
def handle_request(method, endpoint):
    # Your logic here
    time.sleep(0.1)
    request_count.labels(method=method, endpoint=endpoint, status='200').inc()

# Start metrics server
start_http_server(8000)

3. OpenTelemetry Distributed Tracing

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Setup
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

# Usage
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    # Your business logic
    with tracer.start_as_current_span("validate_payment"):
        # Payment validation logic
        pass

Observability Specialist

Recommended for

Our review

Strengths

Limitations

Security analysis

Examples

name: observability-specialist description: Use this for setting up monitoring, logging, distributed tracing, and alerting for applications and infrastructure.

Observability Specialist

When to use

Instructions

Examples

1. Structured Logging with JSON

2. Prometheus Metrics

3. OpenTelemetry Distributed Tracing

Docker Compose Architect

Incident Postmortem Writer

Runbook Creator