name: maintainx-incident-runbook
description: |
  Incident response procedures for MaintainX integration failures.
  Use when experiencing outages, investigating issues, or responding to MaintainX integration incidents.
  Trigger with phrases like "maintainx incident", "maintainx outage", "maintainx down", "maintainx emergency", "maintainx runbook".
allowed-tools: Read, Write, Edit, Bash(curl:), Bash(npm:)
version: 1.0.0
license: MIT
author: Jeremy Longshore jeremy@intentsolutions.io

MaintainX Incident Runbook

Overview

Step-by-step procedures for responding to MaintainX integration incidents, from detection through resolution.

Prerequisites

Access to monitoring dashboards
MaintainX admin credentials
On-call contact information

Incident Classification

| Severity | Definition | Response Time | Example |
|----------|------------|---------------|---------|
| SEV1 | Complete outage | 15 min | API unreachable |
| SEV2 | Major degradation | 30 min | >50% errors |
| SEV3 | Minor degradation | 2 hours | Elevated latency |
| SEV4 | Informational | Next business day | Low cache hit rate |
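
The classification table can be encoded as a small triage helper so that alerts are assigned a severity consistently. The following is a minimal sketch under stated assumptions: the `HealthSnapshot` shape, the `classifySeverity` function, and the latency and cache-hit thresholds are hypothetical and not part of any MaintainX API. Only the >50% error threshold and the response times come from the table.

```typescript
type Severity = 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';

// Hypothetical snapshot of integration health; all field names are assumptions.
interface HealthSnapshot {
  apiReachable: boolean;
  errorRate: number;     // fraction of failed requests, 0..1
  p95LatencyMs: number;
  cacheHitRate: number;  // fraction of cache hits, 0..1
}

// Response times copied from the classification table.
const RESPONSE_TIME: Record<Severity, string> = {
  SEV1: '15 min',
  SEV2: '30 min',
  SEV3: '2 hours',
  SEV4: 'Next business day',
};

// Returns the most severe matching level, or null when nothing looks wrong.
function classifySeverity(s: HealthSnapshot): Severity | null {
  if (!s.apiReachable) return 'SEV1';        // complete outage
  if (s.errorRate > 0.5) return 'SEV2';      // >50% errors (from the table)
  if (s.p95LatencyMs > 1000) return 'SEV3';  // assumed latency threshold
  if (s.cacheHitRate < 0.5) return 'SEV4';   // assumed cache threshold
  return null;
}
```

Checks run from most to least severe, so a snapshot that matches several rows is reported at the highest applicable severity.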

Incident Response Flow

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   DETECT    │────▶│   TRIAGE    │────▶│  DIAGNOSE   │
│             │     │             │     │             │
│ Alert fires │     │ Determine   │     │ Identify    │
│ User report │     │ severity    │     │ root cause  │
└─────────────┘     └──────┬──────┘     └──────┬──────┘
                          │                    │
                          ▼                    ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   CLOSE     │◀────│   RECOVER   │◀────│  MITIGATE   │
│             │     │             │     │             │
│ Post-mortem │     │ Verify fix  │     │ Stop impact │
│ Document    │     │ Monitor     │     │ Temporary   │
└─────────────┘     └─────────────┘     └─────────────┘

Runbook Procedures

RB-001: MaintainX API Unreachable

Symptoms:

All API calls failing with network errors
Health check endpoint returning unhealthy
Timeout errors in logs

Diagnostic Steps:

#!/bin/bash
# scripts/diagnose-api-unreachable.sh

echo "=== MaintainX API Connectivity Diagnostic ==="
echo ""

# Step 1: Basic connectivity
echo "1. Testing DNS resolution..."
nslookup api.getmaintainx.com

echo ""
echo "2. Testing TCP connectivity..."
nc -zv api.getmaintainx.com 443

echo ""
echo "3. Testing HTTPS endpoint..."
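# Note: -v echoes request headers, including the Authorization header; redact the output before sharing it
curl -v --max-time 10 "https://api.getmaintainx.com/v1/users?limit=1" \
  -H "Authorization: Bearer $MAINTAINX_API_KEY" 2>&1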

echo ""
echo "4. Checking MaintainX status page..."
curl -s https://status.getmaintainx.com/api/v2/status.json | jq '.status.indicator'

echo ""
echo "5. Checking from different regions (if available)..."
# Add region-specific tests if you have multi-region deployment

Mitigation:

Check MaintainX Status Page
If MaintainX is down: Enable fallback mode / cached responses
If local network: Check firewall, DNS, proxy settings
Escalate to network team if needed
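
The "fallback mode / cached responses" step above could be sketched as follows. This is an illustration only: `FallbackCache`, `fetchWithFallback`, and the staleness limit are hypothetical names and parameters, not part of an existing client library. The idea is to serve the last known good response, marked as stale, when the live call fails.

```typescript
interface CachedEntry<T> {
  value: T;
  storedAt: number; // epoch milliseconds
}

// Hypothetical in-memory cache of last known good responses.
class FallbackCache<T> {
  private entries = new Map<string, CachedEntry<T>>();
  private maxStaleMs: number;

  constructor(maxStaleMs: number) {
    this.maxStaleMs = maxStaleMs;
  }

  store(key: string, value: T, now: number = Date.now()): void {
    this.entries.set(key, { value, storedAt: now });
  }

  // Returns the cached value unless it is missing or older than maxStaleMs.
  read(key: string, now: number = Date.now()): T | undefined {
    const entry = this.entries.get(key);
    if (!entry || now - entry.storedAt > this.maxStaleMs) return undefined;
    return entry.value;
  }
}

// Tries the live call first; on failure, falls back to the cached value.
async function fetchWithFallback<T>(
  key: string,
  fetchLive: () => Promise<T>,
  cache: FallbackCache<T>,
): Promise<{ value: T; stale: boolean }> {
  try {
    const value = await fetchLive();
    cache.store(key, value);
    return { value, stale: false };
  } catch (err) {
    const cached = cache.read(key);
    if (cached === undefined) throw err; // nothing usable cached: surface the outage
    return { value: cached, stale: true };
  }
}
```

Returning the `stale` flag lets callers show a banner or skip writes while MaintainX is unavailable, rather than silently treating cached data as current.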

Resolution:

Wait for MaintainX restoration
Verify connectivity restored
Monitor for data sync catch-up

RB-002: High Error Rate (>5%)

Symptoms:

MaintainXHighErrorRate alert firing
4xx or 5xx errors in logs
Failed work order operations

Diagnostic Steps:

// scripts/diagnose-errors.ts

async function diagnoseHighErrorRate() {
  console.log('=== Error Rate Diagnostic ===\n');

  // Get recent error logs
  const errors = await getRecentErrors(100);

  // Group by error type
  const errorGroups = new Map<string, number>();
  errors.forEach(e => {
    const key = `${e.status}:${e.endpoint}`;
    errorGroups.set(key, (errorGroups.get(key) || 0) + 1);
  });

  console.log('Error distribution:');
  errorGroups.forEach((count, key) => {
    console.log(`  ${key}: ${count}`);
  });

  // Check for patterns
  console.log('\nCommon patterns:');

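  const status401 = errors.filter(e => e.status === 401);
  if (status401.length > 0) {
    console.log('  - 401 Unauthorized: Check API key validity');
  }

  const status403 = errors.filter(e => e.status === 403);
  if (status403.length > 0) {
    console.log('  - 403 Forbidden: Check API key permissions');
  }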

  const status429 = errors.filter(e => e.status === 429);
  if (status429.length > 0) {
    console.log('  - 429 Rate Limited: Reduce request rate');
  }

  const status5xx = errors.filter(e => e.status >= 500);
  if (status5xx.length > 0) {
    console.log('  - 5xx Server Errors: MaintainX side issue');
  }
}

Mitigation by Error Type:

| Error | Immediate Action |
|-------|-----------------|
| 401 | Rotate API key |
| 403 | Check permissions |
| 429 | Enable throttling |
| 5xx | Enable circuit breaker |
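
For the 5xx row, a circuit breaker stops sending requests after repeated failures and retries after a cooldown. The following is a minimal sketch of that pattern, assuming a consecutive-failure threshold and a fixed cooldown. The `CircuitBreaker` class and its parameters are hypothetical and not taken from an existing integration.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private failureThreshold: number;
  private cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns true if a request may be sent now.
  allowRequest(now: number = Date.now()): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: allow trial traffic until a result is recorded.
      this.state = 'half-open';
    }
    return this.state !== 'open';
  }

  // A success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failure during half-open, or reaching the threshold, opens the breaker.
  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `allowRequest()` before each MaintainX call and report the outcome with `recordSuccess()` or `recordFailure()`. The timestamp arguments exist so the behavior can be exercised without waiting for real time to pass.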

Resolution:

Fix underlying issue
Clear error state
Monitor for recurrence

RB-003: Webhook Processing Failures

Symptoms:

Webhooks not being processed
Events missing in local system
webhook_failures metric elevated

Diagnostic Steps:

#!/bin/bash
# Check webhook endpoint health
curl -v https://your-app.com/webhooks/maintainx/health

# Check webhook queue depth
redis-cli LLEN maintainx:webhook:queue

# Inspect the first 10 entries of the failed-webhook list (LRANGE bounds are inclusive)
redis-cli LRANGE maintainx:webhook:failed 0 9

# Test webhook signature verification
./scripts/test-webhook.sh

Mitigation:

Check webhook secret is correct
Verify endpoint is accessible from internet
Check for queue backlog
Temporarily pause webhook processing if overwhelmed

Resolution:

Fix webhook handler
Replay failed webhooks
Verify event sync
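
Replaying failed webhooks is safer when the replay list is ordered and deduplicated first, so that an event is not applied twice or out of sequence. The sketch below assumes each stored failure carries an event identifier and a receive timestamp. The `FailedWebhook` shape and `planReplay` are hypothetical, and the actual fields depend on how your handler records failures.

```typescript
// Hypothetical record of a failed delivery; field names are assumptions.
interface FailedWebhook {
  eventId: string;    // identifier shared by retries of the same event
  receivedAt: number; // epoch milliseconds
  payload: unknown;
}

// Returns the events to replay: oldest first, one entry per eventId,
// skipping events that were already processed successfully.
function planReplay(
  failed: FailedWebhook[],
  processedIds: Set<string>,
): FailedWebhook[] {
  const seen = new Set<string>();
  return failed
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.receivedAt - b.receivedAt)
    .filter(event => {
      if (processedIds.has(event.eventId) || seen.has(event.eventId)) {
        return false;
      }
      seen.add(event.eventId);
      return true;
    });
}
```

The resulting list can then be fed through the fixed webhook handler one event at a time, recording each `eventId` as processed once it succeeds.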

RB-004: Data Sync Out of Sync

Symptoms:

Local data doesn't match MaintainX
Missing work orders or updates
Stale data being served

Diagnostic Steps:

// scripts/diagnose-sync.ts

async function diagnoseSyncIssues() {
  const client = new MaintainXClient();

  console.log('=== Sync Diagnostic ===\n');

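  // Get sample from MaintainX
  // Note: both samples must use the same ordering and filters,
  // otherwise the comparison below reports false mismatches
  const maintainxWOs = await client.getWorkOrders({ limit: 100 });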

  // Get sample from local
  const localWOs = await localStore.getWorkOrders({ limit: 100 });

  // Compare
  const maintainxIds = new Set(maintainxWOs.workOrders.map(wo => wo.id));
  const localIds = new Set(localWOs.map(wo => wo.id));

  const missingLocally = [...maintainxIds].filter(id => !localIds.has(id));
  const extraLocally = [...localIds].filter(id => !maintainxIds.has(id));

  console.log(`MaintainX work orders: ${maintainxIds.size}`);
  console.log(`Local work orders: ${localIds.size}`);
  console.log(`Missing locally: ${missingLocally.length}`);
  console.log(`Extra locally: ${extraLocally.length}`);

  if (missingLocally.length > 0) {
    console.log('\nMissing IDs (first 10):');
    missingLocally.slice(0, 10).forEach(id => console.log(`  - ${id}`));
  }
}

Mitigation:

Trigger full sync if data is stale
Check webhook delivery
Verify polling job is running

Resolution:

Run reconciliation script
Fix sync process
Monitor for drift
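
A reconciliation script can be built around a pure function that compares both sides and returns a plan, which makes the destructive part easy to review before it runs. This is a sketch under assumptions: the `WorkOrderRecord` fields (`id`, `updatedAt`) and `buildReconcilePlan` are hypothetical and should be mapped to the fields your local store actually keeps.

```typescript
// Hypothetical minimal view of a work order; field names are assumptions.
interface WorkOrderRecord {
  id: string;
  updatedAt: string; // ISO 8601 timestamp
}

interface ReconcilePlan {
  create: string[]; // present in MaintainX, missing locally
  update: string[]; // present in both, MaintainX copy is newer
  remove: string[]; // present locally, missing in MaintainX
}

function buildReconcilePlan(
  remote: WorkOrderRecord[],
  local: WorkOrderRecord[],
): ReconcilePlan {
  const localById = new Map<string, WorkOrderRecord>();
  local.forEach(wo => localById.set(wo.id, wo));
  const remoteIds = new Set<string>(remote.map(wo => wo.id));

  const plan: ReconcilePlan = { create: [], update: [], remove: [] };
  for (const wo of remote) {
    const localCopy = localById.get(wo.id);
    if (!localCopy) {
      plan.create.push(wo.id);
    } else if (Date.parse(wo.updatedAt) > Date.parse(localCopy.updatedAt)) {
      plan.update.push(wo.id);
    }
  }
  for (const wo of local) {
    if (!remoteIds.has(wo.id)) plan.remove.push(wo.id);
  }
  return plan;
}
```

The `remove` list is only trustworthy when both inputs are complete listings. With a limited sample, such as the 100-record diagnostic above, treat it as a hint rather than a deletion list.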

RB-005: Performance Degradation

Symptoms:

MaintainXHighLatency alert firing
Slow user experience
Timeout errors

Diagnostic Steps:

// scripts/diagnose-performance.ts

async function diagnosePerformance() {
  // Same client wrapper used in scripts/diagnose-sync.ts
  const client = new MaintainXClient();

  console.log('=== Performance Diagnostic ===\n');

  // Test API latency
  const endpoints = ['/workorders', '/assets', '/locations', '/users'];

  for (const endpoint of endpoints) {
    const start = Date.now();
    await client.request(`GET ${endpoint}?limit=10`);
    const latency = Date.now() - start;

    const status = latency > 1000 ? '[SLOW]' : '[OK]';
    console.log(`${status} ${endpoint}: ${latency}ms`);
  }

  // Check cache performance
  const cacheStats = await cache.getStats();
  console.log('\nCache stats:');
  console.log(`  Hit rate: ${cacheStats.hitRate}%`);
  console.log(`  Size: ${cacheStats.size}`);

  // Check connection pool
  const poolStats = await getConnectionPoolStats();
  console.log('\nConnection pool:');
  console.log(`  Active: ${poolStats.active}`);
  console.log(`  Idle: ${poolStats.idle}`);
  console.log(`  Waiting: ${poolStats.waiting}`);
}

Mitigation:

Enable aggressive caching
Reduce concurrent requests
Add request queuing

Resolution:

Identify slow endpoints
Optimize queries
Scale resources if needed
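
To identify slow endpoints from more than a single request, latency samples can be grouped per endpoint and compared at a high percentile. The sketch below uses a nearest-rank 95th percentile and reuses the 1000 ms threshold from the diagnostic script. `LatencySample`, `percentile`, and `findSlowEndpoints` are hypothetical helpers, and the sample source (logs, metrics, or repeated probes) is left to the integration.

```typescript
interface LatencySample {
  endpoint: string;
  latencyMs: number;
}

// Nearest-rank percentile; p is in the range (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Returns endpoint -> p95 latency for endpoints whose p95 exceeds the threshold.
function findSlowEndpoints(
  samples: LatencySample[],
  thresholdMs: number = 1000,
): Map<string, number> {
  const byEndpoint = new Map<string, number[]>();
  for (const sample of samples) {
    const list = byEndpoint.get(sample.endpoint) ?? [];
    list.push(sample.latencyMs);
    byEndpoint.set(sample.endpoint, list);
  }

  const slow = new Map<string, number>();
  byEndpoint.forEach((latencies, endpoint) => {
    const p95 = percentile(latencies, 95);
    if (p95 > thresholdMs) slow.set(endpoint, p95);
  });
  return slow;
}
```

Using a percentile instead of an average keeps occasional slow requests visible, which is usually what users notice during a degradation.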

Emergency Contacts

| Role | Contact | Escalation |
|------|---------|------------|
| On-call Engineer | PagerDuty | Automatic |
| MaintainX Support | support@getmaintainx.com | Manual |
| Infrastructure | #infra-oncall Slack | Manual |

Post-Incident Template

## Incident Report: [TITLE]

**Date:** YYYY-MM-DD
**Duration:** HH:MM - HH:MM (X hours)
**Severity:** SEVX
**Status:** Resolved

### Summary
Brief description of what happened.

### Timeline
- HH:MM - Alert fired
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Resolution confirmed

### Root Cause
What caused the incident.

### Impact
- X work orders affected
- Y minutes of downtime
- Z users impacted

### Resolution
What was done to fix it.

### Action Items
- [ ] Prevent recurrence
- [ ] Improve detection
- [ ] Update runbooks

Output

Incident diagnosed
Mitigation applied
Root cause identified
Resolution documented

Resources

Next Steps

For data handling patterns, see maintainx-data-handling.
