Debugging stuck Hawk/Inspect evaluations

Diagnoses why a Hawk/Inspect AI evaluation is stuck or not progressing. Provides a checklist, an error-pattern lookup, and direct API tests to pinpoint the problem (API retries, pod crashes, token limits, and so on). Helps recover the evaluation by deleting and restarting it, or by inspecting the logs.

Spar Skills Guide Bot
Testing · Intermediate
80 · 02/06/2026
Claude Code
#debugging #stuck-evals #hawk #inspect-ai #evaluation

Our take

Diagnoses and resolves stuck Hawk/Inspect AI evaluations by checking authentication, pod state, logs, and API calls.

Strengths

  • Provides a structured checklist for quickly identifying the cause of the blockage.
  • Includes advanced techniques such as calling the API directly and analyzing retry patterns.
  • Offers a concrete resolution for each listed error pattern.

Limitations

  • Requires access to internal Hawk tooling and to the S3 infrastructure.
  • Resolutions depend on the specific environment configuration (middleman, providers).
  • Does not cover underlying infrastructure problems (network, API quota).

When to use

When a Hawk/Inspect AI evaluation is not progressing, samples remain pending, or 500 errors appear.

When to avoid

For evaluation problems unrelated to stalls, such as errors in task logic or scoring.

Security analysis

Caution
Quality score: 92/100

The skill instructs authenticated access to internal services and deletion of resources, but within a legitimate debugging context. It does not exfiltrate data, run destructive system commands, or disable safety. Risk is limited to users with existing authorization.

Points of attention
  • Uses `curl` to test internal API endpoints with access tokens.
  • Instructs deletion of evaluation sets via `hawk delete`.
  • References internal infrastructure (middleman.internal.metr.org) and proprietary tools.

Examples

Debug stuck eval
My Inspect eval is stuck. The eval-set ID is evalset-abc123. Can you help me debug it?
Investigate high retry count
I see 'HTTP retries: 45' in my eval logs. What does that mean and how do I fix the stuck eval?
Test API directly
I'm getting 500 errors from the middleman during an eval. How do I test the API directly to see the real error?

name: debug-stuck-eval
description: Debug stuck Hawk/Inspect AI evaluations. Use when user mentions "stuck eval", "eval not progressing", "eval hanging", "samples not completing", "eval set frozen", "runner stuck", "500 errors in eval", "retry loop", "eval timeout", or asks why an evaluation isn't finishing.

Quick Checklist

  1. Verify auth: `hawk auth access-token > /dev/null || echo "Run 'hawk login' first"`
  2. Get the eval-set ID from the user
  3. Check status: `hawk status <eval-set-id>` returns a JSON report with pod state, logs, and metrics
  4. View logs: `hawk logs <eval-set-id>`, or `hawk logs -f` for follow mode
  5. List samples: `hawk list samples <eval-set-id>` shows completion status
  6. Look for error patterns (see below)
  7. Test the API directly if logs show retries without clear errors
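
The command-line steps of the checklist can be scripted. The following is a minimal sketch, not part of the skill itself: the function names, the result layout, and the 120-second timeout are illustrative assumptions, and only the `hawk` subcommands come from the checklist above.

```python
import shutil
import subprocess


def checklist_commands(eval_set_id: str) -> dict:
    """Return the checklist's CLI invocations, keyed by step name."""
    return {
        "auth": ["hawk", "auth", "access-token"],
        "status": ["hawk", "status", eval_set_id],
        "logs": ["hawk", "logs", eval_set_id],
        "samples": ["hawk", "list", "samples", eval_set_id],
    }


def run_checklist(eval_set_id: str, timeout: float = 120.0) -> dict:
    """Run each step in order and collect exit code plus captured output.

    Stops after the auth step if it fails, mirroring step 1 of the checklist.
    The timeout is an assumed safety net, not a documented hawk setting.
    """
    if shutil.which("hawk") is None:
        raise RuntimeError("hawk CLI not found on PATH")

    results = {}
    for step, argv in checklist_commands(eval_set_id).items():
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        # Do not keep the access token around; only record whether auth worked.
        stdout = "" if step == "auth" else proc.stdout
        results[step] = {
            "returncode": proc.returncode,
            "stdout": stdout,
            "stderr": proc.stderr,
        }
        if step == "auth" and proc.returncode != 0:
            results[step]["hint"] = "Run 'hawk login' first"
            break
    return results
```

Calling `run_checklist("evalset-abc123")` would return one entry per step; the pattern search and direct API test remain manual follow-ups.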

Error Patterns

| Log Pattern | Meaning | Resolution |
|-------------|---------|------------|
| `[uuid task/id/epoch model] Retrying request to /responses` | OpenAI SDK retry with sample context | Test API directly with curl to see real error |
| `[uuid task/id/epoch model] -> model retry N ... [ErrorType code]` | Inspect retry with error summary | Check error type; use curl for full details |
| `500 - Internal server error` | API issue | Download buffer, find failing request, test through middleman AND directly to provider |
| `400 - invalid_request_error` | Token/context limit exceeded | Check message count and model context window |
| `Pod UID mismatch` | Sandbox pod was killed and restarted | No fix needed; the sample errored out and Inspect will retry |
| Empty output, `pending: true` | API returned malformed response | Restart eval (buffer resumes) |
| `OOMKilled` in pod status | Memory exhaustion | Increase pod memory limits |
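
To scan a large log against these patterns, a small classifier can help. The sketch below is illustrative only: the regular expressions are assumptions about how each line is rendered, `classify` is a hypothetical helper, and the meanings and resolutions are copied from the table.

```python
import re
from typing import Optional, Tuple

# (regex, meaning, resolution). The regexes are guesses at the log rendering;
# adjust them to match the lines you actually see.
PATTERNS = [
    (re.compile(r"Retrying request to /responses"),
     "OpenAI SDK retry with sample context",
     "Test API directly with curl to see real error"),
    (re.compile(r"-> .*retry \d+"),
     "Inspect retry with error summary",
     "Check error type; use curl for full details"),
    (re.compile(r"\b500 - Internal server error", re.IGNORECASE),
     "API issue",
     "Download buffer, find failing request, test through middleman AND directly to provider"),
    (re.compile(r"\b400 - invalid_request_error"),
     "Token/context limit exceeded",
     "Check message count and model context window"),
    (re.compile(r"Pod UID mismatch"),
     "Sandbox pod was killed and restarted",
     "No fix needed; Inspect will retry the sample"),
    (re.compile(r"pending:\s*true"),
     "API returned malformed response",
     "Restart eval (buffer resumes)"),
    (re.compile(r"OOMKilled"),
     "Memory exhaustion",
     "Increase pod memory limits"),
]


def classify(line: str) -> Optional[Tuple[str, str]]:
    """Return (meaning, resolution) for the first matching pattern, else None."""
    for regex, meaning, resolution in PATTERNS:
        if regex.search(line):
            return meaning, resolution
    return None
```

The first match wins, so the more specific patterns are listed before the generic status-code ones.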

Key Techniques

  1. Retry messages have sample context - All retry messages include a `[sample_uuid task/sample_id/epoch model]` prefix. Inspect's own retries also append a compact error summary such as `[RateLimitError 429 rate_limit_exceeded]`. The OpenAI SDK's internal retry messages still don't show the actual error, so use curl to get full details.
  2. FAIL-OK patterns are fine - Alternating failures and successes mean the eval IS progressing. Only worry about consistent FAIL-FAIL-FAIL patterns.
  3. Use S3 for buffer access - Download `.buffer/` from S3 rather than accessing the runner pod directly.
  4. Read .eval files with inspect_ai - Use `from inspect_ai.log import read_eval_log` instead of manually extracting zips.
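
Techniques 2 and 4 can be sketched in Python. This is a rough illustration under stated assumptions: the helper names and the threshold of three consecutive failures are hypothetical, and `errored_samples` assumes the `EvalLog` returned by `read_eval_log` exposes `samples`, each with `id`, `epoch`, and an optional `error` carrying a `message`.

```python
from typing import Iterable, List, Tuple


def trailing_failures(outcomes: Iterable[bool]) -> int:
    """Count consecutive failures at the end of a request history.

    `outcomes` is ordered oldest to newest; True means the request succeeded.
    """
    streak = 0
    for ok in outcomes:
        streak = 0 if ok else streak + 1
    return streak


def looks_stuck(outcomes: Iterable[bool], threshold: int = 3) -> bool:
    """FAIL-OK alternation is progress; a run of failures at the end is not.

    The threshold of 3 is an assumption drawn from the FAIL-FAIL-FAIL wording.
    """
    return trailing_failures(outcomes) >= threshold


def errored_samples(log_path: str) -> List[Tuple]:
    """List (sample id, epoch, error message) for samples that errored."""
    # Imported lazily so the helpers above work without inspect_ai installed.
    from inspect_ai.log import read_eval_log

    log = read_eval_log(log_path)
    return [
        (sample.id, sample.epoch, sample.error.message)
        for sample in (log.samples or [])
        if sample.error is not None
    ]
```

For example, `looks_stuck([False, True, False, True])` is false because the latest request succeeded, while a history ending in three failures is flagged.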

Test API Directly

Middleman is the auth proxy. If middleman fails but direct provider calls work, it's a middleman issue.

TOKEN=$(hawk auth access-token)

# Test through middleman
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 100, "messages": [{"role": "user", "content": "Say hello"}]}'

# Test OpenAI-compatible
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 100}'
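
The same test can be driven from Python when the response body needs to be captured alongside the status code. The sketch below uses only the standard library; `build_request`, `send`, and `attribute_failure` are hypothetical helpers, and the verdict for the case where both calls fail is an inference rather than something the skill states.

```python
import json
import urllib.error
import urllib.request

MIDDLEMAN = "https://middleman.internal.metr.org"


def build_request(token: str, path: str, payload: dict) -> urllib.request.Request:
    """Build the same POST the curl commands send through middleman."""
    return urllib.request.Request(
        MIDDLEMAN + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 300.0):
    """Return (status, body), including for HTTP errors, so the real error text is visible."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def attribute_failure(middleman_status: int, provider_status: int) -> str:
    """Apply the rule above: middleman fails while the provider works -> middleman."""
    if 200 <= middleman_status < 300:
        return "ok"
    if 200 <= provider_status < 300:
        return "middleman"
    # Assumption: if both paths fail, suspect the provider or the request itself.
    return "provider or request"
```

A request built with `build_request(token, "/anthropic/v1/messages", payload)` mirrors the first curl command; the direct provider call needs provider credentials and is left out here.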

Recovery

# Delete stuck eval and restart
hawk delete <eval-set-id>
hawk eval-set <config.yaml>

The sample buffer in S3 allows Inspect to resume from where it left off (unless you use --no-resume).

HTTP Retry Count

Task progress logs include "HTTP retries: X". High retry counts indicate API instability even while tasks complete.

Severity: Retry count × wait time = stuck duration. For example, 45 retries × 1,800 s = 81,000 s, or about 22.5 hours stuck.
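
The severity arithmetic is easy to automate. A small sketch, assuming the progress line contains the literal text `HTTP retries: X` and that the count may include thousands separators (an assumption); the function names are illustrative.

```python
import re
from datetime import timedelta
from typing import Optional

# Matches the retry counter in a task progress line, e.g. "HTTP retries: 45".
RETRY_RE = re.compile(r"HTTP retries:\s*(\d[\d,]*)")


def parse_retry_count(line: str) -> Optional[int]:
    """Extract the retry count from a progress line, or None if absent."""
    match = RETRY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def stuck_duration(retries: int, wait_seconds: float) -> timedelta:
    """Retry count x wait time, as an upper-bound estimate of time spent waiting."""
    return timedelta(seconds=retries * wait_seconds)
```

With the figures from the example, `stuck_duration(45, 1800)` is 22 hours 30 minutes.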

More Details

See docs/debugging-stuck-evals.md for:

  • Sample buffer SQL queries
  • Detailed API testing examples
  • Escalation checklist

References

Filing Issues

  • Middleman: https://github.com/metr-middleman/middleman-server/issues
  • Hawk: Linear issue on Evals Execution team
  • Inspect AI: https://github.com/UKGovernmentBEIS/inspect_ai/issues