Our take
This skill helps debug stuck Hawk/Inspect AI evaluations by checking status, logs, and error patterns.
Strengths
- Provides a systematic checklist for diagnosing stuck evaluations.
- Includes direct API testing techniques to surface the real errors.
- Offers recovery commands such as deleting and restarting the evaluation.
- Points to detailed documentation for deeper problems.
Limitations
- Requires access to the Hawk CLI and the S3 buffer.
- Some error patterns may not be covered.
- Assumes basic familiarity with Inspect AI and Middleman.
Use when an AI evaluation is not progressing, samples are not completing, or you see retry loops or 500 errors.
Do not use for general model debugging or when the evaluation is running normally.
Security analysis
Safe. The skill only provides debugging instructions and does not instruct any destructive actions, exfiltration, or execution of obfuscated payloads. It uses legitimate internal CLI tools and API calls with the user's own auth token for diagnostic purposes. No risk identified.
No issues detected
Examples
- My eval set is stuck. Can you help debug it?
- I'm getting 500 errors during the evaluation. What should I do?
- The evaluation is retrying many times and not finishing.
name: debug-stuck-eval
description: Debug stuck Hawk/Inspect AI evaluations. Use when user mentions "stuck eval", "eval not progressing", "eval hanging", "samples not completing", "eval set frozen", "runner stuck", "500 errors in eval", "retry loop", "eval timeout", or asks why an evaluation isn't finishing.
Quick Checklist
- Verify auth: hawk auth access-token > /dev/null || echo "Run 'hawk login' first"
- Get eval-set-id from user
- Check status: hawk status <eval-set-id> - JSON report with pod state, logs, metrics
- View logs: hawk logs <eval-set-id> or hawk logs -f for follow mode
- List samples: hawk list samples <eval-set-id> - see completion status
- Look for error patterns (see below)
- Test API directly if logs show retries without clear errors
Error Patterns
| Log Pattern | Meaning | Resolution |
|-------------|---------|------------|
| [uuid task/id/epoch model] Retrying request to /responses | OpenAI SDK retry with sample context | Test API directly with curl to see real error |
| [uuid task/id/epoch model] -> model retry N ... [ErrorType code] | Inspect retry with error summary | Check error type; use curl for full details |
| 500 - Internal server error | API issue | Download buffer, find failing request, test through middleman AND directly to provider |
| 400 - invalid_request_error | Token/context limit exceeded | Check message count and model context window |
| Pod UID mismatch | Sandbox pod was killed and restarted | No fix needed—sample errored out, Inspect will retry |
| Empty output, pending: true | API returned malformed response | Restart eval (buffer resumes) |
| OOMKilled in pod status | Memory exhaustion | Increase pod memory limits |
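The table above can be turned into a small triage helper for log lines. The following is a minimal sketch, not part of Hawk or Inspect AI: the regular expressions are assumptions paraphrased from the "Log Pattern" column, and the names `PATTERNS` and `classify` are hypothetical. Adjust the expressions to the exact wording in your own logs.

```python
import re
from typing import Optional

# Regexes paraphrased from the error-pattern table; the exact log wording is
# an assumption and may need adjusting for real logs.
PATTERNS = [
    (re.compile(r"Retrying request to /responses"),
     "OpenAI SDK retry: test the API directly with curl to see the real error"),
    (re.compile(r"-> model retry \d+"),
     "Inspect retry: check the error type; use curl for full details"),
    (re.compile(r"\b500\b.*Internal server error", re.IGNORECASE),
     "API issue: download buffer, test through middleman and directly"),
    (re.compile(r"\b400\b.*invalid_request_error"),
     "Token/context limit: check message count and model context window"),
    (re.compile(r"Pod UID mismatch"),
     "Sandbox pod restarted: no fix needed, Inspect will retry"),
    (re.compile(r"pending: true"),
     "Malformed API response: restart eval (buffer resumes)"),
    (re.compile(r"OOMKilled"),
     "Memory exhaustion: increase pod memory limits"),
]


def classify(line: str) -> Optional[str]:
    """Return the first matching resolution hint, or None if nothing matches."""
    for pattern, advice in PATTERNS:
        if pattern.search(line):
            return advice
    return None
```

Lines that match nothing return `None`, which is a reasonable cue to fall back to testing the API directly.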
Key Techniques
- Retry messages have sample context - All retry messages include a [sample_uuid task/sample_id/epoch model] prefix. Inspect's own retries also include a compact error summary suffix like [RateLimitError 429 rate_limit_exceeded]. The OpenAI SDK's internal retry messages still don't show the actual error; use curl for full details.
- FAIL-OK patterns are fine - Alternating failures and successes mean the eval IS progressing. Only worry about consistent FAIL-FAIL-FAIL patterns.
- Use S3 for buffer access - Download .buffer/ from S3 rather than accessing the runner pod directly.
- Read .eval files with inspect_ai - Use from inspect_ai.log import read_eval_log instead of manually extracting zips.
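The FAIL-OK versus FAIL-FAIL-FAIL distinction can be expressed as a check on the trailing run of failures. This is a minimal sketch under stated assumptions: outcomes have already been extracted from the logs as the strings "OK" and "FAIL", the function names are hypothetical, and the threshold of 3 simply mirrors the FAIL-FAIL-FAIL wording rather than any documented limit.

```python
from typing import Sequence


def trailing_failures(outcomes: Sequence[str]) -> int:
    """Count consecutive "FAIL" entries at the end of the sequence."""
    count = 0
    for outcome in reversed(outcomes):
        if outcome != "FAIL":
            break
        count += 1
    return count


def looks_stuck(outcomes: Sequence[str], threshold: int = 3) -> bool:
    """True when the most recent `threshold` outcomes are all failures.

    The default threshold is an assumption, not a documented value.
    """
    return trailing_failures(outcomes) >= threshold
```

An alternating history such as `["FAIL", "OK", "FAIL", "OK"]` is not flagged, while `["OK", "FAIL", "FAIL", "FAIL"]` is.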
Test API Directly
Middleman is the auth proxy. If middleman fails but direct provider calls work, it's a middleman issue.
TOKEN=$(hawk auth access-token)
# Test through middleman
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-d '{"model": "claude-sonnet-4-20250514", "max_tokens": 100, "messages": [{"role": "user", "content": "Say hello"}]}'
# Test OpenAI-compatible
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 100}'
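Once both a middleman call and a direct provider call have been tried, the outcome can be interpreted with a small decision function. This is only a sketch: the first branch restates the rule given in this section, while the other three branches are assumptions added for completeness, and the name `triage` is hypothetical.

```python
def triage(middleman_ok: bool, direct_ok: bool) -> str:
    """Interpret the result of testing through middleman and directly."""
    if not middleman_ok and direct_ok:
        # Stated in the text: middleman fails, direct works.
        return "middleman issue"
    if not middleman_ok and not direct_ok:
        # Assumption: both paths fail, so the provider or the request is suspect.
        return "provider or request issue"
    if middleman_ok and not direct_ok:
        # Assumption: likely a problem with the direct credentials or endpoint.
        return "check direct credentials or endpoint"
    # Assumption: nothing reproduced, so the original error may be transient.
    return "not reproduced"
```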
Recovery
# Delete stuck eval and restart
hawk delete <eval-set-id>
hawk eval-set <config.yaml>
The sample buffer in S3 allows Inspect to resume from where it left off (unless you use --no-resume).
HTTP Retry Count
Task progress logs include "HTTP retries: X". High retry counts indicate API instability even while tasks complete.
Severity: Retry count × wait time = stuck duration. E.g., 45 retries × 1800s = 22+ hours stuck.
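The severity estimate above is a single multiplication, sketched here as a helper. The function name is hypothetical, and the calculation assumes every retry waits the full wait time, so it is an upper-bound style estimate rather than a measurement.

```python
def stuck_duration_hours(retries: int, wait_seconds: float) -> float:
    """Estimate stuck time in hours as retry count x wait time."""
    return retries * wait_seconds / 3600


# The example from the text: 45 retries at 1800 s each.
print(stuck_duration_hours(45, 1800))  # 22.5
```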
More Details
See docs/debugging-stuck-evals.md for:
- Sample buffer SQL queries
- Detailed API testing examples
- Escalation checklist
References
- Inspect AI Model Providers - Model configuration
- Inspect AI Eval Logs - .eval file format
Filing Issues
- Middleman: https://github.com/metr-middleman/middleman-server/issues
- Hawk: Linear issue on Evals Execution team
- Inspect AI: https://github.com/UKGovernmentBEIS/inspect_ai/issues