Summarize Experiment Results

Generates a summary.md file that summarizes the key metrics of a fine-tuning and evaluation experiment. It parses run statuses, extracts the final training loss from SLURM logs and the accuracy from .eval files, then compiles everything into a text report. Useful after running an experiment to get a quick overview of the results without manually going through the output files.

Spar Skills Guide Bot
Data & AI, Intermediate
9002/06/2026
Claude Code
#experiment-summary #machine-learning #evaluation #slurm #metrics

Our take

Creates a lightweight summary of completed machine learning experiment results, capturing key metrics such as training loss and evaluation accuracy.

Strengths

  • Automatic extraction of training loss from SLURM outputs
  • Mapping of evaluation accuracy to specific epochs via SLURM job names
  • Support for partial results and logging of the process

Limitations

  • Requires a specific file structure (experiment_summary.yaml, SLURM outputs)
  • Depends on a conda environment with inspect-ai
  • Epoch mapping relies on a naming convention in SLURM job names

When to use

After running a fine-tuning experiment, to generate a concise text summary of the results.

When to avoid

For real-time monitoring during training, or when the experiment results are not yet available.

Security analysis

Safe
Quality score: 92/100

The skill only performs data summarization tasks using standard command-line tools and project-local Python scripts. No destructive, exfiltrating, or obfuscated actions are instructed. The use of python and grep is legitimate for parsing logs, and there is no risk of exposing secrets or disabling safety.

No issues detected

Examples

  • Summarize current experiment: Summarize the experiment results for the current directory.
  • Summarize experiment from path: Generate a summary.md for the experiment located at /path/to/experiment.
  • Check experiment summary: Run summarize-experiment on the experiment in the current directory.

name: summarize-experiment
description: Create a lightweight summary of experiment results from a completed (fine-tuned and evaluated) experiment. Use after run-experiment to capture key metrics from the experiment in textual form.

Summarize Experiment

Generate a summary.md file capturing key metrics from a completed experiment. Think R's summary() for experiment results.

Your Task

Create a lightweight summary of experiment results:

  1. Parse run status from experiment_summary.yaml
  2. Extract final training loss from SLURM stdout
  3. Extract accuracy from inspect-ai .eval files
  4. Generate summary.md in experiment directory
  5. Log the process in logs/summarize-experiment.log

Prerequisites

  • experiment_summary.yaml exists
  • At least some runs have completed (partial results acceptable)
  • run-experiment has been executed (or manual SLURM jobs run)
  • Conda environment activated: the parse_eval_log.py script requires inspect-ai, so activate the conda environment named in claude.local.md before running extraction commands.

Workflow

1. Locate Experiment

Find the experiment directory:

  • If in an experiment directory (contains experiment_summary.yaml): use current directory
  • Otherwise: ask user for path
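
The lookup above can be sketched in Python as follows. The helper name `find_experiment_dir` is hypothetical; only the marker file `experiment_summary.yaml` comes from this skill.

```python
from pathlib import Path
from typing import Optional

MARKER = "experiment_summary.yaml"  # file that identifies an experiment directory


def find_experiment_dir(start: Path) -> Optional[Path]:
    """Return `start` if it is an experiment directory, else None.

    A None result means the caller should ask the user for a path.
    """
    return start if (start / MARKER).is_file() else None
```

Calling `find_experiment_dir(Path.cwd())` covers the first bullet; a `None` result triggers the second.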

2. Parse Run Status

Read experiment_summary.yaml to identify runs:

From runs: section:

  • name: Run identifier
  • type: "fine-tuned" or "control"
  • model: Model name
  • parameters: Dict of hyperparameters (empty for control runs)

From evaluation.matrix: section:

  • run: Run name
  • tasks: List of evaluation task names
  • epochs: List of epochs to evaluate (null for control runs)

Determine status by checking filesystem:

  • Fine-tuning: Check for {output_dir_base}/ck-out-{run_name}/ and SLURM outputs
  • Evaluation: Check for {run_dir}/eval/logs/*.eval files
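
A minimal sketch of these filesystem checks, assuming the directory layout described above. The function names are hypothetical, and the checks only confirm that a stage produced output files, not that the job succeeded.

```python
from pathlib import Path


def has_finetune_outputs(output_dir_base: Path, run_name: str) -> bool:
    """True if the checkpoint directory exists and holds a SLURM stdout file."""
    ck_dir = output_dir_base / f"ck-out-{run_name}"
    return ck_dir.is_dir() and any(ck_dir.glob("slurm-*.out"))


def has_eval_logs(run_dir: Path) -> bool:
    """True if at least one inspect-ai .eval log exists for the run."""
    logs_dir = run_dir / "eval" / "logs"
    return logs_dir.is_dir() and any(logs_dir.glob("*.eval"))
```

A run whose files exist but whose job crashed still needs its SLURM output inspected before it is marked COMPLETED.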

3. Extract Training Loss

For each COMPLETED fine-tuning run:

  1. Find SLURM stdout in the output directory:
    • Parse experiment_summary.yaml "Output" section for output_dir_base
    • Look in: {output_dir_base}/ck-out-{run_name}/slurm-*.out
    • If multiple files, use most recent by modification time
  2. Extract final loss using regex: (\d+)\|(\d+)\|Loss: ([0-9.]+)
    • Pattern matches: {epoch}|{step}|Loss: {value}
    • Take the LAST match to get final loss
    • The step number (group 2) from the last match is the total training steps
  3. Record: run_name, final_loss, total_steps, epoch, step

Note: Training SLURM outputs are in the output directory, NOT the run directory.

If SLURM stdout missing:

  • Log warning
  • Record "N/A" for loss
  • Continue with other runs
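
The extraction steps above could look like this in Python. It is a sketch under the stated regex and file layout; the function names and the returned dictionary shape are illustrative assumptions, not part of the skill.

```python
import re
from pathlib import Path
from typing import Optional

# Pattern from this skill: {epoch}|{step}|Loss: {value}
LOSS_RE = re.compile(r"(\d+)\|(\d+)\|Loss: ([0-9.]+)")


def latest_slurm_out(ck_dir: Path) -> Optional[Path]:
    """Pick the most recently modified slurm-*.out file, or None if absent."""
    files = list(ck_dir.glob("slurm-*.out"))
    return max(files, key=lambda p: p.stat().st_mtime) if files else None


def extract_final_loss(text: str) -> Optional[dict]:
    """Return metrics from the LAST loss line, or None if no line matches."""
    matches = LOSS_RE.findall(text)
    if not matches:
        return None  # caller records "N/A" and logs a warning
    epoch, step, loss = matches[-1]
    return {
        "epoch": int(epoch),
        "step": int(step),
        "final_loss": float(loss),
        "total_steps": int(step),  # step of the last match = total steps
    }
```

When `latest_slurm_out` returns `None`, or `extract_final_loss` finds no match, the run gets "N/A" for loss and processing moves on to the next run.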

4. Extract Evaluation Accuracy

For each COMPLETED evaluation:

  1. Find .eval files: {run_dir}/eval/logs/*.eval
  2. For each .eval file, run:
    python tools/inspect/parse_eval_log.py {path}
    
  3. Parse JSON output for accuracy
  4. Map to epoch using SLURM job names (see below)
  5. For binary tasks, also run summary_binary.py to get balanced accuracy and F1
  6. Record: run_name, task, epoch, accuracy, balanced_accuracy, f1, samples

Script output format:

{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}

Mapping Epochs via SLURM Job Names

The .eval files don't currently store epoch information directly. To reliably map each evaluation to its epoch:

  1. Find SLURM output files in the eval directory: {run_dir}/eval/slurm-*.out
  2. Extract job IDs from filenames (e.g., slurm-2773062.out → job ID 2773062)
  3. Query job names via sacct:
    sacct -j {job_ids} --format=JobID,JobName%50
    
  4. Parse epoch from job name - scaffold-inspect names jobs like eval-{task}-{run}-ep{N}:
    • eval-general_eval-lowlr-ep0 → epoch 0
    • eval-general_eval-lowlr-ep9 → epoch 9
  5. Extract accuracy from SLURM output:
    grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out
    

Example workflow:

# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows epoch in job name:
# 2773062  eval-general_eval-lowlr-ep0
# 2773063  eval-general_eval-lowlr-ep1
# 2773065  eval-general_eval-lowlr-ep2

This approach is reliable because:

  • Job names are set by scaffold-inspect and include epoch info
  • Works regardless of submission order or timing
  • Survives job failures and resubmissions
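
As a sketch, the job-ID and epoch parsing can be done in Python on the text that `sacct` prints. The function names are hypothetical, and the parser assumes the whitespace-separated two-column layout shown in the example above.

```python
import re
from typing import Dict, Optional

JOB_ID_RE = re.compile(r"slurm-(\d+)\.out$")
EPOCH_RE = re.compile(r"-ep(\d+)$")  # eval-{task}-{run}-ep{N}


def job_id_from_filename(name: str) -> Optional[str]:
    """slurm-2773062.out -> "2773062"; None for other filenames."""
    match = JOB_ID_RE.search(name)
    return match.group(1) if match else None


def epochs_from_sacct(sacct_output: str) -> Dict[str, int]:
    """Map job ID -> epoch from `sacct --format=JobID,JobName%50` output."""
    epochs = {}
    for line in sacct_output.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and job steps such as
        # "2773062.batch", whose first field is not purely numeric.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        match = EPOCH_RE.search(fields[1])
        if match:
            epochs[fields[0]] = int(match.group(1))
    return epochs
```

Jobs whose names lack an `-ep{N}` suffix are left out of the mapping, so the caller can decide how to treat them.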

If extraction fails:

  • Script returns {"status": "error", "message": "..."}
  • Log the error
  • Record "ERROR" for accuracy
  • Continue with other evaluations
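
A minimal sketch of this error handling, applied to the JSON text that parse_eval_log.py prints. The function name and the `(accuracy, error)` return shape are assumptions made for illustration.

```python
import json
from typing import Optional, Tuple, Union


def accuracy_from_script_output(stdout: str) -> Tuple[Union[float, str], Optional[str]]:
    """Return (accuracy, error_message); accuracy is "ERROR" on any failure."""
    try:
        result = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return "ERROR", f"invalid JSON: {exc}"
    if not isinstance(result, dict) or result.get("status") != "success":
        message = result.get("message") if isinstance(result, dict) else None
        return "ERROR", message or "unknown error"
    if "accuracy" not in result:
        return "ERROR", "no accuracy field in output"
    return result["accuracy"], None
```

The caller logs the second element when it is not `None` and continues with the next .eval file.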

Computing Balanced Accuracy and F1 (Binary Classification)

For binary classification tasks (0/1 targets), use summary_binary.py to compute additional metrics:

python tools/inspect/summary_binary.py {path_to_eval_file} --json

JSON output format:

{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.85,
  "f1": 0.85,
  "precision_1": 0.86,
  "recall_1": 0.84,
  "recall_0": 0.86,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}

Why these metrics matter for imbalanced data:

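  • Balanced Accuracy = (Recall_0 + Recall_1) / 2: each class counts equally, so the majority class cannot inflate the score
  • F1 Score = 2 · Precision_1 · Recall_1 / (Precision_1 + Recall_1), the harmonic mean of precision and recall on the positive class: true negatives are ignored, so a model that mostly predicts the majority negative class scores low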

Note: For non-binary tasks, only accuracy is reported (Bal. Acc and F1 shown as "-").
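
To make the point about imbalanced data concrete, here is a small worked example using the standard textbook definitions. It is not the implementation of summary_binary.py, and the confusion-matrix counts are made up for illustration.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, balanced accuracy, and F1 from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # define 0/0 as 0.0

    recall_1 = ratio(tp, tp + fn)  # recall on the positive class
    recall_0 = ratio(tn, tn + fp)  # recall on the negative class
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "balanced_accuracy": (recall_0 + recall_1) / 2,
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# A model that always predicts 0 on 90 negatives and 10 positives:
metrics = binary_metrics(tp=0, tn=90, fp=0, fn=10)
print(metrics)  # accuracy 0.9, balanced_accuracy 0.5, f1 0.0
```

Plain accuracy looks strong at 0.9, while balanced accuracy (0.5) and F1 (0.0) show that the positive class is never detected.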

5. Generate summary.md

Create {experiment_dir}/summary.md with the following structure:

# Experiment Summary

**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete

## Run Status

| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |

## Training Results

| Run | Final Loss | Total Steps | Epochs | Duration |
|-----|------------|-------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 250 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 250 | 2 | 9m 02s |

**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)

## Evaluation Results

| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |

**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy

## Incomplete Runs

| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |

## Next Steps

1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment` (when available)

---
*Generated by summarize-experiment skill*
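
One way to render the Evaluation Results table from the recorded values is sketched below. The record keys follow the "Record:" list in step 4; the function names and the choice of `None` for missing values are assumptions.

```python
from typing import Iterable

HEADER = ["Run", "Task", "Epoch", "Accuracy", "Bal. Acc", "F1", "Samples"]
KEYS = ["run_name", "task", "epoch", "accuracy", "balanced_accuracy", "f1", "samples"]


def fmt(value) -> str:
    """Format one cell: "-" when missing, two decimals for floats."""
    if value is None:
        return "-"
    return f"{value:.2f}" if isinstance(value, float) else str(value)


def evaluation_table(records: Iterable[dict]) -> str:
    """Build the Markdown table for the Evaluation Results section."""
    lines = [
        "| " + " | ".join(HEADER) + " |",
        "|" + "|".join("-" * (len(h) + 2) for h in HEADER) + "|",
    ]
    for rec in records:
        lines.append("| " + " | ".join(fmt(rec.get(k)) for k in KEYS) + " |")
    return "\n".join(lines)
```

String placeholders such as "ERROR" pass through unchanged, so failed extractions remain visible in the table.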

6. Create Log

Document the process in {experiment_dir}/logs/summarize-experiment.log.

See logging.md for action types and format.

Error Handling

If SLURM stdout missing

  • Log warning with action type EXTRACT_LOSS
  • Record "N/A" for loss in summary
  • Continue with other runs

If .eval file cannot be parsed

  • Log error with file path
  • Record "ERROR" for accuracy in summary
  • Continue with other evaluations

If all runs failed

  • Generate summary noting all failures
  • Include failure states in "Incomplete Runs" section
  • Suggest troubleshooting steps

If partial results

  • Generate summary with available data
  • Clearly indicate which runs are missing in "Incomplete Runs" section
  • Still identify best performing run from available data
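
Picking the best run from partial data might look like the sketch below. Ranking by balanced accuracy mirrors the "Best performing" line in the template; the function name and the default metric are assumptions.

```python
from typing import Iterable, Optional


def best_run(records: Iterable[dict], metric: str = "balanced_accuracy") -> Optional[dict]:
    """Return the record with the highest numeric `metric`, or None.

    Records whose value is missing or a placeholder such as "ERROR"
    are skipped, so partial results still yield a winner.
    """
    usable = [
        rec for rec in records
        if isinstance(rec.get(metric), (int, float))
        and not isinstance(rec.get(metric), bool)
    ]
    return max(usable, key=lambda rec: rec[metric]) if usable else None
```

For non-binary tasks, where balanced accuracy is not reported, calling it with `metric="accuracy"` is one possible fallback.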

Idempotency

Running summarize-experiment multiple times overwrites summary.md. This is intentional:

  • Allows re-running after fixing failed runs
  • Summary always reflects current state
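
The overwrite behaviour can be sketched in a few lines of Python. The helper name `write_summary` is hypothetical; creating the logs directory at the same time is an assumption based on the output layout this skill describes.

```python
from pathlib import Path


def write_summary(experiment_dir: Path, content: str) -> Path:
    """Overwrite summary.md and make sure the logs directory exists."""
    (experiment_dir / "logs").mkdir(parents=True, exist_ok=True)
    summary = experiment_dir / "summary.md"
    summary.write_text(content, encoding="utf-8")  # replaces any previous file
    return summary
```

Because the file is rewritten in full on every call, a rerun after fixing a failed job leaves no stale rows behind.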

Output Files

{experiment_dir}/
├── summary.md                    # Human-readable summary (new)
└── logs/
    └── summarize-experiment.log  # Process log (new)

Relationship to Other Skills

  • After: run-experiment (or manual execution)
  • Before: analyze-experiment (when available)
  • Optional hook: run-experiment can invoke this at completion

Future Compatibility

When analyze-experiment is built, summarize-experiment can either:

  • Remain as a quick summary option (text only, no plots)
  • Be deprecated in favor of richer output
  • Become a first stage that analyze-experiment builds upon