Our review

Continuously monitors RL training logs, extracts key metrics, detects anomalies, and performs deep scans of rollout and judge outputs for LLM-as-a-Judge research.

Strengths

Real-time automated monitoring
Early detection of anomalies like gradient explosion or convergence stalls
Research-oriented analysis of reward hacking patterns and biases
Structured reports with suspicious cases and actionable hypotheses

Limitations

Requires a valid log file path
Relies on basic Unix tools (tail, grep)
Output may be large and require manual review for complex cases

When to use it

During extended RL training runs where you need both automated performance tracking and in-depth qualitative analysis of model behavior.

When not to use it

For a quick one-time log review without the need for continuous monitoring or research-specific scanning.

Examples

Basic RL log monitoring with periodic summaries

Monitor the RL training log at /path/to/log.txt, extract reward and loss every 100 steps, and provide a summary highlighting any anomalies or stagnation.

Scan rollout outputs for reward hacking

Scan the rollout output files in /path/to/rollout/ for suspicious patterns that might indicate reward hacking, such as unusually high scores with obvious flaws in the response. List potential hacking patterns.

Analyze judge outputs for bias

Analyze the judge output files in /path/to/judge/ for systematic bias in scoring. Look for score distributions skewed by rubric dimensions or entity mentions, and report any concerning trends.

name: rl-log-monitor
description: Continuously monitor RL training logs and summarize key metrics, anomalies, and trends
allowed-tools: Read, Bash(tail:, grep:)
context: fork
agent: Explore

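Monitor the RL training log $ARGUMENTS:

1. Basic monitoring

Use tail -f to read the log file continuously
Extract key metrics: reward, loss, episode length, success rate
Identify anomalous patterns: gradient explosion, convergence stalls, performance degradation
Generate an interim summary every N iterations
Flag issues that need human intervention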
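
The extraction and summary steps above can be sketched in Python. This is only an illustration under stated assumptions: the log line format `step=... reward=... loss=...`, the function names, and the thresholds (`spike_factor`, `stall_eps`) are hypothetical and not part of the skill, which itself relies on tail and grep. A real trainer's log format will need its own pattern.

```python
import math
import re
from statistics import mean, median

# Hypothetical log format, e.g. "step=100 reward=0.52 loss=1.31".
LINE_RE = re.compile(r"step=(\d+)\s+reward=(\S+)\s+loss=(\S+)")


def parse_line(line):
    """Return (step, reward, loss), or None if the line carries no metrics."""
    m = LINE_RE.search(line)
    if not m:
        return None
    try:
        return int(m.group(1)), float(m.group(2)), float(m.group(3))
    except ValueError:
        return None


def summarize(lines, window=100, spike_factor=10.0, stall_eps=1e-3):
    """Summarize every `window` parsed records and attach anomaly flags."""
    records = [r for r in map(parse_line, lines) if r is not None]
    summaries = []
    prev_reward = None
    for start in range(0, len(records), window):
        chunk = records[start:start + window]
        rewards = [reward for _, reward, _ in chunk]
        losses = [loss for _, _, loss in chunk]
        finite = [loss for loss in losses if math.isfinite(loss)]
        flags = []
        # NaN or inf loss is a common symptom of exploding gradients.
        if len(finite) < len(losses):
            flags.append("non-finite loss")
        # A loss far above the window median is treated as a spike.
        if finite and max(finite) > spike_factor * median(finite):
            flags.append("loss spike")
        mean_reward = mean(rewards)
        if prev_reward is not None:
            if abs(mean_reward - prev_reward) < stall_eps:
                flags.append("reward stalled")
            elif mean_reward < prev_reward:
                flags.append("reward dropped")
        prev_reward = mean_reward
        summaries.append({
            "first_step": chunk[0][0],
            "last_step": chunk[-1][0],
            "mean_reward": mean_reward,
            "flags": flags,
        })
    return summaries
```

The function works on a finite batch of lines, such as a snapshot of the log file. Consuming a live `tail -f` stream would need an incremental variant that emits a summary each time a window fills.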

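2. Rollout and judge output scanning (research-oriented)

Research background

This project studies the use of LLM-as-a-Judge (LaaJ) in RL, focusing on:

How rubrics are designed and how they are used in the RL pipeline
The characteristics of reward hacking and how well it stays hidden (whether it can fool the in-domain test set)
The effect of hidden biases on training results

Scan tasks

Scan rollout outputs file by file
- Check whether the generated responses show suspicious patterns
- Identify high-scoring cases that mismatch human preference (e.g., a preference for a particular region leads to related entities appearing frequently in responses)
Scan judge outputs file by file
- Analyze the judge's score distribution and outliers
- Identify systematic judge biases (naturally occurring or injected)
- Track score trends for each rubric dimension
Discover reward hacking signatures
- Find "shortcut" patterns the model has learned
- Record cases with very high scores that are clearly unreasonable (cases with dissemination value or impact)
- Compare training runs with and without bias injection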
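
One way to make the entity-related checks above concrete is a simple mention count compared against judge scores. The sketch below is an assumption-laden illustration, not the skill's method: the record fields (`prompt`, `response`, `score`), the function names, and the thresholds are all hypothetical, and plain substring counting is a crude stand-in for proper entity recognition.

```python
from statistics import mean


def count_mentions(text, entities):
    """Count case-insensitive substring occurrences of any entity in text."""
    lowered = text.lower()
    return sum(lowered.count(entity.lower()) for entity in entities)


def entity_score_gap(cases, entities):
    """Mean score of responses mentioning an entity minus the mean of the rest.

    Returns None when either group is empty, since no comparison is possible.
    """
    with_entity, without_entity = [], []
    for case in cases:
        hits = count_mentions(case["response"], entities)
        (with_entity if hits > 0 else without_entity).append(case["score"])
    if not with_entity or not without_entity:
        return None
    return mean(with_entity) - mean(without_entity)


def suspicious_cases(cases, entities, score_threshold, min_mentions=2):
    """Return high-scoring cases whose responses mention entities repeatedly."""
    flagged = []
    for case in cases:
        mentions = count_mentions(case["response"], entities)
        if case["score"] >= score_threshold and mentions >= min_mentions:
            flagged.append({**case, "mentions": mentions})
    return flagged
```

A large positive gap is only a hint that the judge may reward entity mentions. It does not separate a biased judge from responses that are better for other reasons, so flagged cases still need human review.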

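Output format

After each scan, generate a report containing:

Summary of key findings
List of suspicious cases (prompt, response, score, analysis)
Hypothesized hacking patterns
Suggested points for human review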
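
The four report sections listed above map naturally onto a small data structure. The following is a minimal sketch, assuming Python 3.9 or later. The class names, field names, and plain-text layout are hypothetical, since the skill does not prescribe a schema beyond the four sections and the per-case fields.

```python
from dataclasses import dataclass, field


@dataclass
class SuspiciousCase:
    prompt: str
    response: str
    score: float
    analysis: str


@dataclass
class ScanReport:
    key_findings: list[str] = field(default_factory=list)
    suspicious_cases: list[SuspiciousCase] = field(default_factory=list)
    hacking_hypotheses: list[str] = field(default_factory=list)
    review_points: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text, one section per report part."""
        lines = ["Key findings"]
        lines += [f"- {item}" for item in self.key_findings]
        lines.append("Suspicious cases")
        for i, case in enumerate(self.suspicious_cases, 1):
            lines.append(f"{i}. score={case.score} prompt={case.prompt!r}")
            lines.append(f"   response={case.response!r}")
            lines.append(f"   analysis: {case.analysis}")
        lines.append("Hacking pattern hypotheses")
        lines += [f"- {item}" for item in self.hacking_hypotheses]
        lines.append("Suggested human review points")
        lines += [f"- {item}" for item in self.review_points]
        return "\n".join(lines)
```

Keeping the cases as structured records rather than free text makes it easier to diff reports across scans or to filter them by score later.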
