Our review
This skill lets you record, track, and evaluate the accuracy of AI predictions over time, assigning a status (verified, falsified, etc.) and an accuracy score.
Strengths
- Clear structure with required and optional fields for each prediction
- Nuanced evaluation categories (verified, falsified, partially verified, etc.)
- Step-by-step evaluation process with 0-1 scoring
- Predictor calibration assessment (underconfident, overconfident, etc.)
Limitations
- Requires human interpretation for ambiguous predictions
- Does not automate evidence gathering
- Accuracy scoring remains subjective without predefined quantitative criteria
Use it when you want to analyze the prediction history of an AI researcher or commentator to judge their reliability.
Avoid it if you are looking for an automated fact-checking tool, or if the predictions are purely qualitative with no clear deadline.
Security analysis
Safe
This skill is purely descriptive and provides a framework for tracking and evaluating AI predictions. It does not instruct any code execution, tool usage, or actions that could compromise security. It is designed for manual use in recording and analyzing prediction accuracy.
No concerns found
Examples
Evaluate the prediction by Sam Altman from March 2023 that GPT-5 would achieve AGI by 2027. Use the prediction tracking skill: restate the prediction, identify timeframe, gather evidence, assign status, score accuracy, and note lessons.
Using the prediction tracking skill, compile a list of predictions made by Gary Marcus over the past 5 years, evaluate each with status and accuracy score, then compute overall accuracy and calibration (overconfident, underconfident, etc.) using the stats JSON output format.
name: prediction-tracking
description: Track and evaluate AI predictions over time to assess accuracy. Use when reviewing past predictions to determine if they came true, failed, or remain uncertain.
Prediction Tracking Skill
Track predictions made by AI researchers and critics, and evaluate their accuracy over time.
Prediction Recording
When recording a new prediction, capture:
Required Fields
- text: The prediction as stated
- author: Who made it
- madeAt: When it was made
- timeframe: When they expect it to happen
- topic: What area of AI
- confidence: How confident they seemed
Optional Fields
- sourceUrl: Where the prediction was made
- targetDate: Specific date if mentioned
- conditions: Any caveats or conditions
- metrics: How to measure success
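The field list above could be captured as a small record type. The following is a minimal sketch, not part of the skill itself: the field names come from the list, while the value types (plain strings), the `Prediction` class, and the `missing_required` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Required field names, taken from the "Required Fields" list.
REQUIRED_FIELDS = ["text", "author", "madeAt", "timeframe", "topic", "confidence"]


@dataclass
class Prediction:
    # Required fields
    text: str
    author: str
    madeAt: str       # assumption: a date string such as "2023-03-01"
    timeframe: str
    topic: str
    confidence: str   # assumption: a free-text label such as "high" or "low"
    # Optional fields
    sourceUrl: Optional[str] = None
    targetDate: Optional[str] = None
    conditions: Optional[str] = None
    metrics: Optional[str] = None


def missing_required(record: dict) -> list:
    """Return the required field names that are absent or empty in a raw record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical usage with placeholder values
draft = {"text": "Example claim", "author": "Example Author"}
print(missing_required(draft))
```

Checking a raw record before building a `Prediction` makes it obvious which required details still need to be looked up.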
Evaluation Status
When evaluating predictions, assign one of:
verified
Clearly came true as stated.
- The predicted capability/event occurred
- Within the stated timeframe
- Substantially as described
falsified
Clearly did not come true.
- Timeframe passed without occurrence
- Contradictory evidence emerged
- Author retracted or modified claim
partially-verified
Partially accurate.
- Some aspects came true, others didn't
- Capability exists but weaker than claimed
- Timeframe was off but direction correct
too-early
Not enough time has passed.
- Still within stated timeframe
- No definitive evidence either way
unfalsifiable
Cannot be objectively assessed.
- Too vague to measure
- No clear success criteria
- Moved goalposts
ambiguous
Prediction was too vague to evaluate.
- Multiple interpretations possible
- Success criteria unclear
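The six statuses above form a closed set, so they can be modeled as an enumeration. This is a sketch under assumptions: the string values are the status names from this section, but the `Status` class, the `parse_status` helper, and the choice of which statuses count as "resolved" (and therefore eligible for an accuracy score) are illustrative and not defined by the skill.

```python
from enum import Enum


class Status(str, Enum):
    VERIFIED = "verified"
    FALSIFIED = "falsified"
    PARTIALLY_VERIFIED = "partially-verified"
    TOO_EARLY = "too-early"
    UNFALSIFIABLE = "unfalsifiable"
    AMBIGUOUS = "ambiguous"


# Assumption: only these three statuses describe an outcome that can be scored.
RESOLVED = {Status.VERIFIED, Status.FALSIFIED, Status.PARTIALLY_VERIFIED}


def parse_status(value: str) -> Status:
    """Normalize a status string; raises ValueError for anything outside the six names."""
    return Status(value.strip().lower())


def is_resolved(status: Status) -> bool:
    """True when the status describes an outcome rather than a deferral."""
    return status in RESOLVED
```

Rejecting unknown strings early keeps typos such as "verifed" out of later statistics.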
Evaluation Process
For each prediction being evaluated:
1. Restate the prediction
What exactly was claimed?
2. Identify timeframe
Has enough time passed to evaluate?
3. Gather evidence
What has happened since?
- Relevant releases or announcements
- Benchmark results
- Real-world deployments
- Counter-evidence
4. Assess status
Which evaluation status applies?
5. Score accuracy
If verifiable, rate 0.0-1.0:
- 1.0: Exactly as predicted
- 0.7-0.9: Substantially correct
- 0.4-0.6: Partially correct
- 0.1-0.3: Mostly wrong
- 0.0: Completely wrong
6. Note lessons
What does this tell us about:
- The author's forecasting ability
- The topic's predictability
- Common prediction pitfalls
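The score bands in step 5 can be turned into a lookup. One caveat: the bands as listed leave gaps (for example 0.95, 0.65, and 0.05 fall between them). The sketch below assumes each band extends down to the top of the next one; that gap-filling rule and the `score_band` name are assumptions, not part of the skill.

```python
def score_band(score: float) -> str:
    """Map a 0.0-1.0 accuracy score to the band labels from step 5."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("accuracy score must be between 0.0 and 1.0")
    if score == 1.0:
        return "Exactly as predicted"
    if score >= 0.7:   # listed band 0.7-0.9, assumed to extend up to just below 1.0
        return "Substantially correct"
    if score >= 0.4:   # listed band 0.4-0.6, assumed to extend up to just below 0.7
        return "Partially correct"
    if score > 0.0:    # listed band 0.1-0.3, assumed to cover everything above 0.0
        return "Mostly wrong"
    return "Completely wrong"


print(score_band(0.85))
```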
Output Format
For evaluation:
{
  "evaluations": [
    {
      "predictionId": "id",
      "status": "verified",
      "accuracyScore": 0.85,
      "evidence": "Description of evidence",
      "notes": "Additional context",
      "evaluatedAt": "timestamp"
    }
  ]
}
For accuracy statistics:
{
  "author": "Author name",
  "totalPredictions": 15,
  "verified": 5,
  "falsified": 3,
  "partiallyVerified": 2,
  "pending": 4,
  "unfalsifiable": 1,
  "averageAccuracy": 0.62,
  "topicBreakdown": {
    "reasoning": { "predictions": 5, "accuracy": 0.7 },
    "agents": { "predictions": 3, "accuracy": 0.4 }
  },
  "calibration": "Assessment of how well-calibrated they are"
}
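The statistics object could be derived from a list of evaluation entries. The following sketch rests on several assumptions that the skill does not spell out: "too-early" is counted as `pending`, "ambiguous" has no counter of its own and only contributes to `totalPredictions`, `averageAccuracy` is the mean over evaluations that carry a score, and topics are supplied through a separate `topics` mapping because evaluation entries have no topic field. The `summarize` function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Assumption: how each status maps onto a counter in the statistics format.
STATUS_TO_KEY = {
    "verified": "verified",
    "falsified": "falsified",
    "partially-verified": "partiallyVerified",
    "too-early": "pending",
    "unfalsifiable": "unfalsifiable",
}


def summarize(author: str, evaluations: list, topics: dict) -> dict:
    """Build a statistics dict from evaluation entries and a predictionId -> topic map."""
    stats = {"author": author, "totalPredictions": len(evaluations),
             "verified": 0, "falsified": 0, "partiallyVerified": 0,
             "pending": 0, "unfalsifiable": 0}
    scores = []
    by_topic = defaultdict(list)
    for ev in evaluations:
        key = STATUS_TO_KEY.get(ev["status"])
        if key is not None:
            stats[key] += 1
        score = ev.get("accuracyScore")
        if score is not None:
            scores.append(score)
            by_topic[topics[ev["predictionId"]]].append(score)
    stats["averageAccuracy"] = round(mean(scores), 2) if scores else None
    # "predictions" here counts only the scored predictions in each topic.
    stats["topicBreakdown"] = {
        topic: {"predictions": len(vals), "accuracy": round(mean(vals), 2)}
        for topic, vals in by_topic.items()
    }
    return stats
```

The free-text `calibration` field is left out of this sketch, since it is a judgment rather than a count.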
Calibration Assessment
Evaluate whether predictors are well-calibrated:
Well-Calibrated
- High-confidence predictions usually come true
- Low-confidence predictions have mixed results
- Acknowledges uncertainty appropriately
Overconfident
- High-confidence predictions often fail
- Rarely expresses uncertainty
- Doesn't update on evidence
Underconfident
- Low-confidence predictions often come true
- Hedges even on likely outcomes
- Too conservative
Inconsistent
- Confidence doesn't correlate with accuracy
- Random relationship between stated and actual accuracy
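The four calibration categories are described qualitatively, so any automated version needs thresholds the skill does not give. The sketch below is one possible reading: it assumes confidence is recorded as "high" or "low", treats a hit rate of 0.7 or more as "usually come true", and returns `None` when either bucket is empty. The `classify_calibration` name, the 0.7 cutoff, and the rule ordering are all assumptions.

```python
USUALLY = 0.7  # assumed cutoff for "usually come true"


def classify_calibration(outcomes):
    """outcomes: list of (confidence, came_true) pairs, confidence in {"high", "low"}."""
    high = [ok for conf, ok in outcomes if conf == "high"]
    low = [ok for conf, ok in outcomes if conf == "low"]
    if not high or not low:
        return None  # not enough data to compare the two confidence levels
    high_rate = sum(high) / len(high)
    low_rate = sum(low) / len(low)
    if high_rate >= USUALLY and low_rate < USUALLY:
        return "well-calibrated"   # high confidence holds up, low confidence is mixed
    if high_rate >= USUALLY:
        return "underconfident"    # even low-confidence predictions usually come true
    if low_rate <= high_rate:
        return "overconfident"     # high-confidence predictions often fail
    return "inconsistent"          # low confidence outperforms high confidence


sample = [("high", True)] * 4 + [("low", True), ("low", False)]
print(classify_calibration(sample))
```

A human reviewer would still want to weigh the softer signals in the lists above, such as whether the predictor updates on evidence.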
Tracking Notable Predictors
Keep running assessments of key voices:
| Predictor | Total | Accuracy | Calibration | Notes |
|-----------|-------|----------|-------------|-------|
| Sam Altman | 20 | 55% | Overconfident | Timeline optimism |
| Gary Marcus | 15 | 70% | Well-calibrated | Conservative |
| Dario Amodei | 12 | 65% | Slightly over | Safety-focused |
Red Flags
Watch for prediction patterns that suggest bias:
- Always bullish regardless of topic
- Never acknowledges failed predictions
- Moves goalposts when wrong
- Predictions align suspiciously with financial interests
- Vague enough to claim credit for anything