Promptfoo - LLM Evaluation Framework
CLI tool for testing and comparing LLM outputs. Create evaluation configs, test cases, and custom assertions for validating model behavior.
name: promptfoo
description: |
  Promptfoo evaluation framework for testing and comparing LLM outputs. Use when writing eval configs, creating test cases, debugging eval runs, or working with assertions.
allowed-tools:
  - Bash(npx promptfoo:*)
  - Bash(npm run evals:*)
  - WebFetch(domain:www.promptfoo.dev)
Config File
The CLI auto-discovers promptfooconfig.yaml in the current directory. Use -c path for other locations.
Supported extensions: .yaml, .json, .js
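For the .js variant, a config can be expressed as a plain object. The following is a minimal sketch, assuming the file is loaded as a CommonJS module whose export has the same shape as the YAML form; the file name promptfooconfig.js and the sample values are illustrative.

```javascript
// promptfooconfig.js (illustrative; assumes a CommonJS export with the
// same structure as the YAML config)
const config = {
  description: 'What this eval tests',
  prompts: ['file://prompt.txt'],
  providers: ['anthropic:messages:claude-sonnet-4-5-20250929'],
  tests: [
    {
      description: 'What this case tests',
      vars: { variable: 'value' },
      assert: [{ type: 'contains', value: 'expected substring' }],
    },
  ],
};

module.exports = config;
```

A JavaScript config is mainly useful when test cases or provider lists need to be computed rather than written out by hand.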
Configuration
    # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
    description: "What this eval tests"

    prompts:
      - file://prompt.txt
      - |
        Inline prompt with {{variable}} substitution

    providers:
      - anthropic:messages:claude-sonnet-4-5-20250929

    defaultTest:
      options:
        provider:
          config:
            temperature: 0.0
            max_tokens: 4096

    tests:
      - description: "What this case tests"
        vars:
          variable: "value"
          from_file: file://data/input.txt
        assert:
          - type: contains
            value: "expected substring"

    # Or load all tests from a file instead of listing them inline
    # (a config can have only one tests key):
    # tests: file://cases/all.yaml

    outputPath: ./results.json

    evaluateOptions:
      maxConcurrency: 4
Provider IDs
| Model | ID |
|-------|----|
| Opus 4.5 | anthropic:messages:claude-opus-4-5-20251101 |
| Sonnet 4.5 | anthropic:messages:claude-sonnet-4-5-20250929 |
| Haiku 4.5 | anthropic:messages:claude-haiku-4-5-20251001 |
Provider config: temperature, max_tokens, top_p, top_k, tools, tool_choice
Prompts
- file://path.txt: load from file (path relative to config)
- Inline string with {{variable}} Nunjucks substitution
- Chat format via JSON:
  [{"role": "system", "content": "..."}, {"role": "user", "content": "{{input}}"}]
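To illustrate how a chat-format prompt turns into concrete messages, here is a rough sketch of variable substitution. The renderTemplate helper is a hypothetical, simplified stand-in and is not Nunjucks: it only replaces {{name}} placeholders, whereas Nunjucks also supports filters, conditionals, and loops. The sample system message and input value are made up.

```javascript
// Simplified stand-in for Nunjucks substitution (illustration only):
// replaces {{name}} with vars[name] and leaves unknown placeholders untouched.
function renderTemplate(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

// Chat-format prompt, serialized the way it would appear in a JSON prompt file
const chatPrompt = JSON.stringify([
  { role: 'system', content: 'You are a concise assistant.' },
  { role: 'user', content: '{{input}}' },
]);

const rendered = renderTemplate(chatPrompt, { input: 'Summarize this ticket.' });
const messages = JSON.parse(rendered);
console.log(messages[1].content);
```

This naive version does not escape quotes or newlines in variable values, so a value containing a double quote would produce invalid JSON here.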
Assertion Types
| Type | Use | Value |
|------|-----|-------|
| contains | Substring match | "expected text" |
| icontains | Case-insensitive substring | "expected text" |
| equals | Exact match | "exact value" |
| regex | Pattern match | "\\d{4}-\\d{2}-\\d{2}" |
| is-json | Valid JSON output | — |
| contains-json | Output contains JSON | — |
| starts-with | Prefix match | "prefix" |
| cost | Max cost | threshold: 0.01 |
| latency | Max response time (ms) | threshold: 5000 |
| javascript | Custom JS expression | output.includes('x') |
| python | Custom Python | file://check.py:fn_name |
| llm-rubric | LLM-as-judge | rubric text |
| similar | Semantic similarity | value: "text", threshold: 0.8 |
| model-graded-factuality | Fact checking | — |
Prefix any assertion with not- to negate (e.g., not-contains).
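To make the table concrete, the deterministic types can be approximated with plain string checks. This is an illustrative sketch of the semantics described above, not promptfoo's implementation; the checks map and the runAssertion helper are hypothetical names, and edge cases such as non-string outputs or regex flags are ignored.

```javascript
// Approximate semantics of a few deterministic assertion types
const checks = {
  contains: (output, value) => output.includes(value),
  icontains: (output, value) => output.toLowerCase().includes(value.toLowerCase()),
  equals: (output, value) => output === value,
  regex: (output, value) => new RegExp(value).test(output),
  'starts-with': (output, value) => output.startsWith(value),
  'is-json': (output) => {
    try {
      JSON.parse(output);
      return true;
    } catch {
      return false;
    }
  },
};

// Handles the not- prefix by inverting the result of the base check
function runAssertion(type, output, value) {
  const negate = type.startsWith('not-');
  const baseType = negate ? type.slice('not-'.length) : type;
  const check = checks[baseType];
  if (!check) {
    throw new Error(`Unsupported assertion type: ${type}`);
  }
  const passed = check(output, value);
  return negate ? !passed : passed;
}

console.log(runAssertion('regex', 'Due 2024-01-31', '\\d{4}-\\d{2}-\\d{2}'));
console.log(runAssertion('not-contains', 'hello world', 'goodbye'));
```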
llm-rubric
Uses an LLM to grade output against a rubric:
    assert:
      - type: llm-rubric
        value: |
          The response should:
          - Mention at least 3 factors
          - Include specific examples
        threshold: 0.7
        provider: anthropic:messages:claude-sonnet-4-5-20250929
javascript
Accepts an inline expression or a multi-line function body that returns a result. Both can access output (the response string) and context (an object that includes vars and prompt):
    assert:
      - type: javascript
        value: output.length > 100 && output.includes('route')

      - type: javascript
        value: |
          const data = JSON.parse(output);
          return data.calories >= 200 && data.calories <= 300;
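When the logic outgrows an inline snippet, it may be easier to keep it in a separate file. The sketch below assumes the assertion is referenced as value: file://assert-calories.js (a hypothetical file name), that the module's exported function is called with output and context, and that it may return an object with pass, score, and reason instead of a bare boolean. The calories field comes from the inline example; the min_calories and max_calories vars are hypothetical.

```javascript
// assert-calories.js (hypothetical file name)
function checkCalories(output, context) {
  // Optional bounds read from test vars (hypothetical names), with defaults
  const vars = (context && context.vars) || {};
  const min = Number(vars.min_calories ?? 200);
  const max = Number(vars.max_calories ?? 300);

  let data;
  try {
    data = JSON.parse(output);
  } catch (err) {
    // Report malformed output as a failure instead of throwing
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }

  const calories = data?.calories;
  const pass = typeof calories === 'number' && calories >= min && calories <= max;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? `calories=${calories} is within [${min}, ${max}]`
      : `calories=${calories} is outside [${min}, ${max}]`,
  };
}

module.exports = checkCalories;
```

Returning a reason string makes failures easier to diagnose than a bare false, and catching the parse error keeps one malformed response from looking like a crash in the assertion itself.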
Test Organization
Split cases into separate files and reference them:
    tests:
      - file://cases/basic.yaml
      - file://cases/edge-cases.yaml
Each case file contains a YAML array of test objects.
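Case files with many near-identical entries can be generated rather than written by hand. A minimal sketch, assuming each test object uses the description, vars, and assert fields shown in the configuration; the rows data, the buildCases helper, and the question variable are hypothetical. The script prints JSON, which would still need to be converted to YAML or saved in whatever format the loader accepts.

```javascript
// Hypothetical source data: one row per test case
const rows = [
  { question: 'What is the capital of France?', expected: 'Paris' },
  { question: 'What is the capital of Japan?', expected: 'Tokyo' },
];

// Maps each row to a test object with a case-insensitive substring assertion
function buildCases(inputRows) {
  return inputRows.map((row) => ({
    description: `Answers: ${row.question}`,
    vars: { question: row.question },
    assert: [{ type: 'icontains', value: row.expected }],
  }));
}

const cases = buildCases(rows);
console.log(JSON.stringify(cases, null, 2));
```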
CLI
    npx promptfoo eval                            # Run with auto-discovered config
    npx promptfoo eval -c path/to/config.yaml     # Specific config
    npx promptfoo eval --filter-metadata key=v    # Filter tests
    npx promptfoo view                            # Web UI for results
    npx promptfoo cache clear                     # Clear result cache
References
Consult the configuration reference and Anthropic provider docs for full details.