Promptfoo - Framework d'évaluation LLM
Outil CLI pour tester et comparer les résultats des modèles LLM. Permet de créer des configs d'évaluation, des cas de test, et des assertions personnalisées.
name: promptfoo description: | Promptfoo evaluation framework for testing and comparing LLM outputs. Use when writing eval configs, creating test cases, debugging eval runs, or working with assertions. allowed-tools:
- Bash(npx promptfoo:*)
- Bash(npm run evals:*)
- WebFetch(domain:www.promptfoo.dev)
Promptfoo
Promptfoo is a CLI tool for testing and comparing LLM outputs.
Config File
The CLI auto-discovers promptfooconfig.yaml in the current directory. Use -c path for other locations.
Supported extensions: .yaml, .json, .js
Configuration
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "What this eval tests"
prompts:
- file://prompt.txt
- |
Inline prompt with {{variable}} substitution
providers:
- anthropic:messages:claude-sonnet-4-5-20250929
defaultTest:
options:
provider:
config:
temperature: 0.0
max_tokens: 4096
tests:
- description: "What this case tests"
vars:
variable: "value"
from_file: file://data/input.txt
assert:
- type: contains
value: "expected substring"
# Or load tests from files
tests: file://cases/all.yaml
outputPath: ./results.json
evaluateOptions:
maxConcurrency: 4
Provider IDs
| Model | ID |
|-------|----|
| Opus 4.5 | anthropic:messages:claude-opus-4-5-20251101 |
| Sonnet 4.5 | anthropic:messages:claude-sonnet-4-5-20250929 |
| Haiku 4.5 | anthropic:messages:claude-haiku-4-5-20251001 |
Provider config: temperature, max_tokens, top_p, top_k, tools, tool_choice
Prompts
file://path.txt— load from file (path relative to config)- Inline string with
{{variable}}Nunjucks substitution - Chat format via JSON:
[{"role": "system", "content": "..."}, {"role": "user", "content": "{{input}}"}]
Assertion Types
| Type | Use | Value |
|------|-----|-------|
| contains | Substring match | "expected text" |
| icontains | Case-insensitive substring | "expected text" |
| equals | Exact match | "exact value" |
| regex | Pattern match | "\\d{4}-\\d{2}-\\d{2}" |
| is-json | Valid JSON output | — |
| contains-json | Output contains JSON | — |
| starts-with | Prefix match | "prefix" |
| cost | Max cost | threshold: 0.01 |
| latency | Max response time (ms) | threshold: 5000 |
| javascript | Custom JS expression | output.includes('x') |
| python | Custom Python | file://check.py:fn_name |
| llm-rubric | LLM-as-judge | rubric text |
| similar | Semantic similarity | value: "text", threshold: 0.8 |
| model-graded-factuality | Fact checking | — |
Prefix any assertion with not- to negate (e.g., not-contains).
llm-rubric
Uses an LLM to grade output against a rubric:
assert:
- type: llm-rubric
value: |
The response should:
- Mention at least 3 factors
- Include specific examples
threshold: 0.7
provider: anthropic:messages:claude-sonnet-4-5-20250929
javascript
Inline expressions or functions. Access output (string) and context (with vars, prompt):
assert:
- type: javascript
value: output.length > 100 && output.includes('route')
- type: javascript
value: |
const data = JSON.parse(output);
return data.calories >= 200 && data.calories <= 300;
Test Organization
Split cases into separate files and reference them:
tests:
- file://cases/basic.yaml
- file://cases/edge-cases.yaml
Each case file contains a YAML array of test objects.
CLI
npx promptfoo eval # Run with auto-discovered config
npx promptfoo eval -c path/to/config.yaml # Specific config
npx promptfoo eval --filter-metadata key=v # Filter tests
npx promptfoo view # Web UI for results
npx promptfoo cache clear # Clear result cache
References
Consult the configuration reference and Anthropic provider docs for full details.
Skills similaires
TDD Red-Green-Refactor
Skill qui guide Claude a travers le cycle TDD complet.
Audit d'Accessibilité Web
Réalise un audit d'accessibilité web complet selon les normes WCAG.
Générateur de Tests UAT
Génère des cas de test d'acceptation utilisateur structurés et complets.