Promptfoo - LLM Evaluation Framework

CLI tool for testing and comparing LLM model outputs. Lets you create evaluation configs, test cases, and custom assertions.

Spar Skills Guide Bot · Testing · Intermediate · 08/03/2026
Claude Code · Cursor · Windsurf · Copilot
Tags: llm-testing, evaluation-framework, prompt-engineering, quality-assurance, cli-tool

---
name: promptfoo
description: |
  Promptfoo evaluation framework for testing and comparing LLM outputs.
  Use when writing eval configs, creating test cases, debugging eval runs,
  or working with assertions.
allowed-tools:
  - Bash(npx promptfoo:*)
  - Bash(npm run evals:*)
  - WebFetch(domain:www.promptfoo.dev)
---

Promptfoo

Promptfoo is a CLI tool for testing and comparing LLM outputs.

Config File

The CLI auto-discovers promptfooconfig.yaml in the current directory. Use -c path for other locations.

Supported extensions: .yaml, .json, .js

Configuration

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "What this eval tests"

prompts:
  - file://prompt.txt
  - |
    Inline prompt with {{variable}} substitution

providers:
  - anthropic:messages:claude-sonnet-4-5-20250929

defaultTest:
  options:
    provider:
      config:
        temperature: 0.0
        max_tokens: 4096

tests:
  - description: "What this case tests"
    vars:
      variable: "value"
      from_file: file://data/input.txt
    assert:
      - type: contains
        value: "expected substring"

# Or load tests from files
tests: file://cases/all.yaml

outputPath: ./results.json

evaluateOptions:
  maxConcurrency: 4

Provider IDs

| Model | ID |
|-------|----|
| Opus 4.5 | anthropic:messages:claude-opus-4-5-20251101 |
| Sonnet 4.5 | anthropic:messages:claude-sonnet-4-5-20250929 |
| Haiku 4.5 | anthropic:messages:claude-haiku-4-5-20251001 |

Provider config: temperature, max_tokens, top_p, top_k, tools, tool_choice
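These options sit under a config key on the provider entry; a minimal sketch using a model ID from the table above:

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      temperature: 0.0
      max_tokens: 1024
```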

Prompts

  • file://path.txt — load from file (path relative to config)
  • Inline string with {{variable}} Nunjucks substitution
  • Chat format via JSON: [{"role": "system", "content": "..."}, {"role": "user", "content": "{{input}}"}]

Assertion Types

| Type | Use | Value |
|------|-----|-------|
| contains | Substring match | "expected text" |
| icontains | Case-insensitive substring | "expected text" |
| equals | Exact match | "exact value" |
| regex | Pattern match | "\\d{4}-\\d{2}-\\d{2}" |
| is-json | Valid JSON output | — |
| contains-json | Output contains JSON | — |
| starts-with | Prefix match | "prefix" |
| cost | Max cost | threshold: 0.01 |
| latency | Max response time (ms) | threshold: 5000 |
| javascript | Custom JS expression | output.includes('x') |
| python | Custom Python | file://check.py:fn_name |
| llm-rubric | LLM-as-judge | rubric text |
| similar | Semantic similarity | value: "text", threshold: 0.8 |
| model-graded-factuality | Fact checking | — |

Prefix any assertion with not- to negate (e.g., not-contains).
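For example, to fail a test when the output contains a phrase (the value string here is purely illustrative):

```yaml
assert:
  - type: not-contains
    value: "as an AI language model"
```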

llm-rubric

Uses an LLM to grade output against a rubric:

assert:
  - type: llm-rubric
    value: |
      The response should:
      - Mention at least 3 factors
      - Include specific examples
    threshold: 0.7
    provider: anthropic:messages:claude-sonnet-4-5-20250929

javascript

Inline expressions or functions. Access output (string) and context (with vars, prompt):

assert:
  - type: javascript
    value: output.length > 100 && output.includes('route')
  - type: javascript
    value: |
      const data = JSON.parse(output);
      return data.calories >= 200 && data.calories <= 300;
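The python assertion type from the table above points at a function in a file (file://check.py:fn_name). A sketch, assuming a hypothetical check_calories function mirroring the JS example:

```python
# check.py — hypothetical custom assertion, referenced as file://check.py:check_calories
import json

def check_calories(output, context):
    """output is the raw model response (string); context carries vars and the prompt.
    Return True/False (or a score) to pass/fail the case."""
    data = json.loads(output)
    return 200 <= data["calories"] <= 300
```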

Test Organization

Split cases into separate files and reference them:

tests:
  - file://cases/basic.yaml
  - file://cases/edge-cases.yaml

Each case file contains a YAML array of test objects.
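So a hypothetical cases/basic.yaml would look like:

```yaml
# cases/basic.yaml — a YAML array; each entry is one test object
- description: "Greets the user"
  vars:
    variable: "hello"
  assert:
    - type: icontains
      value: "hello"
```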

CLI

npx promptfoo eval                         # Run with auto-discovered config
npx promptfoo eval -c path/to/config.yaml  # Specific config
npx promptfoo eval --filter-metadata key=v # Filter tests
npx promptfoo view                         # Web UI for results
npx promptfoo cache clear                  # Clear result cache
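The Bash(npm run evals:*) allowed-tool above assumes eval runs are wrapped in npm scripts; a hypothetical package.json fragment (script names and config path are illustrative):

```json
{
  "scripts": {
    "evals": "promptfoo eval -c evals/promptfooconfig.yaml",
    "evals:view": "promptfoo view"
  }
}
```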

References

Consult the configuration reference and Anthropic provider docs for full details.
