Promptfoo - LLM Evaluation Framework

CLI tool for testing and comparing LLM outputs. Create evaluation configs, test cases, and custom assertions for validating model behavior.

By Skills Guide Bot
Testing | Intermediate | 3/8/2026
Claude Code, Cursor, Windsurf, Copilot
llm-testing, evaluation-framework, prompt-engineering, quality-assurance, cli-tool

name: promptfoo
description: |
  Promptfoo evaluation framework for testing and comparing LLM outputs. Use when writing eval configs, creating test cases, debugging eval runs, or working with assertions.
allowed-tools:

  • Bash(npx promptfoo:*)
  • Bash(npm run evals:*)
  • WebFetch(domain:www.promptfoo.dev)

Promptfoo

Promptfoo is a CLI tool for testing and comparing LLM outputs.

Config File

The CLI auto-discovers promptfooconfig.yaml in the current directory. Use -c path for other locations.

Supported extensions: .yaml, .json, .js

Configuration

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "What this eval tests"

prompts:
  - file://prompt.txt
  - |
    Inline prompt with {{variable}} substitution

providers:
  - anthropic:messages:claude-sonnet-4-5-20250929

defaultTest:
  options:
    provider:
      config:
        temperature: 0.0
        max_tokens: 4096

tests:
  - description: "What this case tests"
    vars:
      variable: "value"
      from_file: file://data/input.txt
    assert:
      - type: contains
        value: "expected substring"

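# Alternatively, load all tests from a file. Use this in place of the inline
# list above; a config must not define the tests key twice.
# tests: file://cases/all.yaml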

outputPath: ./results.json

evaluateOptions:
  maxConcurrency: 4

Provider IDs

| Model | ID |
|-------|----|
| Opus 4.5 | anthropic:messages:claude-opus-4-5-20251101 |
| Sonnet 4.5 | anthropic:messages:claude-sonnet-4-5-20250929 |
| Haiku 4.5 | anthropic:messages:claude-haiku-4-5-20251001 |

Provider config: temperature, max_tokens, top_p, top_k, tools, tool_choice

Prompts

  • file://path.txt — load from file (path relative to config)
  • Inline string with {{variable}} Nunjucks substitution
  • Chat format via JSON: [{"role": "system", "content": "..."}, {"role": "user", "content": "{{input}}"}]
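
To make the {{variable}} substitution concrete, here is a rough sketch of what happens to a chat-format prompt before it is sent. The renderPrompt function is a hypothetical, simplified stand-in for Nunjucks (real Nunjucks also handles filters, conditionals, and loops), and the system message and input value are made-up sample data.

```javascript
// Simplified stand-in for Nunjucks: replaces {{name}} with vars[name].
// Unknown variables are left untouched so they are easy to spot.
function renderPrompt(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

// Chat-format prompt expressed as a JSON string, as in the list above.
const chatTemplate = JSON.stringify([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: '{{input}}' },
]);

const rendered = renderPrompt(chatTemplate, { input: 'Plan a cycling route' });
console.log(JSON.parse(rendered)[1].content); // prints "Plan a cycling route"
```

Note that this naive version would produce invalid JSON if a value contained double quotes; it is meant only to show the substitution step, not to replace the real template engine.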

Assertion Types

| Type | Use | Value |
|------|-----|-------|
| contains | Substring match | "expected text" |
| icontains | Case-insensitive substring | "expected text" |
| equals | Exact match | "exact value" |
| regex | Pattern match | "\\d{4}-\\d{2}-\\d{2}" |
| is-json | Valid JSON output | (none) |
| contains-json | Output contains JSON | (none) |
| starts-with | Prefix match | "prefix" |
| cost | Max cost | threshold: 0.01 |
| latency | Max response time (ms) | threshold: 5000 |
| javascript | Custom JS expression | output.includes('x') |
| python | Custom Python | file://check.py:fn_name |
| llm-rubric | LLM-as-judge | rubric text |
| similar | Semantic similarity | value: "text", threshold: 0.8 |
| model-graded-factuality | Fact checking | (none) |

Prefix any assertion with not- to negate (e.g., not-contains).
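
The deterministic assertion types and the not- prefix can be pictured with a small sketch. This is a hypothetical re-implementation for illustration only, covering a few types from the table; checkAssertion and the checks map are invented names, and promptfoo's actual behavior may differ in details such as whitespace handling.

```javascript
// Hypothetical re-implementation of a few deterministic checks.
// Not promptfoo's code; it only illustrates the semantics described above.
const checks = {
  contains: (output, value) => output.includes(value),
  icontains: (output, value) => output.toLowerCase().includes(value.toLowerCase()),
  equals: (output, value) => output === value,
  regex: (output, value) => new RegExp(value).test(output),
  'starts-with': (output, value) => output.startsWith(value),
  'is-json': (output) => {
    try {
      JSON.parse(output);
      return true;
    } catch {
      return false;
    }
  },
};

function checkAssertion(assertion, output) {
  // A "not-" prefix inverts the result of the underlying check.
  const negated = assertion.type.startsWith('not-');
  const baseType = negated ? assertion.type.slice(4) : assertion.type;
  const check = checks[baseType];
  if (!check) throw new Error(`Unsupported assertion type in this sketch: ${baseType}`);
  const passed = check(output, assertion.value);
  return negated ? !passed : passed;
}

const sampleOutput = 'Released on 2026-03-08';
console.log(checkAssertion({ type: 'regex', value: '\\d{4}-\\d{2}-\\d{2}' }, sampleOutput)); // true
console.log(checkAssertion({ type: 'not-contains', value: 'error' }, sampleOutput)); // true
console.log(checkAssertion({ type: 'is-json' }, sampleOutput)); // false
```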

llm-rubric

Uses an LLM to grade output against a rubric:

assert:
  - type: llm-rubric
    value: |
      The response should:
      - Mention at least 3 factors
      - Include specific examples
    threshold: 0.7
    provider: anthropic:messages:claude-sonnet-4-5-20250929

javascript

Inline expressions or functions. Access output (string) and context (with vars, prompt):

assert:
  - type: javascript
    value: output.length > 100 && output.includes('route')
  - type: javascript
    value: |
      const data = JSON.parse(output);
      return data.calories >= 200 && data.calories <= 300;
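
Longer checks tend to be easier to maintain in their own file. The sketch below restates the calorie check above as a standalone function, assuming the assertion value can point at a script via file:// and that the function may return an object with pass, score, and reason fields instead of a boolean. The file name check-calories.js and the function name are hypothetical.

```javascript
// check-calories.js (hypothetical file name)
// Same range check as the inline example, but tolerant of non-JSON output.
function checkCalories(output, context) {
  let data;
  try {
    data = JSON.parse(output);
  } catch (err) {
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }

  // Guard against valid JSON that is not an object (for example null).
  const calories = data && typeof data === 'object' ? data.calories : undefined;
  const inRange = typeof calories === 'number' && calories >= 200 && calories <= 300;

  return {
    pass: inRange,
    score: inRange ? 1 : 0,
    reason: inRange ? 'calories is within 200-300' : `calories out of range: ${calories}`,
  };
}

module.exports = checkCalories;
```

Under those assumptions, the assertion would reference the file with value: file://check-calories.js. The reason string is there so that a failing case explains itself in the results instead of showing only a bare false.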

Test Organization

Split cases into separate files and reference them:

tests:
  - file://cases/basic.yaml
  - file://cases/edge-cases.yaml

Each case file contains a YAML array of test objects.

CLI

npx promptfoo eval                         # Run with auto-discovered config
npx promptfoo eval -c path/to/config.yaml  # Specific config
npx promptfoo eval --filter-metadata key=v # Filter tests
npx promptfoo view                         # Web UI for results
npx promptfoo cache clear                  # Clear result cache
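
When evals are launched from a script (for example behind an npm run evals task), the commands above can be assembled programmatically. This is a minimal sketch using only the flags listed here; buildEvalArgs and runEval are hypothetical helper names, and runEval is defined but not called so that the snippet has no side effects.

```javascript
const { spawnSync } = require('node:child_process');

// Builds the argument list for `npx promptfoo eval` from the flags shown above.
function buildEvalArgs({ config, filterMetadata } = {}) {
  const args = ['promptfoo', 'eval'];
  if (config) args.push('-c', config);
  if (filterMetadata) args.push('--filter-metadata', filterMetadata);
  return args;
}

// Runs the eval and returns the exit code (null if the process did not exit normally).
function runEval(options) {
  const result = spawnSync('npx', buildEvalArgs(options), { stdio: 'inherit' });
  return result.status;
}

console.log(buildEvalArgs({ config: 'path/to/config.yaml', filterMetadata: 'key=v' }).join(' '));
// prints "promptfoo eval -c path/to/config.yaml --filter-metadata key=v"
```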

References

Consult the configuration reference and Anthropic provider docs for full details.
