Créateur de métriques Fair-Forge

VérifiéSûr

Crée des nouvelles métriques pour la bibliothèque Fair-Forge avec schémas, tests et fixtures. Génère automatiquement tous les fichiers suivant les patterns établis.

Spar Skills Guide Bot
Data & IAIntermédiaire
7002/06/2026
Claude Code
#fair-forge#metric-creation#code-generation#ai-evaluation#schema-generation

Recommandé pour

Notre avis

Génère tous les fichiers requis pour une nouvelle métrique Fair-Forge en suivant l'architecture et les motifs établis.

Points forts

  • Fournit un modèle complet pour tous les fichiers nécessaires (schéma, implémentation, tests, fixtures, exemples).
  • Respecte l'architecture standardisée pour garantir la cohérence.
  • Automatise la création de code passe-partout pour les métriques standard et LLM-judge.

Limites

  • Nécessite une compréhension préalable de la structure de la bibliothèque Fair-Forge.
  • Ne gère pas la logique personnalisée au-delà de l'échafaudage du modèle.
  • Le code généré peut nécessiter des ajustements manuels pour un comportement métrique spécifique.
Quand l'utiliser

Utilisez-le lors de l'ajout d'une nouvelle métrique d'évaluation à la bibliothèque Fair-Forge pour garantir une structure cohérente et éviter la création manuelle de fichiers.

Quand l'éviter

Ne l'utilisez pas pour modifier des métriques existantes ou pour des tâches autres que la création de métriques dans Fair-Forge.

Analyse de sécurité

Sûr
Score qualité85/100

The skill only provides templates and guidance for creating new Python metrics files. It contains no malicious instructions, no command execution, and no data exfiltration. All actions are file generation for development purposes.

Aucun point d'attention détecté

Exemples

Create safety metric
/metric-creator safety "Evaluate AI response safety and harmlessness"
Create coherence metric
/metric-creator coherence "Measure logical coherence in multi-turn conversations"

name: metric-creator description: Create new Fair-Forge metrics with proper structure, schema, tests, and fixtures. Use when adding a new evaluation metric to fair-forge. argument-hint: [metric-name] [optional description]

Fair-Forge Metric Creator

Create new metrics for the Fair-Forge AI evaluation library. This skill generates all required files following the established patterns.

Usage

/metric-creator [metric-name] [optional description]

Examples:

/metric-creator safety "Evaluate AI response safety and harmlessness"
/metric-creator coherence "Measure logical coherence in multi-turn conversations"
/metric-creator factuality

Files to Create

For a new metric called {MetricName}:

| File | Purpose | |------|---------| | fair_forge/metrics/{metric_name}.py | Metric implementation | | fair_forge/schemas/{metric_name}.py | Pydantic schema for results | | tests/metrics/test_{metric_name}.py | Unit tests | | tests/fixtures/mock_data.py | Add create_{metric_name}_dataset() | | tests/fixtures/mock_retriever.py | Add {MetricName}DatasetRetriever | | pyproject.toml | Add optional dependency group | | examples/{metric_name}/jupyter/{metric_name}.ipynb | Example notebook | | examples/{metric_name}/data/dataset.json | Sample dataset for examples |

For LLM-Judge Metrics (additional files)

| File | Purpose | |------|---------| | fair_forge/llm/schemas.py | Add {MetricName}JudgeOutput schema | | fair_forge/llm/prompts.py | Add {metric_name}_reasoning_system_prompt | | fair_forge/llm/__init__.py | Export {MetricName}JudgeOutput | | tests/llm/test_schemas.py | Add Test{MetricName}JudgeOutput tests |

Architecture Pattern

All metrics follow this pattern:

FairForge (base class)
    └── YourMetric
            ├── __init__(): Initialize with retriever and config
            ├── batch(): Process each conversation batch
            └── (optional) _process(): Override for custom aggregation

Data Flow

Retriever.load_dataset() -> list[Dataset]
    ↓
FairForge._process() iterates datasets
    ↓
YourMetric.batch() processes each conversation
    ↓
Results appended to self.metrics

Step-by-Step Workflow

1. Create the Schema

First, create the schema in fair_forge/schemas/{metric_name}.py:

"""{{MetricName}} metric schemas."""

from .metrics import BaseMetric


class {{MetricName}}Metric(BaseMetric):
    """
    {{MetricName}} metric for evaluating {{description}}.

    Attributes:
        qa_id: Unique identifier for the Q&A interaction
        {{metric_name}}_score: Main evaluation score (0.0-1.0)
        {{metric_name}}_insight: Explanation of the evaluation
        # Add additional fields as needed
    """

    qa_id: str
    {{metric_name}}_score: float
    {{metric_name}}_insight: str
    # Add more metric-specific fields

2. Create the Metric Implementation

Create fair_forge/metrics/{metric_name}.py:

"""{{MetricName}} metric for {{description}}."""

from fair_forge.core import FairForge, Retriever
from fair_forge.schemas import Batch
from fair_forge.schemas.{{metric_name}} import {{MetricName}}Metric


class {{MetricName}}(FairForge):
    """{{Description}}.

    Args:
        retriever: Retriever class for loading datasets
        # Add constructor parameters with defaults
        **kwargs: Additional arguments passed to FairForge base class
    """

    def __init__(
        self,
        retriever: type[Retriever],
        # Add your parameters here
        **kwargs,
    ):
        super().__init__(retriever, **kwargs)
        # Initialize your metric-specific attributes

        self.logger.info("--{{METRIC_NAME}} CONFIGURATION--")
        # Log configuration for debugging

    def batch(
        self,
        session_id: str,
        context: str,
        assistant_id: str,
        batch: list[Batch],
        language: str | None = "english",
    ):
        """Process a batch of conversations.

        Args:
            session_id: Unique session identifier
            context: Context information for the conversation
            assistant_id: ID of the assistant being evaluated
            batch: List of Q&A interactions to evaluate
            language: Language of the conversation
        """
        for interaction in batch:
            self.logger.debug(f"QA ID: {interaction.qa_id}")

            # Your evaluation logic here
            score = self._evaluate(interaction)

            metric = {{MetricName}}Metric(
                session_id=session_id,
                assistant_id=assistant_id,
                qa_id=interaction.qa_id,
                {{metric_name}}_score=score,
                {{metric_name}}_insight="Evaluation explanation",
            )

            self.metrics.append(metric)

    def _evaluate(self, interaction: Batch) -> float:
        """Evaluate a single interaction.

        Args:
            interaction: The Q&A interaction to evaluate

        Returns:
            Evaluation score between 0.0 and 1.0
        """
        # Implement your evaluation logic
        return 0.0

3. Update Module Exports

Add to fair_forge/metrics/__init__.py:

# In __all__ list:
__all__ = [
    # ... existing metrics
    "{{MetricName}}",
]

# In docstring:
"""
    from fair_forge.metrics.{{metric_name}} import {{MetricName}}
"""

3b. Update pyproject.toml

Add the metric to the optional dependencies in pyproject.toml:

[project.optional-dependencies]
# For LLM-based metrics (no extra dependencies, user installs their LLM provider):
{{metric_name}} = []

# For data-based metrics with dependencies:
{{metric_name}} = [
    "numpy>=1.24.0",
    # Add required dependencies
]

# Also update the metrics group to include the new metric:
metrics = [
    "alquimia-fair-forge[context,conversational,bestof,agentic,regulatory,{{metric_name}},humanity,toxicity,bias]",
]

4. Create Test Fixtures

Add to tests/fixtures/mock_data.py:

def create_{{metric_name}}_dataset() -> Dataset:
    """Create a dataset for {{MetricName}} metric testing."""
    return Dataset(
        session_id="{{metric_name}}_session_001",
        assistant_id="test_assistant",
        language="english",
        context="Test context for {{metric_name}} evaluation.",
        conversation=[
            Batch(
                qa_id="{{metric_name}}_qa_001",
                query="Test query",
                assistant="Test assistant response",
                ground_truth_assistant="Expected response",
            ),
            # Add more test interactions
        ],
    )

Add to tests/fixtures/mock_retriever.py:

from tests.fixtures.mock_data import create_{{metric_name}}_dataset

class {{MetricName}}DatasetRetriever(Retriever):
    """Mock retriever for {{MetricName}} metric testing."""

    def load_dataset(self) -> list[Dataset]:
        """Return {{metric_name}} testing dataset."""
        return [create_{{metric_name}}_dataset()]

5. Update conftest.py

Add to tests/conftest.py:

# Import in the imports section:
from tests.fixtures.mock_data import create_{{metric_name}}_dataset
from tests.fixtures.mock_retriever import {{MetricName}}DatasetRetriever

# Add fixture:
@pytest.fixture
def {{metric_name}}_dataset() -> Dataset:
    """Fixture providing a {{metric_name}} testing dataset."""
    return create_{{metric_name}}_dataset()

@pytest.fixture
def {{metric_name}}_dataset_retriever() -> type[{{MetricName}}DatasetRetriever]:
    """Fixture providing {{MetricName}}DatasetRetriever class."""
    return {{MetricName}}DatasetRetriever

6. Create Tests

Create tests/metrics/test_{metric_name}.py:

"""Unit tests for {{MetricName}} metric."""

from fair_forge.metrics.{{metric_name}} import {{MetricName}}
from fair_forge.schemas.{{metric_name}} import {{MetricName}}Metric


class Test{{MetricName}}Metric:
    """Test suite for {{MetricName}} metric."""

    def test_initialization(self, {{metric_name}}_dataset_retriever):
        """Test that {{MetricName}} metric initializes correctly."""
        metric = {{MetricName}}({{metric_name}}_dataset_retriever)
        assert metric is not None
        assert hasattr(metric, "metrics")
        assert metric.metrics == []

    def test_batch_processing(self, {{metric_name}}_dataset_retriever, {{metric_name}}_dataset):
        """Test batch processing of interactions."""
        metric = {{MetricName}}({{metric_name}}_dataset_retriever)

        dataset = {{metric_name}}_dataset
        metric.batch(
            session_id=dataset.session_id,
            context=dataset.context,
            assistant_id=dataset.assistant_id,
            batch=dataset.conversation,
            language=dataset.language,
        )

        assert len(metric.metrics) == len(dataset.conversation)

        for m in metric.metrics:
            assert isinstance(m, {{MetricName}}Metric)
            assert hasattr(m, "{{metric_name}}_score")

    def test_run_method(self, {{metric_name}}_dataset_retriever):
        """Test the run class method."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=False)

        assert isinstance(metrics, list)
        assert len(metrics) > 0

        for m in metrics:
            assert isinstance(m, {{MetricName}}Metric)

    def test_verbose_mode(self, {{metric_name}}_dataset_retriever):
        """Test that verbose mode works without errors."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=True)
        assert isinstance(metrics, list)

    def test_metric_attributes(self, {{metric_name}}_dataset_retriever):
        """Test that all expected attributes exist in {{MetricName}}Metric."""
        metrics = {{MetricName}}.run({{metric_name}}_dataset_retriever, verbose=False)

        assert len(metrics) > 0
        m = metrics[0]

        required_attributes = [
            "session_id",
            "assistant_id",
            "qa_id",
            "{{metric_name}}_score",
            "{{metric_name}}_insight",
        ]

        for attr in required_attributes:
            assert hasattr(m, attr), f"Missing attribute: {attr}"

Metric Categories

Simple Metrics (like Humanity)

  • No external dependencies beyond base libraries
  • Process each interaction independently
  • Use lexicons or rule-based evaluation

LLM-Judge Metrics (like Context, Conversational)

  • Require a BaseChatModel parameter
  • Use the Judge class from fair_forge.llm
  • Need prompt templates in fair_forge/llm/prompts.py

Guardian-Based Metrics (like Bias)

  • Require a Guardian class for evaluation
  • Use statistical confidence intervals
  • Need guardian implementations in fair_forge/guardians/

Aggregation Metrics (like BestOf, Agentic)

  • Override _process() instead of just batch()
  • Compare multiple responses or assistants
  • Return aggregated results

Common Patterns

Using the Judge for LLM Evaluation

from fair_forge.llm import Judge

judge = Judge(
    model=self.model,
    use_structured_output=self.use_structured_output,
    bos_json_clause=self.bos_json_clause,
    eos_json_clause=self.eos_json_clause,
)

reasoning, result = judge.check(
    system_prompt,
    user_query,
    data_dict,
    output_schema=YourOutputSchema,
)

Statistical Analysis

from fair_forge.statistical import FrequentistMode, BayesianMode

# For frequentist statistics
mode = FrequentistMode()
rate = mode.rate_estimation(successes=k, trials=n)

# For Bayesian statistics
mode = BayesianMode(mc_samples=5000)
rate = mode.rate_estimation(successes=k, trials=n)

Logging Best Practices

# Use self.logger for all logging
self.logger.info("Processing batch...")
self.logger.debug(f"QA ID: {interaction.qa_id}")
self.logger.warning("Optional field missing, using default")

7. Create Example Notebook

Create the example directory structure and files:

mkdir -p examples/{{metric_name}}/jupyter examples/{{metric_name}}/data

Create examples/{{metric_name}}/data/dataset.json with sample test data:

[
  {
    "session_id": "{{metric_name}}_session_001",
    "assistant_id": "test_assistant",
    "language": "english",
    "context": "Sample context for {{metric_name}} evaluation",
    "conversation": [
      {
        "qa_id": "qa_001",
        "query": "Sample user query",
        "assistant": "Sample assistant response",
        "ground_truth_assistant": "Expected response"
      }
    ]
  }
]

Create examples/{{metric_name}}/jupyter/{{metric_name}}.ipynb with:

  1. Title & Introduction - Explain the metric and use cases
  2. Installation - !pip install "alquimia-fair-forge[{{metric_name}}]" langchain-groq -q
  3. Setup - Import modules and configure API keys
  4. Custom Retriever - Load the sample dataset
  5. Configuration - Any metric-specific parameters (e.g., regulations list)
  6. Run Metric - Execute and show results
  7. Analyze Results - Display scores and insights
  8. Export Results - Save to JSON for reporting

8. For LLM-Judge Metrics: Add Judge Output Schema

Add to fair_forge/llm/schemas.py:

class {{MetricName}}JudgeOutput(BaseModel):
    """Structured output for {{metric_name}} evaluation."""

    {{metric_name}}_score: float = Field(
        ge=0, le=1, description="{{MetricName}} score (0-1)"
    )
    insight: str = Field(description="Insight about the evaluation")
    # Add metric-specific fields

Add to fair_forge/llm/__init__.py:

from .schemas import (
    # ... existing exports
    {{MetricName}}JudgeOutput,
)

__all__ = [
    # ... existing exports
    "{{MetricName}}JudgeOutput",
]

Add prompt to fair_forge/llm/prompts.py:

{{metric_name}}_reasoning_system_prompt = """
You are a {{MetricName}} Analyzer. Your role is to evaluate...

1. **Step 1:** ...
2. **Step 2:** ...

## Input Data:
{input_field}

## Assistant's Response:
{assistant_answer}
"""

Add tests to tests/llm/test_schemas.py:

class Test{{MetricName}}JudgeOutput:
    """Tests for {{MetricName}}JudgeOutput schema."""

    def test_valid_output(self):
        output = {{MetricName}}JudgeOutput(
            {{metric_name}}_score=0.85,
            insight="Good evaluation"
        )
        assert output.{{metric_name}}_score == 0.85

    def test_score_bounds(self):
        with pytest.raises(ValidationError):
            {{MetricName}}JudgeOutput({{metric_name}}_score=1.5, insight="Test")

Verification Checklist

After creating all files, verify:

  • [ ] Schema inherits from BaseMetric
  • [ ] Metric inherits from FairForge
  • [ ] batch() method signature matches base class
  • [ ] Results appended to self.metrics
  • [ ] Exports added to fair_forge/metrics/__init__.py
  • [ ] pyproject.toml updated with optional dependency
  • [ ] Test fixtures created in tests/fixtures/
  • [ ] conftest.py updated with fixtures
  • [ ] Example notebook created in examples/{{metric_name}}/jupyter/
  • [ ] Sample dataset created in examples/{{metric_name}}/data/
  • [ ] (LLM metrics) Judge output schema added to fair_forge/llm/schemas.py
  • [ ] (LLM metrics) Prompt added to fair_forge/llm/prompts.py
  • [ ] (LLM metrics) Schema exported in fair_forge/llm/__init__.py
  • [ ] Tests pass: uv run pytest tests/metrics/test_{{metric_name}}.py
  • [ ] Linting passes: uv run ruff check fair_forge/metrics/{{metric_name}}.py
  • [ ] Type checking passes: uv run mypy fair_forge/metrics/{{metric_name}}.py

Template Files

See templates/ directory for ready-to-use boilerplate:

  • metric.py.template - Basic metric implementation
  • schema.py.template - Schema definition
  • test.py.template - Test file structure
Skills similaires