Evaluators¶

What are Evaluators?¶

Evaluators are the core components that check whether an LLM output meets specified criteria. The output under test can include the sources used to create it, tool calls, and similar artifacts. Each evaluator performs one specific type of check and returns a pass/fail result with reasoning.

Built-in Evaluators¶

See Evaluators for the built-in evaluators.

Creating Custom Evaluators¶

Basic Pattern¶

All custom evaluators must:

1. Inherit from BaseEvaluator or one of the other useful base evaluators
2. Implement the from_csv_line() class method with the standard signature
3. Implement the async run() method

from typing import Any
from ragpill.base import BaseEvaluator, EvaluatorMetadata
from ragpill.eval_types import EvaluationReason, EvaluatorContext

class MyEvaluator(BaseEvaluator):
    """Description of what this evaluator checks."""

    @classmethod
    def from_csv_line(
        cls,
        expected: bool,
        tags: set[str],
        check: str,
        **kwargs: Any
    ):
        """Create evaluator from CSV row data.

        This class method is required for CSV integration.
        The signature must be exactly this - do not add custom parameters here.
        Use the 'check' parameter for per-instance config (see examples below).
        """
        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,
        )

    async def run(
        self,
        ctx: EvaluatorContext[object, object, EvaluatorMetadata],
    ) -> EvaluationReason:
        # Your evaluation logic
        passed = self._check_condition(ctx.output)

        return EvaluationReason(
            value=passed,
            reason=f"Explanation of why it {'passed' if passed else 'failed'}",
        )

    def _check_condition(self, output: str) -> bool:
        # Helper method
        return True

Parameterization Patterns¶

There are two ways to parameterize custom evaluators:

Pattern 1: Environment Variables (for shared configuration)¶

Use this for configuration shared across all instances (API keys, global thresholds, etc.):

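from typing import Any

from pydantic import SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict

from ragpill.base import BaseEvaluator, EvaluatorMetadata
from ragpill.eval_types import EvaluationReason, EvaluatorContext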

class LengthEvaluatorSettings(BaseSettings):
    """Settings loaded from environment variables."""
    model_config = SettingsConfigDict(env_prefix='LENGTH_EVAL_')

    api_key: SecretStr
    min_length: int = 10
    max_length: int = 1000

class LengthEvaluator(BaseEvaluator):
    """Checks if output length is within bounds from settings."""

    settings: LengthEvaluatorSettings

    @classmethod
    def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any):
        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,
            settings=LengthEvaluatorSettings(),
        )

    async def run(
        self,
        ctx: EvaluatorContext[object, object, EvaluatorMetadata],
    ) -> EvaluationReason:
        length = len(ctx.output)
        passed = self.settings.min_length <= length <= self.settings.max_length

        return EvaluationReason(
            value=passed,
            reason=f"Length {length} (range: {self.settings.min_length}-{self.settings.max_length})",
        )

# Set environment variables:
# export LENGTH_EVAL_API_KEY=<your-api-key>   # required: api_key has no default
# export LENGTH_EVAL_MIN_LENGTH=50
# export LENGTH_EVAL_MAX_LENGTH=500

Pattern 2: JSON in check Column (for per-instance configuration)¶

Use this for parameters that vary per test case (regex patterns, specific values, etc.):

import json
import re
from typing import Any

from ragpill.base import BaseEvaluator, EvaluatorMetadata
from ragpill.eval_types import EvaluationReason, EvaluatorContext

class RegexEvaluator(BaseEvaluator):
    """Checks if output matches a regex pattern from check column."""

    pattern: str

    @classmethod
    def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any):
        """Parse pattern from check column (plain text or JSON)."""
        try:
            check_dict = json.loads(check)
            if isinstance(check_dict, dict):
                # JSON object: read 'pattern', fall back to the raw text
                pattern = check_dict.get('pattern', check)
            else:
                # Valid JSON but not an object (e.g. a number): use raw text
                pattern = check
        except json.JSONDecodeError:
            # Plain text - use as pattern
            pattern = check

        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,
            pattern=pattern,
        )

    async def run(
        self,
        ctx: EvaluatorContext[object, object, EvaluatorMetadata],
    ) -> EvaluationReason:
        # Case-insensitive search anywhere in the output
        regex = re.compile(self.pattern, re.IGNORECASE)
        match = regex.search(ctx.output)
        passed = match is not None

        return EvaluationReason(
            value=passed,
            reason=f"Pattern '{self.pattern}' {'found' if passed else 'not found'}",
        )

# CSV examples:
# Plain text pattern:
# Question,test_type,expected,tags,check
# What is Python?,RegexEvaluator,true,tech,programming language
#
# JSON object with a "pattern" key (further keys can carry additional config):
# What is Python?,RegexEvaluator,true,tech,"{\"pattern\": \".*programming.*\"}"
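
The plain-text-or-JSON fallback above can be factored into a small helper and tested without any evaluator. The following is a minimal sketch using only the standard library; parse_check and the "flags" key are hypothetical names for illustration and are not part of ragpill. Unlike RegexEvaluator, this helper returns a JSON object unchanged even if it lacks a "pattern" key.

```python
import json
from typing import Any


def parse_check(check: str, default_key: str = "pattern") -> dict[str, Any]:
    """Return the check column as a dict.

    A JSON object is returned as-is. Anything else (plain text, or valid
    JSON that is not an object, such as a bare number) is wrapped as
    {default_key: check}.
    """
    try:
        parsed = json.loads(check)
    except json.JSONDecodeError:
        # Plain text: treat the whole string as the value
        return {default_key: check}
    if isinstance(parsed, dict):
        return parsed
    # Valid JSON, but not an object: keep the raw text
    return {default_key: check}


plain = parse_check("programming language")
structured = parse_check('{"pattern": ".*programming.*", "flags": "i"}')
numeric = parse_check("42")
```

Keeping the parsing separate from from_csv_line() makes the edge cases (non-object JSON, missing keys) easy to cover with unit tests.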

Real-World Example: Built-in Evaluator¶

See the built-in RegexInDocumentMetadataEvaluator for a complete example that uses JSON configuration:

# From evaluators.py
@classmethod
def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any):
    """Create evaluator from CSV with JSON in check column."""
    try:
        check_dict = json.loads(check)
        if isinstance(check_dict, dict):
            pattern = check_dict.get('pattern')
            metadata_key = check_dict.get('key')
        else:
            raise ValueError("check must be a JSON object")
    except json.JSONDecodeError:
        raise ValueError(
            f"RegexInDocumentMetadataEvaluator requires 'check' to be a JSON string "
            f"with 'pattern' and 'key'. Got: {check}"
        )

    return cls(
        expected=expected,
        tags=tags,
        attributes=kwargs,
        pattern=pattern,
        metadata_key=metadata_key,
    )

# CSV usage:
# Question,test_type,expected,tags,check
# Query docs,RegexInDocumentMetadata,true,retrieval,"{\"pattern\": \".*2024.*\", \"key\": \"date\"}"
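
Hand-escaping JSON inside a CSV cell is error-prone, so it can help to generate such rows programmatically. The sketch below writes a row with Python's csv module and reads it back to confirm the JSON survives. It assumes the test-set loader accepts standard CSV quoting (doubled quotes), which this page does not confirm; the row values are illustrative only.

```python
import csv
import io
import json

# Illustrative row using the column layout shown on this page.
check = {"pattern": ".*2024.*", "key": "date"}
row = ["Query docs", "RegexInDocumentMetadata", "true", "retrieval", json.dumps(check)]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Question", "test_type", "expected", "tags", "check"])
writer.writerow(row)  # csv.writer quotes the JSON cell and doubles inner quotes

# Read the text back to confirm the JSON round-trips unchanged.
buffer.seek(0)
records = list(csv.DictReader(buffer))
recovered = json.loads(records[0]["check"])
```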

Custom Attributes¶

You can add custom attributes to evaluators by adding columns to your CSV:

Question,test_type,expected,tags,check,priority,category
What is X?,LLMJudge,true,factual,answer_correctness,high,science
What is Y?,RegexEvaluator,false,format,email_format,low,validation

These custom columns (like priority and category) are automatically:

1. Passed to each evaluator's attributes dict via the **kwargs in from_csv_line()
2. Available in your evaluator through self.attributes

Important: If all evaluators for a given question have the same value for an attribute, that attribute becomes part of the Test Case metadata and will be visible in MLflow tracking.

# In code - extend default evaluators with your custom class
from ragpill.csv.testset import load_testset, default_evaluator_classes

evaluator_classes = default_evaluator_classes | {
    'MyEvaluator': MyEvaluator,
}

dataset = load_testset(
    csv_path="testset.csv",
    evaluator_classes=evaluator_classes,
)

# Access attributes in your evaluator
class MyEvaluator(BaseEvaluator):
    @classmethod
    def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any):
        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,  # Contains {"priority": "high", "category": "science"}
        )

    async def run(self, ctx):
        priority = self.attributes.get('priority', 'medium')
        # Use the attribute in your logic
        ...
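
The rule that an attribute shared by all evaluators of a question becomes Test Case metadata can be illustrated with a plain function. This is only a sketch of the idea, not ragpill's implementation: shared_attributes is a hypothetical name, and it assumes an attribute counts as shared only when every evaluator has the key with an equal value.

```python
from typing import Any


def shared_attributes(per_evaluator: list[dict[str, Any]]) -> dict[str, Any]:
    """Return the attributes on which every evaluator of one question agrees."""
    if not per_evaluator:
        return {}
    first, *rest = per_evaluator
    return {
        key: value
        for key, value in first.items()
        # Keep the key only if all other evaluators carry the same value
        if all(key in other and other[key] == value for other in rest)
    }


attrs = [
    {"priority": "high", "category": "science"},
    {"priority": "high", "category": "format"},
]
case_level = shared_attributes(attrs)  # only "priority" is shared
```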

Multiple Evaluators¶

You can attach multiple evaluators to a single test case:

case = Case(
    inputs=BaseTestInput(metadata=metadata),
    evaluators=[
        LLMJudge(...),  # Check correctness
        LengthEvaluator(...),  # Check length
        RegexEvaluator(...),  # Check format
    ],
)

expected Inheritance¶

When constructing evaluators programmatically, expected defaults to None. At evaluation time it is resolved via merge_metadata:

- Non-global evaluators: evaluator value wins; if None, falls back to case metadata.
- Global evaluators: case metadata wins; if None, falls back to evaluator value.
- If both are None, the final default is True.

This lets you set a case-wide default and override it only where needed:

Case(
    inputs="How many r's are in 'strawberry'?",
    metadata=TestCaseMetadata(expected=False),  # default for all evaluators
    evaluators=[
        RegexInOutputEvaluator(pattern="1"),              # inherits expected=False
        RegexInOutputEvaluator(pattern="2"),              # inherits expected=False
        RegexInOutputEvaluator(pattern="3", expected=True),  # overrides to True
        RegexInOutputEvaluator(pattern="4"),              # inherits expected=False
    ],
)
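
The resolution rules in this section can be written down as a small pure function. This is a sketch of the documented behavior, not the actual merge_metadata code; resolve_expected and the is_global flag are hypothetical names.

```python
from typing import Optional


def resolve_expected(
    evaluator_expected: Optional[bool],
    case_expected: Optional[bool],
    is_global: bool = False,
) -> bool:
    """Resolve the effective 'expected' value for one evaluator."""
    if is_global:
        # Global evaluators: case metadata wins
        first, second = case_expected, evaluator_expected
    else:
        # Non-global evaluators: evaluator value wins
        first, second = evaluator_expected, case_expected
    if first is not None:
        return first
    if second is not None:
        return second
    # Both are None: final default
    return True


inherits = resolve_expected(None, False)                # falls back to case metadata
overrides = resolve_expected(True, False)               # evaluator value wins
global_case = resolve_expected(True, False, is_global=True)  # case metadata wins
fallback = resolve_expected(None, None)                 # final default
```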

Evaluator Results¶

EvaluationReason Structure¶

Each evaluator's run() method returns an EvaluationReason:

EvaluationReason(
    value=True,                  # Whether the check passed
    reason="Explanation...",     # Human-readable explanation
)

Best Practices¶

1. Clear Naming¶

Use descriptive names for your evaluators:

# Good
class EmailFormatEvaluator(BaseEvaluator): ...
class SentimentPositivityEvaluator(BaseEvaluator): ...

# Avoid
class Evaluator1(BaseEvaluator): ...
class CheckThing(BaseEvaluator): ...

2. Informative Reasons¶

Provide helpful explanations in reason:

# Good
reason = f"Found 3 of 5 required keywords: {found_keywords}"

# Less helpful
reason = "Failed"
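
An informative reason is easiest to produce when the check itself collects the details. The sketch below is a standalone helper, independent of ragpill; check_keywords is a hypothetical name, and case-insensitive substring matching is an assumption made for the example.

```python
def check_keywords(
    output: str,
    keywords: list[str],
    require_all: bool = True,
) -> tuple[bool, str]:
    """Return (passed, reason) for a keyword check on the output text."""
    text = output.lower()
    found = [kw for kw in keywords if kw.lower() in text]
    missing = [kw for kw in keywords if kw.lower() not in text]
    # AND logic needs nothing missing; OR logic needs at least one hit
    passed = len(missing) == 0 if require_all else len(found) > 0
    mode = "all" if require_all else "any"
    reason = (
        f"Found {len(found)} of {len(keywords)} keywords ({mode} required); "
        f"found={found}, missing={missing}"
    )
    return passed, reason


answer = "Python is a programming language"
strict_passed, strict_reason = check_keywords(answer, ["python", "compiled"])
loose_passed, loose_reason = check_keywords(answer, ["python", "compiled"], require_all=False)
```

The returned pair maps directly onto the value and reason fields of an EvaluationReason.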

3. Deterministic When Possible¶

Prefer deterministic checks over LLM judges when possible:

- Regex for format validation
- Length for size constraints
- Keyword matching for required terms

Use LLM judges for:

- Semantic correctness
- Tone and style

4. Document Your Evaluators¶

Add docstrings explaining what and why:

class KeywordEvaluator(BaseEvaluator):
    """Checks if specific keywords are present in the output.

    Useful for ensuring important terms are mentioned without
    requiring exact phrasing. Can check for any keyword (OR logic)
    or all keywords (AND logic).

    Args:
        keywords: List of keywords to check for
        require_all: If True, all keywords must be present (AND).
                    If False, any keyword is sufficient (OR).
    """

Handling of Evaluator Failures¶

By default, an exception raised by an evaluator during run() is treated as a failure (i.e., value=False), and the exception message is recorded in the reason. Unexpected errors in evaluators therefore do not crash the entire evaluation process and remain visible for debugging.
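
The behavior described here amounts to wrapping run() in a try/except. The sketch below shows the idea with stand-in types so that it runs without ragpill; Result, run_safely, and broken_evaluator are hypothetical names, and the exact reason format ragpill records may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Result:
    """Stand-in for EvaluationReason: a pass/fail value plus a reason."""
    value: bool
    reason: str


async def run_safely(run: Callable[[], Awaitable[Result]]) -> Result:
    """Await an evaluator's run callable, turning exceptions into failures."""
    try:
        return await run()
    except Exception as exc:
        # Record the error instead of aborting the whole evaluation
        return Result(value=False, reason=f"{type(exc).__name__}: {exc}")


async def broken_evaluator() -> Result:
    raise ValueError("pattern must not be empty")


result = asyncio.run(run_safely(broken_evaluator))
```

Because errors surface as ordinary failures, it is worth scanning reasons for exception names so that a bug in an evaluator is not mistaken for a genuine failed check.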