
Base Classes

This module contains the base classes for building custom evaluators and test cases.

BaseEvaluator

ragpill.base.BaseEvaluator dataclass

BaseEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False)

Bases: Evaluator

Base class for all evaluators.

All custom evaluators must inherit from this class and implement:

  1. from_csv_line class method - for CSV integration with load_testset
  2. run async method - for evaluation logic

Attributes:

Name Type Description
evaluation_name UUID

Unique identifier for this evaluator instance

expected bool | None

Whether we expect this check to pass. Defaults to None, which means the value is inherited from the case's TestCaseMetadata.expected at evaluation time. If neither the evaluator nor the case metadata sets it, it defaults to True. For non-global evaluators, an explicit evaluator value takes precedence over case metadata; for global evaluators, case metadata takes precedence.

attributes dict

Dictionary for additional metadata (populated from extra CSV columns)

tags set[str]

Set of tags for organization and filtering

is_global bool

Whether this evaluator applies to all test cases

Note

The 'check' parameter is only used in from_csv_line() to pass configuration when creating the evaluator - it's not stored as a class attribute.

See Also

ragpill.csv.testset.load_testset: Create datasets from CSV files

metadata property

metadata

Build metadata from evaluator fields.

from_csv_line classmethod

from_csv_line(expected, tags, check, **kwargs)

Create an evaluator from a CSV line.

This class method is required for CSV integration with load_testset. The signature must be exactly as shown. Subclasses can override this method to customize how they parse the check parameter or handle additional configuration.

Custom Attributes

Any additional CSV columns beyond the standard ones (Question, test_type, expected, tags, check) will be passed as **kwargs and stored in the evaluator's attributes dict. These can be used for metadata tracking, filtering, or custom logic.

If all evaluators for a question share the same attribute value, that attribute becomes part of the Test Case metadata and will be visible in MLflow.
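As a standalone sketch (the function and column names here are hypothetical, and no ragpill import is needed), the flow from extra CSV columns into the attributes dict looks like:

```python
from typing import Any

# Hypothetical stand-in for from_csv_line: the standard columns are named
# parameters, and every extra CSV column lands in **kwargs, which would be
# stored as evaluator.attributes.
def collect_attributes(expected: bool, tags: set[str], check: str, **kwargs: Any) -> dict[str, Any]:
    return kwargs

# "priority" and "category" are illustrative extra columns.
attrs = collect_attributes(True, {"smoke"}, "{}", priority="high", category="geography")
```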

Parameterization Patterns

There are two ways to parameterize custom evaluators:

  1. Environment Variables (for shared config across all instances): Use pydantic-settings BaseSettings to load from environment variables. Good for API keys, global thresholds, model names, etc.

  2. JSON in check column (for per-instance config): Parse JSON from the check parameter to get per-test configuration. Good for regex patterns, specific values, test-specific thresholds.
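The two patterns can be sketched side by side. This is a minimal standalone illustration: the guide suggests pydantic-settings for pattern 1, but plain `os.environ` shows the same idea, and the variable names and values below are assumptions, not ragpill API.

```python
import json
import os

# Pattern 1 (sketch): shared config from environment variables.
# Normally MYEVAL_MODEL would be set in the shell; setdefault keeps this runnable.
os.environ.setdefault("MYEVAL_MODEL", "gpt-4o-mini")
model_name = os.environ["MYEVAL_MODEL"]

# Pattern 2 (sketch): per-instance config as JSON in the `check` column.
check = '{"pattern": "Paris", "threshold": 0.8}'  # one CSV cell
check_params = json.loads(check) if check else {}
```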

Parameters:

Name Type Description Default
expected bool

Whether we expect this check to pass. Set to true for normal tests (e.g., "answer should mention Paris"). Set to false for negative tests (e.g., "answer should NOT hallucinate links"). The evaluation result is compared against this expectation. When constructing evaluators programmatically (not via CSV), you can omit this to inherit the value from case metadata at evaluation time.

required
tags set[str]

Set of tags parsed from the comma-separated tags column in the CSV, used for categorization and filtering.

required
check str

Evaluator-specific configuration data. Can be JSON string or plain text. For JSON: Will be parsed and passed as **check_params to the evaluator. For plain text: Subclasses should override this method to handle their format.

required
**kwargs Any

Additional attributes from extra CSV columns (e.g., priority, category). These become part of evaluator.attributes and are used for: - Metadata tracking and filtering - MLflow logging (when shared across all evaluators of a question) - Custom evaluation logic in your evaluators

{}

Returns:

Type Description
BaseEvaluator

Instance of the evaluator class

Raises:

Type Description
NotImplementedError

If check is not valid JSON and subclass hasn't overridden this method

Example

For CSV usage examples, see the CSV Adapter Guide and Custom Evaluators Guide.

class MyEvaluator(BaseEvaluator):
    pattern: str

    @classmethod
    def from_csv_line(cls, expected: bool, tags: set[str],
                    check: str, **kwargs):
        # Parse check parameter (JSON or plain text)
        try:
            check_dict = json.loads(check)
            pattern = check_dict.get('pattern', check)
        except json.JSONDecodeError:
            pattern = check  # Use as-is

        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,  # Contains custom CSV columns
            pattern=pattern,
        )
Source code in src/ragpill/base.py
@classmethod
def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any) -> "BaseEvaluator":
    """Create an evaluator from a CSV line.

    This class method is required for CSV integration with
    [`load_testset`][ragpill.csv.testset.load_testset].
    The signature must be exactly as shown. Subclasses can override this method to
    customize how they parse the check parameter or handle additional configuration.

    Custom Attributes:
        Any additional CSV columns beyond the standard ones (Question, test_type, expected,
        tags, check) will be passed as **kwargs and stored in the evaluator's
        attributes dict. These can be used for metadata tracking, filtering, or custom logic.

        If all evaluators for a question share the same attribute value, that attribute
        becomes part of the Test Case metadata and will be visible in MLflow.

    Parameterization Patterns:
        There are two ways to parameterize custom evaluators:

        1. **Environment Variables** (for shared config across all instances):
           Use pydantic-settings BaseSettings to load from environment variables.
           Good for API keys, global thresholds, model names, etc.

        2. **JSON in check column** (for per-instance config):
           Parse JSON from the check parameter to get per-test configuration.
           Good for regex patterns, specific values, test-specific thresholds.

    Args:
        expected: Whether we expect this check to pass.
                 Set to `true` for normal tests (e.g., "answer should mention Paris").
                 Set to `false` for negative tests (e.g., "answer should NOT hallucinate links").
                 The evaluation result is compared against this expectation.
                 When constructing evaluators programmatically (not via CSV), you can
                 omit this to inherit the value from case metadata at evaluation time.
        tags: Set of tags parsed from the comma-separated tags column, used for categorization and filtering.
        check: Evaluator-specific configuration data. Can be JSON string or plain text.
               For JSON: Will be parsed and passed as **check_params to the evaluator.
               For plain text: Subclasses should override this method to handle their format.
        **kwargs: Additional attributes from extra CSV columns (e.g., priority, category).
                 These become part of `evaluator.attributes` and are used for:
                 - Metadata tracking and filtering
                 - MLflow logging (when shared across all evaluators of a question)
                 - Custom evaluation logic in your evaluators

    Returns:
        Instance of the evaluator class

    Raises:
        NotImplementedError: If check is not valid JSON and subclass hasn't overridden this method

    Example:
        For CSV usage examples, see the
        [CSV Adapter Guide](https://joelgotsch.github.io/ragpill/latest/guide/csv-adapter/) and
        [Custom Evaluators Guide](https://joelgotsch.github.io/ragpill/latest/guide/evaluators/).

        ```python
        class MyEvaluator(BaseEvaluator):
            pattern: str

            @classmethod
            def from_csv_line(cls, expected: bool, tags: set[str],
                            check: str, **kwargs):
                # Parse check parameter (JSON or plain text)
                try:
                    check_dict = json.loads(check)
                    pattern = check_dict.get('pattern', check)
                except json.JSONDecodeError:
                    pattern = check  # Use as-is

                return cls(
                    expected=expected,
                    tags=tags,
                    attributes=kwargs,  # Contains custom CSV columns
                    pattern=pattern,
                )
        ```
    """

    # Try to parse check as JSON, if it fails treat as plain text
    check_params: dict[str, Any] = {}
    if check:
        try:
            check_params = json.loads(check)
        except json.JSONDecodeError:
            # If not JSON, subclasses should override this method
            # to handle their specific format
            raise NotImplementedError(
                f"Subclasses must implement from_csv_line to handle non-JSON check format: {check}"
            )

    return cls(expected=expected, tags=tags, attributes=kwargs, **check_params)

run async

run(ctx)

The method implementing the evaluation logic. Override this in subclasses.

:param ctx: The evaluator context
:type ctx: EvaluatorContext[Any, Any, EvaluatorMetadata]
:return: The evaluation result with reason
:rtype: EvaluationReason

Source code in src/ragpill/base.py
async def run(
    self,
    ctx: EvaluatorContext[Any, Any, EvaluatorMetadata],  # pyright: ignore[reportUnusedParameter]  # ctx used by subclasses
) -> EvaluationReason:
    """
    The method implementing the evaluation logic. Override this in subclasses.

    :param ctx: The evaluator context
    :type ctx: EvaluatorContext[Any, Any, EvaluatorMetadata]
    :return: The evaluation result with reason
    :rtype: EvaluationReason
    """
    raise NotImplementedError("Subclasses must implement the run method.")
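A minimal subclass override might look like the following sketch. EvaluationReason and the context are stubbed here so the example runs without ragpill installed; the real `run` receives the library's EvaluatorContext and returns its EvaluationReason, and the evaluator name and logic are hypothetical.

```python
import asyncio
from dataclasses import dataclass

# Stubs standing in for ragpill's EvaluationReason and EvaluatorContext,
# so this sketch is self-contained.
@dataclass
class EvaluationReason:
    value: bool
    reason: str

@dataclass
class Ctx:
    output: str

class ContainsEvaluator:
    """Hypothetical evaluator: passes when `needle` appears in the output."""
    def __init__(self, needle: str) -> None:
        self.needle = needle

    async def run(self, ctx: Ctx) -> EvaluationReason:
        found = self.needle in ctx.output
        return EvaluationReason(value=found, reason=f"needle {self.needle!r} found: {found}")

result = asyncio.run(ContainsEvaluator("Paris").run(Ctx(output="The capital is Paris.")))
```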

TestCaseMetadata

ragpill.base.TestCaseMetadata

Bases: BaseModel

Metadata attached to a test case. For non-global evaluators, evaluator metadata takes precedence over case metadata; for global evaluators, case metadata takes precedence over evaluator metadata. This allows global evaluators to set default expected values that individual cases can override.

EvaluatorMetadata

ragpill.base.EvaluatorMetadata

Bases: BaseModel

Metadata for LLM evaluation evaluators.

See Also