
Base Classes

This module contains the base classes for building custom evaluators and test cases.

BaseEvaluator

ragpill.base.BaseEvaluator dataclass

BaseEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False)

Bases: Evaluator

Base class for all evaluators.

All custom evaluators must inherit from this class and implement:

  1. from_csv_line class method - for CSV integration with load_testset
  2. run async method - for evaluation logic

Attributes:

Name Type Description
evaluation_name UUID

Unique identifier for this evaluator instance

expected bool | None

Whether we expect this check to pass. Defaults to None, which means the value is inherited from the case's TestCaseMetadata.expected at evaluation time. If neither the evaluator nor the case metadata sets it, it defaults to True. For non-global evaluators, an explicit evaluator value takes precedence over case metadata; for global evaluators, case metadata takes precedence.

attributes dict

Dictionary for additional metadata (populated from extra CSV columns)

tags set[str]

Set of tags for organization and filtering

is_global bool

Whether this evaluator applies to all test cases

Note

The 'check' parameter is only used in from_csv_line() to pass configuration when creating the evaluator - it's not stored as a class attribute.

See Also

ragpill.csv.testset.load_testset: Create datasets from CSV files

metadata property

metadata

Build metadata from evaluator fields.

from_csv_line classmethod

from_csv_line(expected, tags, check, **kwargs)

Create an evaluator from a CSV line.

This class method is required for CSV integration with load_testset. The signature must be exactly as shown. Subclasses can override this method to customize how they parse the check parameter or handle additional configuration.

Custom Attributes

Any additional CSV columns beyond the standard ones (Question, test_type, expected, tags, check) will be passed as **kwargs and stored in the evaluator's attributes dict. These can be used for metadata tracking, filtering, or custom logic.

If all evaluators for a question share the same attribute value, that attribute becomes part of the Test Case metadata and will be visible in MLflow.
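As a standalone sketch (the function and column names here are hypothetical, and no ragpill import is needed), the flow from extra CSV columns into the attributes dict looks like:

```python
from typing import Any

# Hypothetical stand-in for from_csv_line: the standard columns are named
# parameters, and every extra CSV column lands in **kwargs, which would be
# stored as evaluator.attributes.
def collect_attributes(expected: bool, tags: set[str], check: str, **kwargs: Any) -> dict[str, Any]:
    return kwargs

# "priority" and "category" are illustrative extra columns.
attrs = collect_attributes(True, {"smoke"}, "{}", priority="high", category="geography")
```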

Parameterization Patterns

There are two ways to parameterize custom evaluators:

  1. Environment Variables (for shared config across all instances): Use pydantic-settings BaseSettings to load from environment variables. Good for API keys, global thresholds, model names, etc.

  2. JSON in check column (for per-instance config): Parse JSON from the check parameter to get per-test configuration. Good for regex patterns, specific values, test-specific thresholds.
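The two patterns can be sketched side by side. This is a minimal standalone illustration: the guide suggests pydantic-settings for pattern 1, but plain `os.environ` shows the same idea, and the variable names and values below are assumptions, not ragpill API.

```python
import json
import os

# Pattern 1 (sketch): shared config from environment variables.
# Normally MYEVAL_MODEL would be set in the shell; setdefault keeps this runnable.
os.environ.setdefault("MYEVAL_MODEL", "gpt-4o-mini")
model_name = os.environ["MYEVAL_MODEL"]

# Pattern 2 (sketch): per-instance config as JSON in the `check` column.
check = '{"pattern": "Paris", "threshold": 0.8}'  # one CSV cell
check_params = json.loads(check) if check else {}
```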

Parameters:

Name Type Description Default
expected bool

Whether we expect this check to pass. Set to true for normal tests (e.g., "answer should mention Paris"). Set to false for negative tests (e.g., "answer should NOT hallucinate links"). The evaluation result is compared against this expectation. When constructing evaluators programmatically (not via CSV), you can omit this to inherit the value from case metadata at evaluation time.

required
tags set[str]

Set of tags parsed from the comma-separated tags column in the CSV, used for categorization and filtering.

required
check str

Evaluator-specific configuration data. Can be JSON string or plain text. For JSON: Will be parsed and passed as **check_params to the evaluator. For plain text: Subclasses should override this method to handle their format.

required
**kwargs Any

Additional attributes from extra CSV columns (e.g., priority, category). These become part of evaluator.attributes and are used for: - Metadata tracking and filtering - MLflow logging (when shared across all evaluators of a question) - Custom evaluation logic in your evaluators

{}

Returns:

Type Description
BaseEvaluator

Instance of the evaluator class

Raises:

Type Description
NotImplementedError

If check is not valid JSON and subclass hasn't overridden this method

Example

For CSV usage examples, see the CSV Adapter Guide and Custom Evaluators Guide.

class MyEvaluator(BaseEvaluator):
    pattern: str

    @classmethod
    def from_csv_line(cls, expected: bool, tags: set[str],
                    check: str, **kwargs):
        # Parse check parameter (JSON or plain text)
        try:
            check_dict = json.loads(check)
            pattern = check_dict.get('pattern', check)
        except json.JSONDecodeError:
            pattern = check  # Use as-is

        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,  # Contains custom CSV columns
            pattern=pattern,
        )
Source code in src/ragpill/base.py
@classmethod
def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any) -> "BaseEvaluator":
    """Create an evaluator from a CSV line.

    This class method is required for CSV integration with
    [`load_testset`][ragpill.csv.testset.load_testset].
    The signature must be exactly as shown. Subclasses can override this method to
    customize how they parse the check parameter or handle additional configuration.

    Custom Attributes:
        Any additional CSV columns beyond the standard ones (Question, test_type, expected,
        tags, check) will be passed as **kwargs and stored in the evaluator's
        attributes dict. These can be used for metadata tracking, filtering, or custom logic.

        If all evaluators for a question share the same attribute value, that attribute
        becomes part of the Test Case metadata and will be visible in MLflow.

    Parameterization Patterns:
        There are two ways to parameterize custom evaluators:

        1. **Environment Variables** (for shared config across all instances):
           Use pydantic-settings BaseSettings to load from environment variables.
           Good for API keys, global thresholds, model names, etc.

        2. **JSON in check column** (for per-instance config):
           Parse JSON from the check parameter to get per-test configuration.
           Good for regex patterns, specific values, test-specific thresholds.

    Args:
        expected: Whether we expect this check to pass.
                 Set to `true` for normal tests (e.g., "answer should mention Paris").
                 Set to `false` for negative tests (e.g., "answer should NOT hallucinate links").
                 The evaluation result is compared against this expectation.
                 When constructing evaluators programmatically (not via CSV), you can
                 omit this to inherit the value from case metadata at evaluation time.
        tags: Set of tags parsed from the comma-separated tags column, used for categorization and filtering.
        check: Evaluator-specific configuration data. Can be JSON string or plain text.
               For JSON: Will be parsed and passed as **check_params to the evaluator.
               For plain text: Subclasses should override this method to handle their format.
        **kwargs: Additional attributes from extra CSV columns (e.g., priority, category).
                 These become part of `evaluator.attributes` and are used for:
                 - Metadata tracking and filtering
                 - MLflow logging (when shared across all evaluators of a question)
                 - Custom evaluation logic in your evaluators

    Returns:
        Instance of the evaluator class

    Raises:
        NotImplementedError: If check is not valid JSON and subclass hasn't overridden this method

    Example:
        For CSV usage examples, see the
        [CSV Adapter Guide](https://joelgotsch.github.io/ragpill/latest/guide/csv-adapter/) and
        [Custom Evaluators Guide](https://joelgotsch.github.io/ragpill/latest/guide/evaluators/).

        ```python
        class MyEvaluator(BaseEvaluator):
            pattern: str

            @classmethod
            def from_csv_line(cls, expected: bool, tags: set[str],
                            check: str, **kwargs):
                # Parse check parameter (JSON or plain text)
                try:
                    check_dict = json.loads(check)
                    pattern = check_dict.get('pattern', check)
                except json.JSONDecodeError:
                    pattern = check  # Use as-is

                return cls(
                    expected=expected,
                    tags=tags,
                    attributes=kwargs,  # Contains custom CSV columns
                    pattern=pattern,
                )
        ```
    """

    # Try to parse check as JSON, if it fails treat as plain text
    check_params: dict[str, Any] = {}
    if check:
        try:
            check_params = json.loads(check)
        except json.JSONDecodeError:
            # If not JSON, subclasses should override this method
            # to handle their specific format
            raise NotImplementedError(
                f"Subclasses must implement from_csv_line to handle non-JSON check format: {check}"
            )

    return cls(expected=expected, tags=tags, attributes=kwargs, **check_params)

run async

run(ctx)

The method implementing the evaluation logic. Override this in subclasses.

:param ctx: The evaluator context
:type ctx: EvaluatorContext[Any, Any, EvaluatorMetadata]
:return: The evaluation result with reason
:rtype: EvaluationReason

Source code in src/ragpill/base.py
async def run(
    self,
    ctx: EvaluatorContext[Any, Any, EvaluatorMetadata],  # pyright: ignore[reportUnusedParameter]  # ctx used by subclasses
) -> EvaluationReason:
    """
    The method implementing the evaluation logic. Override this in subclasses.

    :param ctx: The evaluator context
    :type ctx: EvaluatorContext[Any, Any, EvaluatorMetadata]
    :return: The evaluation result with reason
    :rtype: EvaluationReason
    """
    raise NotImplementedError("Subclasses must implement the run method.")
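A minimal subclass override might look like the following sketch. EvaluationReason and the context are stubbed here so the example runs without ragpill installed; the real `run` receives the library's EvaluatorContext and returns its EvaluationReason, and the evaluator name and logic are hypothetical.

```python
import asyncio
from dataclasses import dataclass

# Stubs standing in for ragpill's EvaluationReason and EvaluatorContext,
# so this sketch is self-contained.
@dataclass
class EvaluationReason:
    value: bool
    reason: str

@dataclass
class Ctx:
    output: str

class ContainsEvaluator:
    """Hypothetical evaluator: passes when `needle` appears in the output."""
    def __init__(self, needle: str) -> None:
        self.needle = needle

    async def run(self, ctx: Ctx) -> EvaluationReason:
        found = self.needle in ctx.output
        return EvaluationReason(value=found, reason=f"needle {self.needle!r} found: {found}")

result = asyncio.run(ContainsEvaluator("Paris").run(Ctx(output="The capital is Paris.")))
```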

TestCaseMetadata

ragpill.base.TestCaseMetadata

Bases: BaseModel

Metadata attached to a test case. For non-global evaluators, evaluator metadata takes precedence over case metadata; for global evaluators, case metadata takes precedence over evaluator metadata. This allows global evaluators to set default expected values that individual cases can override.

EvaluatorMetadata

ragpill.base.EvaluatorMetadata

Bases: BaseModel

Metadata for LLM evaluation evaluators.

See Also