Evaluators

This module provides pre-built evaluators for common evaluation tasks.

Since tasks can have arbitrary input and output types, these evaluators generally coerce inputs and outputs to strings and use string-based evaluation methods (LLM judges, regex checks, etc.).

It is, however, straightforward to create your own custom evaluators by inheriting from BaseEvaluator and implementing the run() method. See Create custom evaluators for a tutorial.
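For orientation, here is a minimal sketch of the shape such a custom evaluator takes. The EvaluationReason dataclass and ctx.output attribute mirror the patterns in the source listings below, but the base-class plumbing is simplified and the ContainsWordEvaluator is purely illustrative:

```python
import asyncio
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any

@dataclass
class EvaluationReason:
    """Stand-in for ragpill's EvaluationReason (a boolean value plus a human-readable reason)."""
    value: bool
    reason: str

class ContainsWordEvaluator:
    """Sketch of a custom evaluator: passes when `word` appears in the output."""
    def __init__(self, word: str):
        self.word = word

    async def run(self, ctx: Any) -> EvaluationReason:
        # Coerce the (arbitrary) output to a string, as the built-in evaluators do.
        output_str = str(ctx.output)
        found = self.word.casefold() in output_str.casefold()
        return EvaluationReason(
            value=found,
            reason=f"Word {self.word!r} {'found' if found else 'not found'} in output.",
        )

# Toy context object standing in for EvaluatorContext:
ctx = SimpleNamespace(output="Paris is the capital of France.")
print(asyncio.run(ContainsWordEvaluator("capital").run(ctx)).value)  # -> True
```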

LLMJudge

ragpill.evaluators.LLMJudge dataclass

LLMJudge(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, rubric, model=_get_default_judge_llm(), include_input=False)

Bases: BaseEvaluator

The LLMJudge evaluator uses a language model to judge whether an output meets specified rubric.

A rubric is usually one of the following:

  • A fact that the output should (or should not) contain (rubric="Output must contain the fact that Paris is the capital of France.")
  • A constraint on the style of the output (rubric="Output should be in a formal tone." or rubric="Output should be in German.")

Note: Avoid complex instructions in the rubric, as the model may not follow them reliably. Instead, break the rubric down into multiple LLMJudge instances.

metadata property

metadata

Build metadata from evaluator fields.

The default in BaseEvaluator is overridden because the model field is not picklable and cannot be excluded otherwise.

from_csv_line classmethod

from_csv_line(expected, tags, check, get_llm=_get_default_judge_llm, **kwargs)

Create an LLMJudge from a CSV line.

This method is used by the CSV testset loader to instantiate the evaluator. See load_testset for more details.

For LLMJudge, the check parameter is treated as the rubric text. If check is a JSON object with a 'rubric' key, that value is used. Otherwise, the entire check string is used as the rubric.

Parameters:

  • expected (bool, required) - Expected evaluation result
  • tags (set[str], required) - Comma-separated tags string
  • check (str, required) - Rubric text or JSON with 'rubric' key
  • get_llm (Callable[[], Model], default: _get_default_judge_llm) - Callable that returns a Model instance
  • **kwargs (Any, default: {}) - Additional parameters (can include 'model' to override default)

Note: The model parameter must be provided. It can come from:

  • Dependency injection (e.g., a module-level or class-level settings object)
  • The check column as JSON: {"rubric": "...", "model": "openai:gpt-4o"}
  • An additional CSV column named 'model'
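The rubric extraction can be illustrated with a small standalone sketch (parse_check is a hypothetical helper mirroring the parsing logic in the source listing below, not part of ragpill's API):

```python
import json

def parse_check(check: str) -> tuple[str, dict]:
    """Sketch of how from_csv_line interprets the check column."""
    rubric, extra = check, {}
    try:
        obj = json.loads(check)
        if isinstance(obj, dict):
            # JSON object: 'rubric' is extracted, remaining keys pass through.
            rubric = str(obj.pop("rubric", check))
            extra = obj
    except json.JSONDecodeError:
        pass  # Plain text: the whole string is the rubric.
    return rubric, extra

# Plain-text check column:
print(parse_check("Output should be in German."))
# JSON check column with a model override:
print(parse_check('{"rubric": "Output mentions Paris.", "model": "openai:gpt-4o"}'))
```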

Source code in src/ragpill/evaluators.py
@classmethod
def from_csv_line(
    cls,
    expected: bool,
    tags: set[str],
    check: str,
    get_llm: Callable[[], models.Model] = _get_default_judge_llm,
    **kwargs: Any,
) -> "LLMJudge":
    """Create an LLMJudge from a CSV line.

    This method is used by the CSV testset loader to instantiate the evaluator.
    See [`load_testset`][ragpill.csv.testset.load_testset] for more details.

    For LLMJudge, the check parameter is treated as the rubric text.
    If check is a JSON object with a 'rubric' key, that value is used.
    Otherwise, the entire check string is used as the rubric.

    Args:
        expected: Expected evaluation result
        tags: Comma-separated tags string
        check: Rubric text or JSON with 'rubric' key
        get_llm: Callable that returns a Model instance (defaults to get_default_judge_llm)
        **kwargs: Additional parameters (can include 'model' to override default)

    Note: The model parameter must be provided. It should come from:
    - Dependency injection (e.g., a module-level or class-level settings object)
    - The check column as JSON: {"rubric": "...", "model": "openai:gpt-4o"}
    - An additional CSV column named 'model'
    """

    if not check:
        raise ValueError("LLMJudge requires a non-empty 'check' parameter for the rubric.")
    rubric: str = check
    try:
        check_obj: Any = json.loads(check)
        if isinstance(check_obj, dict):
            rubric = str(check_obj.pop("rubric", check))  # pyright: ignore[reportUnknownMemberType,reportUnknownArgumentType]
            kwargs.update(check_obj)  # pyright: ignore[reportUnknownArgumentType]
    except json.JSONDecodeError:
        # Plain text - use as rubric
        pass
    model = kwargs.pop("model", None) or get_llm()

    return cls(
        rubric=rubric,
        model=model,
        expected=expected,
        tags=tags,
        attributes=kwargs,
    )

RegexInSourcesEvaluator

ragpill.evaluators.RegexInSourcesEvaluator dataclass

RegexInSourcesEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.', pattern)

Bases: SourcesBaseEvaluator

Evaluator that checks whether a regex pattern is found in the content of any source document. The documents are retrieved from the mlflow trace and include documents from retriever, tool, and reranker spans.

Both the pattern and document contents are normalized before matching via _normalize_text, which applies:

  • Case-folding - all text is lowercased (str.casefold), so matching is always case-insensitive. Using the (?i) flag is therefore redundant.
  • Unicode NFKC - compatibility characters are unified (e.g. UF₆ → UF6).
  • Whitespace collapsing - runs of whitespace become a single space.
  • Quote normalization - curly quotes, guillemets, primes, etc. are replaced with a straight single quote '.
  • Markdown subscript stripping - e.g. UF~6~ → UF6.
  • Trailing period stripping.
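The steps above can be sketched as follows; this is a minimal stand-in for _normalize_text, so the exact step order and quote-character set are assumptions:

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Sketch of the normalization steps listed above (not ragpill's actual implementation)."""
    text = unicodedata.normalize("NFKC", text)   # unify compatibility chars: UF₆ -> UF6
    text = text.casefold()                       # matching becomes case-insensitive
    text = re.sub(r"~(\w+)~", r"\1", text)       # strip markdown subscripts: UF~6~ -> UF6
    text = re.sub(r"[\u2018\u2019\u201c\u201d\u00ab\u00bb\u2032\u2033]", "'", text)  # quotes -> '
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace runs
    return text.rstrip(".")                      # strip trailing period(s)

print(normalize_text("UF₆  is  “volatile”."))  # -> "uf6 is 'volatile'"
```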

Tip: Use inline regex flags to modify matching behavior:

  • (?s)pattern - Dotall mode (. matches newlines, useful for multi-line content)
  • (?m)pattern - Multiline mode (^ and $ match line boundaries)
  • (?ms)pattern - Combine multiple flags
Example
# In CSV testset:
# check="section 1"  # Already case-insensitive via normalization
# check="(?s)start.*end"  # Match across newlines
# check="(?s)important.*conclusion"  # Dotall for multi-line matching

from_csv_line classmethod

from_csv_line(expected, tags, check, **kwargs)

Create a RegexInSourcesEvaluator from a CSV line.

This method is used by the CSV testset loader to instantiate the evaluator. See load_testset for more details.

Parameters:

  • expected (bool, required) - Expected evaluation result
  • tags (set[str], required) - Comma-separated tags string
  • check (str, required) - Regex pattern to search for in document contents
  • **kwargs (Any, default: {}) - Additional attributes for the evaluator
Source code in src/ragpill/evaluators.py
@classmethod
def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any) -> "RegexInSourcesEvaluator":
    """Create a RegexInSourcesEvaluator from a CSV line.

    This method is used by the CSV testset loader to instantiate the evaluator.
    See [`load_testset`][ragpill.csv.testset.load_testset] for more details.

    Args:
        expected: Expected evaluation result
        tags: Comma-separated tags string
        check: Regex pattern to search for in document contents
        **kwargs: Additional attributes for the evaluator
    """
    pattern = _normalize_text(check)
    evaluation_function = _regex_in_any_document_content(pattern)
    return cls(
        expected=expected,
        tags=tags,
        evaluation_function=evaluation_function,
        pattern=pattern,
        attributes=kwargs,
        custom_reason_true=f'Regex pattern "{pattern}" found in at least one document content.',
        custom_reason_false=f'Regex pattern "{pattern}" not found in any document content.',
    )

RegexInDocumentMetadataEvaluator

ragpill.evaluators.RegexInDocumentMetadataEvaluator dataclass

RegexInDocumentMetadataEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.', metadata_key, pattern)

Bases: SourcesBaseEvaluator

Evaluator to check if a regex pattern is found in a specific metadata field of any document retrieved from the mlflow trace.

The documents are retrieved from mlflow trace and include documents from retriever, tool, and reranker spans.

Note: When created from CSV, 'check' must be a JSON string with 'pattern' and 'key' fields. The evaluator then checks whether any document in the used sources has metadata[key] matching the regex pattern.

Both the pattern and metadata values are normalized before matching via _normalize_text, which applies case-folding (str.casefold), Unicode NFKC, whitespace collapsing, and quote normalization. Because text is already case-folded, the (?i) flag is redundant.

Inline regex flags still work:

  • (?s)pattern - Dotall mode (. matches newlines, useful for multi-line metadata values)
  • (?m)pattern - Multiline mode (^ and $ match line boundaries)
  • (?ms)pattern - Combine multiple flags
Example
# In CSV testset:
# check='{"pattern": "chapter.*", "key": "source"}'  # Already case-insensitive via normalization
# check='{"pattern": "(?s)start.*end", "key": "content"}'  # Match across newlines in 'content' metadata

from_csv_line classmethod

from_csv_line(expected, tags, check, **kwargs)

Create a RegexInDocumentMetadataEvaluator from a CSV line.

This method is used by the CSV testset loader to instantiate the evaluator. See load_testset for more details.

Parameters:

  • expected (bool, required) - Expected evaluation result
  • tags (set[str], required) - Comma-separated tags string
  • check (str, required) - JSON with two keys, "pattern" and "key": the regex pattern to search for and the document metadata key to search in
  • **kwargs (Any, default: {}) - Additional attributes for the evaluator
Source code in src/ragpill/evaluators.py
@classmethod
def from_csv_line(
    cls, expected: bool, tags: set[str], check: str, **kwargs: Any
) -> "RegexInDocumentMetadataEvaluator":
    """Create a RegexInDocumentMetadataEvaluator from a CSV line.

    This method is used by the CSV testset loader to instantiate the evaluator.
    See [`load_testset`][ragpill.csv.testset.load_testset] for more details.

    Args:
        expected: Expected evaluation result
        tags: Comma-separated tags string
        check: json with 2 keys: "pattern" and "key". Regex pattern to search for in document metadata key.
        **kwargs: Additional attributes for the evaluator
    """
    try:
        check_dict: Any = json.loads(check)
        assert isinstance(check_dict, dict) and "pattern" in check_dict and "key" in check_dict, (
            f"Check must be a JSON object with 'pattern' and 'key'. Got: {check}"
        )
        pattern: str = str(check_dict["pattern"])  # pyright: ignore[reportUnknownArgumentType]
        metadata_key: str = str(check_dict["key"])  # pyright: ignore[reportUnknownArgumentType]
    except json.JSONDecodeError:
        raise ValueError(
            f"RegexInDocumentMetadataEvaluator requires 'check' to be a JSON string with 'pattern' and 'key'. But got: {check}"
        )
    pattern = _normalize_text(pattern)
    evaluation_function = _regex_in_doc_metadata(metadata_key, pattern)
    return cls(
        expected=expected,
        tags=tags,
        evaluation_function=evaluation_function,
        metadata_key=metadata_key,
        pattern=pattern,
        attributes=kwargs,
        custom_reason_true=f'Regex pattern "{pattern}" found in key "{metadata_key}" of at least one document content.',
        custom_reason_false=f'Regex pattern "{pattern}" not found in key "{metadata_key}" of any document content.',
    )

RegexInOutputEvaluator

ragpill.evaluators.RegexInOutputEvaluator dataclass

RegexInOutputEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, pattern)

Bases: BaseEvaluator

Check whether a regex pattern matches the stringified output.

Both the pattern and the output are normalized before matching via _normalize_text, which applies case-folding (str.casefold), Unicode NFKC, whitespace collapsing, and quote normalization. Because text is already case-folded, the (?i) flag is redundant.

CSV usage examples
  • check="error|failure"
  • check='{"pattern": "success"}'
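The matching rule can be sketched in a few lines; regex_in_output is a simplified stand-in for ragpill's behavior (only case-folding is shown, the fuller whitespace/quote normalization is omitted):

```python
import re

def regex_in_output(pattern: str, output: str) -> bool:
    """Sketch: both sides are case-folded before matching, so (?i) is redundant."""
    return re.search(pattern.casefold(), output.casefold()) is not None

print(regex_in_output("ERROR|FAILURE", "fatal: Failure in step 3"))  # -> True
print(regex_in_output("success", "task failed"))                     # -> False
```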

from_csv_line classmethod

from_csv_line(expected, tags, check, **kwargs)

Create a RegexInOutputEvaluator from a CSV line.

Source code in src/ragpill/evaluators.py
@classmethod
def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any) -> "RegexInOutputEvaluator":
    """Create a RegexInOutputEvaluator from a CSV line."""
    if not check or not check.strip():
        raise ValueError("RegexInOutputEvaluator requires a non-empty 'check' pattern.")

    pattern: str = check
    try:
        parsed: dict[str, Any] | str = json.loads(check)
        if isinstance(parsed, dict) and "pattern" in parsed:
            pattern = str(parsed["pattern"])
        elif isinstance(parsed, str):
            pattern = parsed
    except json.JSONDecodeError:
        pass
    pattern = _normalize_text(pattern)
    return cls(
        pattern=pattern,
        expected=expected,
        tags=tags,
        attributes=kwargs,
    )

LiteralQuoteEvaluator

ragpill.evaluators.LiteralQuoteEvaluator dataclass

LiteralQuoteEvaluator(expected=True, tags=None, attributes=None, **kwargs)

Bases: SourcesBaseEvaluator

Verify that all markdown quotes in the output appear literally in source documents.

This evaluator ensures citations are accurate by checking that any text quoted in markdown blockquotes (lines starting with >) actually appears in the retrieved source documents. This is particularly valuable for RAG systems where accuracy of quoted material is critical.

The evaluator:

  1. Extracts all markdown blockquotes (lines starting with >) from the output
  2. Cleans quotes by removing quotation marks and normalizing whitespace
  3. Verifies each quote appears literally (ignoring whitespace) in source documents
  4. Reports any missing quotes with their referenced filenames when available

Only lines starting with > (after leading whitespace) are considered markdown quotes. Regular quoted text like "this" or 'this' is ignored.

Parameters:

  • expected (bool, default: True) - Expected evaluation result
  • tags (set[str] | None, default: None) - Set of tags for categorizing this evaluator
  • attributes (dict[str, Any] | None, default: None) - Additional attributes for the evaluator
Example
from ragpill.evaluators import LiteralQuoteEvaluator

# Create evaluator
evaluator = LiteralQuoteEvaluator(
    expected=True,
    tags={"quotation", "accuracy"}
)

# Output with markdown quote
output = '''
The report states:
> "'no longer outstanding at this stage' does not mean 'resolved'."
(File: [report.txt](link), Paragraph: 38)
'''

# The evaluator will verify this quote exists in the source documents
Markdown Quote Format

The evaluator recognizes standard markdown blockquotes:

> This is a single-line quote

> This is a multi-line quote
> that continues on the next line

> Quote with file reference
(File: [document.txt](link), Paragraph: 5)
Note
  • Whitespace differences between quotes and source text are ignored
  • Quotation marks (", ', ', ', ", ") are stripped before comparison
  • File references in format (File: [filename](...)) are extracted and included in error messages
  • Empty quotes (after cleaning) are skipped
  • Quotes must appear literally in source documents (no fuzzy matching)
See Also

  • SourcesBaseEvaluator - Base class that retrieves source documents from MLflow traces
  • RegexInSourcesEvaluator - Similar evaluator using regex patterns instead of literal quotes

Source code in src/ragpill/evaluators.py
def __init__(
    self,
    expected: bool = True,
    tags: set[str] | None = None,
    attributes: dict[str, Any] | None = None,
    **kwargs: Any,
):
    super().__init__(
        evaluation_function=lambda docs: True,  # Placeholder, actual logic is in run() for access to output
        expected=expected,
        tags=tags or set(),
        attributes=attributes or {},
        custom_reason_true="All quotes found in source documents.",
        custom_reason_false="",  # Will be set dynamically
        **kwargs,
    )

from_csv_line classmethod

from_csv_line(expected, tags, check, **kwargs)

Create a LiteralQuoteEvaluator from a CSV line.

This method is used by the CSV testset loader to instantiate the evaluator. See load_testset for more details.

Parameters:

  • expected (bool, required) - Expected evaluation result
  • tags (set[str], required) - Comma-separated tags string
  • check (str, required) - Not used for this evaluator (can be empty)
  • **kwargs (Any, default: {}) - Additional attributes for the evaluator
Source code in src/ragpill/evaluators.py
@classmethod
def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any) -> "LiteralQuoteEvaluator":
    """Create a LiteralQuoteEvaluator from a CSV line.

    This method is used by the CSV testset loader to instantiate the evaluator.
    See [`load_testset`][ragpill.csv.testset.load_testset] for more details.

    Args:
        expected: Expected evaluation result
        tags: Comma-separated tags string
        check: Not used for this evaluator (can be empty)
        **kwargs: Additional attributes for the evaluator
    """
    return cls(
        expected=expected,
        tags=tags,
        attributes=kwargs,
    )

run async

run(ctx)

Override run to have access to both output and documents.

Source code in src/ragpill/evaluators.py
async def run(
    self,
    ctx: EvaluatorContext[Any, Any, EvaluatorMetadata],
) -> EvaluationReason:
    """Override run to have access to both output and documents."""
    documents = self.get_documents(ctx.inputs)
    output_str = str(ctx.output)

    # Extract normalized quotes from output
    quotes = _extract_markdown_quotes(output_str)

    if not quotes:
        return EvaluationReason(
            value=True,
            reason="No quotes found in output.",
        )

    # Normalize all document contents
    normalized_docs = [_normalize_text(doc.page_content) for doc in documents]

    # Check each quote
    not_found: list[str] = []
    for quote, referenced_file in quotes:
        # Check if quote appears in any document
        # Use regex search if quote contains .* (from ellipsis conversion), otherwise use substring match
        if ".*" in quote:
            pattern = re.escape(quote).replace(r"\.\*", ".*")
            found = any(re.search(pattern, doc_content) for doc_content in normalized_docs)
        else:
            found = any(quote in doc_content for doc_content in normalized_docs)

        if not found:
            if referenced_file:
                not_found.append(f'"{quote}" (Referenced file: {referenced_file})')
            else:
                not_found.append(f'"{quote}"')

    if not_found:
        reason = f"Quotes not found in sources: {'; '.join(not_found)}"
        return EvaluationReason(
            value=False,
            reason=reason,
        )

    return EvaluationReason(
        value=True,
        reason=f"All {len(quotes)} quote(s) found in source documents.",
    )

HasQuotesEvaluator

ragpill.evaluators.HasQuotesEvaluator dataclass

HasQuotesEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, min_quotes=1, max_quotes=-1)

Bases: BaseEvaluator

Check if the output contains a minimum (and optionally maximum) number of markdown quotes.

This evaluator verifies that the output includes at least a specified number of markdown blockquotes (lines starting with >). Useful for ensuring responses include citations, evidence, or quoted material.

Only lines starting with > (after leading whitespace) are considered markdown quotes. Regular quoted text like "this" or 'this' is ignored.

Parameters:

  • min_quotes (int, default: 1) - Minimum number of quotes required
  • max_quotes (int, default: -1) - Maximum number of quotes allowed (-1 means no maximum)
  • expected (bool | None, default: None) - Expected evaluation result
  • tags (set[str], default: set()) - Set of tags for categorizing this evaluator
  • attributes (dict[str, Any], default: dict()) - Additional attributes for the evaluator
Example
from ragpill.evaluators import HasQuotesEvaluator

# Require at least 2 quotes
evaluator = HasQuotesEvaluator(
    min_quotes=2,
    expected=True,
    tags={"quotation", "format"}
)

# Require between 2 and 5 quotes
evaluator = HasQuotesEvaluator(
    min_quotes=2,
    max_quotes=5,
    expected=True,
    tags={"quotation", "format"}
)

# This output has 2 quotes and will pass
output = '''
The report states two key points:
> "First important point."

And also:
> "Second important point."
'''
Note
  • Multi-line quotes (consecutive lines with >) are counted as one quote
  • Empty quotes (only whitespace after >) are not counted
  • The evaluator passes if min_quotes <= num_quotes <= max_quotes (or no max if max_quotes=-1)
  • Set expected=False to verify that quotes are NOT within the specified range
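The counting rules above can be sketched as a standalone helper (count_markdown_quotes is illustrative, not ragpill's actual extraction code):

```python
def count_markdown_quotes(output: str) -> int:
    """Sketch: consecutive '>' lines form one quote; quotes that are
    empty after stripping are not counted."""
    count = 0
    current: list[str] = []
    for line in output.splitlines() + [""]:  # sentinel line closes a trailing quote
        stripped = line.lstrip()
        if stripped.startswith(">"):
            current.append(stripped.lstrip(">").strip())
        else:
            if any(current):  # block had at least one non-empty quoted line
                count += 1
            current = []
    return count

text = "Intro\n> first quote\n> continues\n\n> second quote\n>\nEnd"
print(count_markdown_quotes(text))  # -> 2 (multi-line quote counts once)
```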
See Also

  • LiteralQuoteEvaluator - Verifies quotes appear literally in source documents
  • BaseEvaluator - Base class for all evaluators

from_csv_line classmethod

from_csv_line(expected, tags, check, **kwargs)

Create a HasQuotesEvaluator from a CSV line.

This method is used by the CSV testset loader to instantiate the evaluator. See load_testset for more details.

Parameters:

  • expected (bool, required) - Expected evaluation result
  • tags (set[str], required) - Comma-separated tags string
  • check (str, required) - Either an integer for min_quotes, or JSON with 'min_quotes' and optionally 'max_quotes'. If empty, defaults to min_quotes=1, max_quotes=-1.
  • **kwargs (Any, default: {}) - Additional attributes for the evaluator
Example

In CSV, use check="3" to require at least 3 quotes. Or use check='{"min_quotes": 2, "max_quotes": 5}' to require 2-5 quotes.

Source code in src/ragpill/evaluators.py
@classmethod
def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any) -> "HasQuotesEvaluator":
    """Create a HasQuotesEvaluator from a CSV line.

    This method is used by the CSV testset loader to instantiate the evaluator.
    See [`load_testset`][ragpill.csv.testset.load_testset] for more details.

    Args:
        expected: Expected evaluation result
        tags: Comma-separated tags string
        check: Either an integer for min_quotes, or JSON with 'min_quotes' and optionally 'max_quotes'.
               If empty, defaults to min_quotes=1, max_quotes=-1.
        **kwargs: Additional attributes for the evaluator

    Example:
        In CSV, use check="3" to require at least 3 quotes.
        Or use check='{"min_quotes": 2, "max_quotes": 5}' to require 2-5 quotes.
    """
    min_quotes = 1  # default
    max_quotes = -1  # default (no maximum)

    if check and check.strip():
        # Try parsing as JSON first
        try:
            check_parsed: Any = json.loads(check)
            if isinstance(check_parsed, dict):
                min_quotes = int(check_parsed.get("min_quotes", min_quotes))  # pyright: ignore[reportUnknownMemberType,reportUnknownArgumentType]
                max_quotes = int(check_parsed.get("max_quotes", max_quotes))  # pyright: ignore[reportUnknownMemberType,reportUnknownArgumentType]
            elif isinstance(check_parsed, (int, float)):
                min_quotes = int(check_parsed)
            else:
                raise ValueError("JSON must be an object or number")

            if min_quotes < 0:
                raise ValueError(f"min_quotes must be non-negative, got {min_quotes}")
            if max_quotes != -1 and max_quotes < min_quotes:
                raise ValueError(f"max_quotes ({max_quotes}) must be >= min_quotes ({min_quotes}) or -1")
        except json.JSONDecodeError:
            # Not JSON, treat as integer for min_quotes
            try:
                min_quotes = int(check)
                if min_quotes < 0:
                    raise ValueError(f"min_quotes must be non-negative, got {min_quotes}")
            except ValueError as e:
                raise ValueError(
                    f"HasQuotesEvaluator 'check' parameter must be a non-negative integer or JSON object. Got: {check}"
                ) from e

    return cls(
        min_quotes=min_quotes,
        max_quotes=max_quotes,
        expected=expected,
        tags=tags,
        attributes=kwargs,
    )

run async

run(ctx)

Check if output contains the required number of quotes (within min/max bounds).

Source code in src/ragpill/evaluators.py
async def run(
    self,
    ctx: EvaluatorContext[object, object, EvaluatorMetadata],
) -> EvaluationReason:
    """Check if output contains the required number of quotes (within min/max bounds)."""
    output_str = str(ctx.output)

    # Extract quotes from output
    quotes = self._extract_quotes_from_output(output_str)
    num_quotes = len(quotes)

    # Check if we have enough quotes
    has_min = num_quotes >= self.min_quotes
    has_max = self.max_quotes == -1 or num_quotes <= self.max_quotes
    passes = has_min and has_max

    # Build reason message
    if passes:
        if self.max_quotes == -1:
            reason = f"Found {num_quotes} quote(s) in output (minimum required: {self.min_quotes})."
        else:
            reason = f"Found {num_quotes} quote(s) in output (required range: {self.min_quotes}-{self.max_quotes})."
    else:
        if not has_min:
            reason = f"Found only {num_quotes} quote(s) in output, but {self.min_quotes} required."
        else:  # not has_max
            reason = f"Found {num_quotes} quote(s) in output, but maximum allowed is {self.max_quotes}."

    return EvaluationReason(
        value=passes,
        reason=reason,
    )

Base Evaluators

These are evaluators that are useful to inherit from. See Create custom evaluators for a tutorial.

WrappedPydanticEvaluator

ragpill.evaluators.WrappedPydanticEvaluator dataclass

WrappedPydanticEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, pydantic_evaluator)

Bases: BaseEvaluator

Wrapper to use any pydantic-evals Evaluator as a ragpill BaseEvaluator. See https://ai.pydantic.dev/evals/evaluators/overview/ for a list. Limitation: span-based evaluators are not supported, as logfire is not yet supported in ragpill.

Note: If you want to use pydantic-evals evaluators in your csv-defined testsets, you need to define a subclass of this class that implements from_csv_line to create the specific pydantic evaluator.

Attributes:

  • pydantic_evaluator (Evaluator) - The pydantic-evals Evaluator instance to wrap

Example

```python
from pydantic_evals.evaluators import SomePydanticEvaluator
from ragpill.base import WrappedPydanticEvaluator

ragpill_evaluator = WrappedPydanticEvaluator(
    pydantic_evaluator=SomePydanticEvaluator(...),
    expected=True,
    tags={"tag1", "tag2"},
    attributes={"attr1": "value1"},
)
```

SpanBaseEvaluator

ragpill.evaluators.SpanBaseEvaluator dataclass

SpanBaseEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key)

Bases: BaseEvaluator

This base class retrieves the spans from the mlflow trace, allowing subclasses to implement evaluation logic based on spans. Why is this useful? See https://ai.pydantic.dev/evals/evaluators/span-based/

Why Span-Based Evaluation?

Traditional evaluators assess task inputs and outputs. For simple tasks, this may be sufficient: if the output is correct, the task succeeded. But for complex multi-step agents, the process matters as much as the result:

  • A correct answer reached incorrectly - An agent might produce the right output by accident (e.g., guessing, using cached data when it should have searched, or calling the wrong tools but getting lucky)
  • Verification of required behaviors - You need to ensure specific tools were called, certain code paths executed, or particular patterns followed
  • Performance and efficiency - The agent should reach the answer efficiently, without unnecessary tool calls, infinite loops, or excessive retries
  • Safety and compliance - Critical to verify that dangerous operations weren't attempted, sensitive data wasn't accessed inappropriately, or guardrails weren't bypassed

Real-World Scenarios

Span-based evaluation is particularly valuable for:

  • RAG systems - Verify documents were retrieved and reranked before generation, not just that the answer included citations
  • Multi-agent coordination - Ensure the orchestrator delegated to the right specialist agents in the correct order
  • Tool-calling agents - Confirm specific tools were used (or avoided), and in the expected sequence
  • Debugging and regression testing - Catch behavioral regressions where outputs remain correct but the internal logic deteriorates
  • Production alignment - Ensure your evaluation assertions operate on the same telemetry data captured in production, so eval insights directly translate to production monitoring

How It Works

When tracing the mlflow experiment, a hash of the input is stored as a span attribute (input_key). The evaluator uses this to find the trace for the given input of the running experiment.

Span-based evaluators can then assert on:

  • Which tools were called - HasMatchingSpan(query={'name_contains': 'search_tool'})
  • Code paths executed - Verify specific functions ran or particular branches were taken
  • Timing characteristics - Check that operations complete within SLA bounds
  • Error conditions - Detect retries, fallbacks, or specific failure modes
  • Execution structure - Verify parent-child relationships, delegation patterns, or execution order

This creates a fundamentally different evaluation paradigm: you are testing behavioral contracts, not just input-output relationships.
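The kind of behavioral check such an evaluator performs can be illustrated with plain dicts standing in for mlflow spans; the has_span_named helper and the span fields below are illustrative, not ragpill's actual interface:

```python
def has_span_named(spans: list[dict], name_contains: str) -> bool:
    """Sketch: the core assertion behind a span-based behavioral check."""
    return any(name_contains in span.get("name", "") for span in spans)

# Toy trace: a RAG run that retrieved and reranked before generating.
trace = [
    {"name": "retriever.search", "duration_ms": 120},
    {"name": "reranker.rerank", "duration_ms": 40},
    {"name": "llm.generate", "duration_ms": 900},
]

print(has_span_named(trace, "reranker"))    # -> True: reranking actually happened
print(has_span_named(trace, "web_search"))  # -> False: no unexpected tool call
```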

SourcesBaseEvaluator

ragpill.evaluators.SourcesBaseEvaluator dataclass

SourcesBaseEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.')

Bases: SpanBaseEvaluator

This base class retrieves the sources from the mlflow trace.

Note: only documents retrieved from a retriever, reranker or tool span are considered as sources.

See Also