Evaluators¶
This module provides pre-built evaluators for common evaluation tasks.
Because tasks can have arbitrary input and output types, these evaluators generally coerce inputs and outputs to strings and use string-based evaluation methods (LLM judges, regex checks, etc.).
You can also create your own custom evaluators by inheriting from BaseEvaluator and implementing the run() method. See Create custom evaluators for a tutorial.
LLMJudge¶
ragpill.evaluators.LLMJudge
dataclass
¶
LLMJudge(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, rubric, model=_get_default_judge_llm(), include_input=False)
Bases: BaseEvaluator
The LLMJudge evaluator uses a language model to judge whether an output meets a specified rubric.
A rubric is usually one of the following:

- A fact that the output should (or should not) contain, e.g. `rubric="Output must contain the fact that Paris is the capital of France."`
- A constraint on the style of the output, e.g. `rubric="Output should be in a formal tone."` or `rubric="Output should be in German."`
Note: Avoid complex instructions in the rubric, as the model may not follow them reliably. Instead, try to break it down into multiple instances of the LLMJudge.
metadata
property
¶
Build metadata from evaluator fields.
The default from BaseEvaluator is overridden here because the non-picklable model field cannot otherwise be excluded.
from_csv_line
classmethod
¶
Create an LLMJudge from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
For LLMJudge, the check parameter is treated as the rubric text. If check is a JSON object with a 'rubric' key, that value is used. Otherwise, the entire check string is used as the rubric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `bool` | Expected evaluation result | *required* |
| `tags` | `set[str]` | Comma-separated tags string | *required* |
| `check` | `str` | Rubric text or JSON with a 'rubric' key | *required* |
| `get_llm` | `Callable[[], Model]` | Callable that returns a Model instance | `_get_default_judge_llm` |
| `**kwargs` | `Any` | Additional parameters (can include 'model' to override the default) | `{}` |
Note: The model parameter must be provided. It should come from one of:

- Dependency injection (e.g. a module-level or class-level settings object)
- The check column as JSON: `{"rubric": "...", "model": "openai:gpt-4o"}`
- An additional CSV column named 'model'
Source code in src/ragpill/evaluators.py
RegexInSourcesEvaluator¶
ragpill.evaluators.RegexInSourcesEvaluator
dataclass
¶
RegexInSourcesEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.', pattern)
Bases: SourcesBaseEvaluator
Evaluator to check if a regex pattern is found in the content of any source document. The documents are retrieved from the mlflow trace and include documents from retriever, tool, and reranker spans.
Both the pattern and document contents are normalized before matching via `_normalize_text`, which applies:

- Case-folding: all text is lowercased (`str.casefold`), so matching is always case-insensitive. Using the `(?i)` flag is therefore redundant.
- Unicode NFKC: compatibility characters are unified (e.g. `UF₆` ↔ `UF6`).
- Whitespace collapsing: runs of whitespace become a single space.
- Quote normalization: curly quotes, guillemets, primes, etc. are replaced with a straight single quote `'`.
- Markdown subscript stripping: e.g. `UF~6~` → `UF6`.
- Trailing period stripping.
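The normalization steps above can be approximated with the Python standard library. This is a minimal sketch of the described behavior, not the library's actual `_normalize_text`; the exact regexes and step order are assumptions:

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Sketch of the normalization steps documented above."""
    text = unicodedata.normalize("NFKC", text)    # compatibility chars: UF₆ -> UF6
    text = text.casefold()                        # case-insensitive matching
    text = re.sub(r"~(\w+)~", r"\1", text)        # markdown subscript: UF~6~ -> UF6
    text = re.sub(r"[\u2018\u2019\u201C\u201D\u00AB\u00BB\u2032\u2033]", "'", text)  # quotes -> '
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace runs
    return text.rstrip(".")                       # trailing period stripping


print(normalize_text("UF₆  is “stable”."))  # uf6 is 'stable'
```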
Tip: Use inline regex flags to modify matching behavior:

- `(?s)pattern`: dotall mode (`.` matches newlines, useful for multi-line content)
- `(?m)pattern`: multiline mode (`^` and `$` match line boundaries)
- `(?ms)pattern`: combine multiple flags
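Since normalization already lowercases the text, `(?i)` adds nothing, but the other inline flags behave exactly as in plain Python `re`:

```python
import re

# (?s) dotall: '.' also matches newlines, so a pattern can span lines
assert re.search(r"(?s)start.*end", "start\nmiddle\nend")

# (?m) multiline: '^' and '$' match at each line boundary
assert re.search(r"(?m)^middle$", "start\nmiddle\nend")

# (?ms) combines both flags in one group
assert re.search(r"(?ms)^start.*end$", "start\nmiddle\nend")
```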
Example
from_csv_line
classmethod
¶
Create a RegexInSourcesEvaluator from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `bool` | Expected evaluation result | *required* |
| `tags` | `set[str]` | Comma-separated tags string | *required* |
| `check` | `str` | Regex pattern to search for in document contents | *required* |
| `**kwargs` | `Any` | Additional attributes for the evaluator | `{}` |
Source code in src/ragpill/evaluators.py
RegexInDocumentMetadataEvaluator¶
ragpill.evaluators.RegexInDocumentMetadataEvaluator
dataclass
¶
RegexInDocumentMetadataEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.', metadata_key, pattern)
Bases: SourcesBaseEvaluator
Evaluator to check if a regex pattern is found in a specific metadata field of any retrieved document.
The documents are retrieved from the mlflow trace and include documents from retriever, tool, and reranker spans.
Note: When created from CSV, 'check' must be a JSON string with 'pattern' and 'key' fields. The evaluator then checks whether any document among the used sources has metadata[key] matching the regex pattern.
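The check described in this note can be illustrated with a small sketch. Representing documents as plain dicts and skipping text normalization are simplifications; the real evaluator reads documents from the mlflow trace:

```python
import json
import re


def metadata_matches(check: str, documents: list[dict]) -> bool:
    """Sketch of the documented check: parse the JSON 'check' string and
    test whether any document's metadata[key] matches the regex pattern."""
    spec = json.loads(check)
    pattern, key = spec["pattern"], spec["key"]
    return any(
        re.search(pattern, str(doc.get("metadata", {}).get(key, "")))
        for doc in documents
    )


docs = [{"metadata": {"source": "annual_report.pdf"}}]
print(metadata_matches('{"pattern": "report", "key": "source"}', docs))  # True
```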
Both the pattern and metadata values are normalized before matching via `_normalize_text`, which applies case-folding (`str.casefold`), Unicode NFKC, whitespace collapsing, and quote normalization. Because text is already case-folded, the `(?i)` flag is redundant.
Inline regex flags still work:

- `(?s)pattern`: dotall mode (`.` matches newlines, useful for multi-line metadata values)
- `(?m)pattern`: multiline mode (`^` and `$` match line boundaries)
- `(?ms)pattern`: combine multiple flags
Example
from_csv_line
classmethod
¶
Create a RegexInDocumentMetadataEvaluator from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `bool` | Expected evaluation result | *required* |
| `tags` | `set[str]` | Comma-separated tags string | *required* |
| `check` | `str` | JSON with two keys, "pattern" and "key": the regex pattern to search for in the given document metadata key | *required* |
| `**kwargs` | `Any` | Additional attributes for the evaluator | `{}` |
Source code in src/ragpill/evaluators.py
RegexInOutputEvaluator¶
ragpill.evaluators.RegexInOutputEvaluator
dataclass
¶
RegexInOutputEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, pattern)
Bases: BaseEvaluator
Check whether a regex pattern matches the stringified output.
Both the pattern and the output are normalized before matching via
_normalize_text, which applies case-folding (str.casefold),
Unicode NFKC, whitespace collapsing, and quote normalization.
Because text is already case-folded, the (?i) flag is redundant.
CSV usage examples:

- `check="error|failure"`
- `check='{"pattern": "success"}'`
from_csv_line
classmethod
¶
Create a RegexInOutputEvaluator from a CSV line.
Source code in src/ragpill/evaluators.py
LiteralQuoteEvaluator¶
ragpill.evaluators.LiteralQuoteEvaluator
dataclass
¶
Bases: SourcesBaseEvaluator
Verify that all markdown quotes in the output appear literally in source documents.
This evaluator ensures citations are accurate by checking that any text quoted
in markdown blockquotes (lines starting with >) actually appears in the
retrieved source documents. This is particularly valuable for RAG systems where
accuracy of quoted material is critical.
The evaluator:

- Extracts all markdown blockquotes (lines starting with `>`) from the output
- Cleans quotes by removing quotation marks and normalizing whitespace
- Verifies each quote appears literally (ignoring whitespace) in source documents
- Reports any missing quotes with their referenced filenames when available
Only lines starting with > (after leading whitespace) are considered markdown
quotes. Regular quoted text like "this" or 'this' is ignored.
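A minimal sketch of the extraction and verification steps above, assuming each blockquote line is treated independently and skipping file-reference parsing (the real evaluator is more thorough):

```python
import re


def extract_quotes(output: str) -> list[str]:
    """Collect markdown blockquote lines (leading whitespace allowed) and
    strip surrounding quotation marks."""
    quotes = []
    for line in output.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            text = stripped[1:].strip().strip("\"'\u201c\u201d\u2018\u2019")
            if text:  # empty quotes (after cleaning) are skipped
                quotes.append(text)
    return quotes


def quote_in_sources(quote: str, sources: list[str]) -> bool:
    """Whitespace-insensitive literal containment check (no fuzzy matching)."""
    def collapse(s: str) -> str:
        return re.sub(r"\s+", " ", s)
    return any(collapse(quote) in collapse(src) for src in sources)


output = 'The report states:\n> "exact wording from the source"'
sources = ["... contains the exact wording from   the source in context ..."]
print(all(quote_in_sources(q, sources) for q in extract_quotes(output)))  # True
```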
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `bool` | Expected evaluation result (default: True) | `True` |
| `tags` | `set[str] \| None` | Set of tags for categorizing this evaluator | `None` |
| `attributes` | `dict[str, Any] \| None` | Additional attributes for the evaluator | `None` |
Example

```python
from ragpill.evaluators import LiteralQuoteEvaluator

# Create evaluator
evaluator = LiteralQuoteEvaluator(
    expected=True,
    tags={"quotation", "accuracy"},
)

# Output with markdown quote
output = '''
The report states:
> "'no longer outstanding at this stage' does not mean 'resolved'."
(File: [report.txt](link), Paragraph: 38)
'''

# The evaluator will verify this quote exists in the source documents
```
Markdown Quote Format
The evaluator recognizes standard markdown blockquotes:
Note

- Whitespace differences between quotes and source text are ignored
- Quotation marks (`"`, `'`, `‘`, `’`, `“`, `”`) are stripped before comparison
- File references in the format `(File: [filename](...))` are extracted and included in error messages
- Empty quotes (after cleaning) are skipped
- Quotes must appear literally in source documents (no fuzzy matching)
See Also
SourcesBaseEvaluator:
Base class that retrieves source documents from MLflow traces
RegexInSourcesEvaluator:
Similar evaluator using regex patterns instead of literal quotes
Source code in src/ragpill/evaluators.py
from_csv_line
classmethod
¶
Create a LiteralQuoteEvaluator from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expected
|
bool
|
Expected evaluation result |
required |
tags
|
set[str]
|
Comma-separated tags string |
required |
check
|
str
|
Not used for this evaluator (can be empty) |
required |
**kwargs
|
Any
|
Additional attributes for the evaluator |
{}
|
Source code in src/ragpill/evaluators.py
run
async
¶
Override run to have access to both output and documents.
Source code in src/ragpill/evaluators.py
HasQuotesEvaluator¶
ragpill.evaluators.HasQuotesEvaluator
dataclass
¶
HasQuotesEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, min_quotes=1, max_quotes=-1)
Bases: BaseEvaluator
Check if the output contains a minimum (and optionally maximum) number of markdown quotes.
This evaluator verifies that the output includes at least a specified number
of markdown blockquotes (lines starting with >). Useful for ensuring responses
include citations, evidence, or quoted material.
Only lines starting with > (after leading whitespace) are considered markdown
quotes. Regular quoted text like "this" or 'this' is ignored.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `min_quotes` | `int` | Minimum number of quotes required | `1` |
| `max_quotes` | `int` | Maximum number of quotes allowed (-1 means no maximum) | `-1` |
| `expected` | `bool \| None` | Expected evaluation result | `None` |
| `tags` | `set[str]` | Set of tags for categorizing this evaluator | `set()` |
| `attributes` | `dict[str, Any]` | Additional attributes for the evaluator | `dict()` |
Example

```python
from ragpill.evaluators import HasQuotesEvaluator

# Require at least 2 quotes
evaluator = HasQuotesEvaluator(
    min_quotes=2,
    expected=True,
    tags={"quotation", "format"},
)

# Require between 2 and 5 quotes
evaluator = HasQuotesEvaluator(
    min_quotes=2,
    max_quotes=5,
    expected=True,
    tags={"quotation", "format"},
)

# This output has 2 quotes and will pass
output = '''
The report states two key points:
> "First important point."
And also:
> "Second important point."
'''
```
Note

- Multi-line quotes (consecutive lines with `>`) are counted as one quote
- Empty quotes (only whitespace after `>`) are not counted
- The evaluator passes if `min_quotes <= num_quotes <= max_quotes` (no maximum if `max_quotes=-1`)
- Set `expected=False` to verify that the number of quotes is NOT within the specified range
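The counting and pass rules in the note above can be sketched as follows (a simplified illustration, not the library implementation):

```python
def count_quote_blocks(output: str) -> int:
    """Count markdown blockquotes: consecutive '>' lines form one quote,
    and lines that are empty after the '>' end the current quote."""
    count = 0
    in_quote = False
    for line in output.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">") and stripped[1:].strip():
            if not in_quote:
                count += 1
            in_quote = True
        else:
            in_quote = False
    return count


def passes(num_quotes: int, min_quotes: int = 1, max_quotes: int = -1) -> bool:
    """Pass condition: min_quotes <= num_quotes <= max_quotes,
    with max_quotes=-1 meaning no upper bound."""
    return num_quotes >= min_quotes and (max_quotes == -1 or num_quotes <= max_quotes)


text = "> first quote line\n> continues here\n\nprose\n> second quote"
print(count_quote_blocks(text))  # 2
print(passes(2, min_quotes=2, max_quotes=5))  # True
```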
See Also
LiteralQuoteEvaluator:
Verifies quotes appear literally in source documents
BaseEvaluator:
Base class for all evaluators
from_csv_line
classmethod
¶
Create a HasQuotesEvaluator from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expected
|
bool
|
Expected evaluation result |
required |
tags
|
set[str]
|
Comma-separated tags string |
required |
check
|
str
|
Either an integer for min_quotes, or JSON with 'min_quotes' and optionally 'max_quotes'. If empty, defaults to min_quotes=1, max_quotes=-1. |
required |
**kwargs
|
Any
|
Additional attributes for the evaluator |
{}
|
Example
In CSV, use `check="3"` to require at least 3 quotes, or `check='{"min_quotes": 2, "max_quotes": 5}'` to require between 2 and 5 quotes.
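The check parsing described above can be sketched like this; `parse_quote_check` is a hypothetical helper, not the library's code:

```python
import json


def parse_quote_check(check: str) -> tuple[int, int]:
    """Sketch of the documented rule: an integer sets min_quotes; JSON may
    set min_quotes and optionally max_quotes; empty uses defaults (1, -1)."""
    if not check.strip():
        return 1, -1
    try:
        data = json.loads(check)
    except json.JSONDecodeError:
        return int(check), -1
    if isinstance(data, int):
        return data, -1
    return data.get("min_quotes", 1), data.get("max_quotes", -1)


print(parse_quote_check("3"))                                   # (3, -1)
print(parse_quote_check('{"min_quotes": 2, "max_quotes": 5}'))  # (2, 5)
```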
Source code in src/ragpill/evaluators.py
run
async
¶
Check if output contains the required number of quotes (within min/max bounds).
Source code in src/ragpill/evaluators.py
Base Evaluators¶
These are evaluators intended to be inherited from. See Create custom evaluators for a tutorial.
WrappedPydanticEvaluator¶
ragpill.evaluators.WrappedPydanticEvaluator
dataclass
¶
WrappedPydanticEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, pydantic_evaluator)
Bases: BaseEvaluator
Wrapper to use any pydantic-evals Evaluator as a ragpill BaseEvaluator. See https://ai.pydantic.dev/evals/evaluators/overview/ for a list. Limitation: span-based evaluators are not supported, as logfire is not yet supported in ragpill.
Note: If you want to use pydantic-evals evaluators in your csv-defined testsets, you need to define a subclass of this class that implements from_csv_line to create the specific pydantic evaluator.
Attributes:

| Name | Type | Description |
|---|---|---|
| `pydantic_evaluator` | `Evaluator` | The pydantic-evals Evaluator instance to wrap. |
Example

```python
from pydantic_evals.evaluators import SomePydanticEvaluator
from ragpill.base import WrappedPydanticEvaluator

ragpill_evaluator = WrappedPydanticEvaluator(
    pydantic_evaluator=SomePydanticEvaluator(...),
    expected=True,
    tags={"tag1", "tag2"},
    attributes={"attr1": "value1"},
)
```
SpanBaseEvaluator¶
ragpill.evaluators.SpanBaseEvaluator
dataclass
¶
SpanBaseEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key)
Bases: BaseEvaluator
This base class retrieves the spans from the mlflow trace, allowing subclasses to implement evaluation logic based on spans. Why is this useful? See https://ai.pydantic.dev/evals/evaluators/span-based/
Why Span-Based Evaluation?

Traditional evaluators assess task inputs and outputs. For simple tasks, this may be sufficient: if the output is correct, the task succeeded. But for complex multi-step agents, the process matters as much as the result:

- A correct answer reached incorrectly: an agent might produce the right output by accident (e.g. guessing, using cached data when it should have searched, or calling the wrong tools but getting lucky)
- Verification of required behaviors: you need to ensure specific tools were called, certain code paths executed, or particular patterns followed
- Performance and efficiency: the agent should reach the answer efficiently, without unnecessary tool calls, infinite loops, or excessive retries
- Safety and compliance: it is critical to verify that dangerous operations weren't attempted, sensitive data wasn't accessed inappropriately, and guardrails weren't bypassed

Real-World Scenarios

Span-based evaluation is particularly valuable for:

- RAG systems: verify documents were retrieved and reranked before generation, not just that the answer included citations
- Multi-agent coordination: ensure the orchestrator delegated to the right specialist agents in the correct order
- Tool-calling agents: confirm specific tools were used (or avoided), and in the expected sequence
- Debugging and regression testing: catch behavioral regressions where outputs remain correct but the internal logic deteriorates
- Production alignment: ensure your evaluation assertions operate on the same telemetry data captured in production, so eval insights directly translate to production monitoring
How It Works

When tracing the mlflow experiment, a hash of the input is stored as a span attribute (input_key). The evaluator uses this to find the trace for the given input of the running experiment.

Subclasses can then assert on span properties, for example:

- Which tools were called: `HasMatchingSpan(query={'name_contains': 'search_tool'})`
- Code paths executed: verify specific functions ran or particular branches were taken
- Timing characteristics: check that operations complete within SLA bounds
- Error conditions: detect retries, fallbacks, or specific failure modes
- Execution structure: verify parent-child relationships, delegation patterns, or execution order

This creates a fundamentally different evaluation paradigm: you're testing behavioral contracts, not just input-output relationships.
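The input-key mechanism described above can be sketched as follows. `input_to_key` is a hypothetical stand-in for `default_input_to_key`; the real function's serialization and hash choice may differ:

```python
import hashlib
import json


def input_to_key(inputs) -> str:
    """Hypothetical sketch: derive a stable key from the JSON-serialized
    task input, so the same input always maps to the same key."""
    payload = json.dumps(inputs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same input always yields the same key, so the evaluator can match
# a test-case input to the span recorded during the experiment run.
key = input_to_key({"question": "What is the capital of France?"})
assert key == input_to_key({"question": "What is the capital of France?"})
```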
SourcesBaseEvaluator¶
ragpill.evaluators.SourcesBaseEvaluator
dataclass
¶
SourcesBaseEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.')
Bases: SpanBaseEvaluator
This base class retrieves the source documents from the mlflow trace.
Note: only documents retrieved from a retriever, reranker, or tool span are considered sources.
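The source-collection rule above can be sketched as follows. The span dict shape and the span-type names here are assumptions for illustration; the real evaluator reads spans from the mlflow trace:

```python
def collect_sources(spans: list[dict]) -> list[dict]:
    """Sketch of the documented rule: keep documents only from retriever,
    reranker, and tool spans; documents from other spans are ignored."""
    allowed = {"RETRIEVER", "RERANKER", "TOOL"}
    return [
        doc
        for span in spans
        if span.get("span_type") in allowed
        for doc in span.get("documents", [])
    ]


spans = [
    {"span_type": "RETRIEVER", "documents": [{"content": "doc A"}]},
    {"span_type": "LLM", "documents": [{"content": "ignored"}]},
    {"span_type": "TOOL", "documents": [{"content": "doc B"}]},
]
print([d["content"] for d in collect_sources(spans)])  # ['doc A', 'doc B']
```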