Evaluators¶
This module provides pre-built evaluators for common evaluation tasks.
Because tasks can have arbitrary input and output types, these evaluators generally coerce inputs and outputs to strings and use string-based evaluation methods (LLM judges, regex checks, etc.).
You can also create your own custom evaluators by inheriting from BaseEvaluator and implementing the run() method. See Create custom evaluators for a tutorial.
LLMJudge¶
ragpill.evaluators.LLMJudge
dataclass
¶
LLMJudge(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, rubric, model=_get_default_judge_llm(), include_input=False)
Bases: BaseEvaluator
The LLMJudge evaluator uses a language model to judge whether an output meets a specified rubric.
A rubric is usually one of the following:

- A fact that the output should (or should not) contain, e.g. `rubric="Output must contain the fact that Paris is the capital of France."`
- A constraint on the style of the output, e.g. `rubric="Output should be in a formal tone."` or `rubric="Output should be in German."`
Note: Avoid complex instructions in the rubric, as the model may not follow them reliably. Instead, try to break it down into multiple instances of the LLMJudge.
metadata
property
¶
Build metadata from evaluator fields.
The default from BaseEvaluator is overridden here because the non-picklable model field cannot otherwise be excluded.
from_csv_line
classmethod
¶
Create an LLMJudge from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
For LLMJudge, the check parameter is treated as the rubric text. If check is a JSON object with a 'rubric' key, that value is used. Otherwise, the entire check string is used as the rubric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `bool` | Expected evaluation result | *required* |
| `tags` | `set[str]` | Comma-separated tags string | *required* |
| `check` | `str` | Rubric text or JSON with a 'rubric' key | *required* |
| `get_llm` | `Callable[[], Model]` | Callable that returns a Model instance | `_get_default_judge_llm` |
| `**kwargs` | `Any` | Additional parameters (can include 'model' to override the default) | `{}` |
Note: The model parameter must be provided. It should come from one of:

- Dependency injection (e.g. a module-level or class-level settings object)
- The check column as JSON: `{"rubric": "...", "model": "openai:gpt-4o"}`
- An additional CSV column named 'model'
Source code in src/ragpill/evaluators.py
RegexInSourcesEvaluator¶
ragpill.evaluators.RegexInSourcesEvaluator
dataclass
¶
RegexInSourcesEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.', pattern)
Bases: SourcesBaseEvaluator
Evaluator to check if a regex pattern is found in the content of any source document. The documents are retrieved from the mlflow trace and include documents from retriever, tool, and reranker spans.
Both the pattern and document contents are normalized before matching via `_normalize_text`, which applies:

- Case-folding: all text is lowercased (`str.casefold`), so matching is always case-insensitive. Using the `(?i)` flag is therefore redundant.
- Unicode NFKC: compatibility characters are unified (e.g. `UF₆` ↔ `UF6`).
- Whitespace collapsing: runs of whitespace become a single space.
- Quote normalization: curly quotes, guillemets, primes, etc. are replaced with a straight single quote `'`.
- Markdown subscript stripping: e.g. `UF~6~` → `UF6`.
- Trailing period stripping.
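The normalization steps above can be approximated with the Python standard library. This is a minimal sketch of the described behavior, not the library's actual `_normalize_text`; the exact regexes and step order are assumptions:

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Sketch of the normalization steps documented above."""
    text = unicodedata.normalize("NFKC", text)    # compatibility chars: UF₆ -> UF6
    text = text.casefold()                        # case-insensitive matching
    text = re.sub(r"~(\w+)~", r"\1", text)        # markdown subscript: UF~6~ -> UF6
    text = re.sub(r"[\u2018\u2019\u201C\u201D\u00AB\u00BB\u2032\u2033]", "'", text)  # quotes -> '
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace runs
    return text.rstrip(".")                       # trailing period stripping


print(normalize_text("UF₆  is “stable”."))  # uf6 is 'stable'
```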
Tip: Use inline regex flags to modify matching behavior:

- `(?s)pattern`: dotall mode (`.` matches newlines, useful for multi-line content)
- `(?m)pattern`: multiline mode (`^` and `$` match line boundaries)
- `(?ms)pattern`: combine multiple flags
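Since normalization already lowercases the text, `(?i)` adds nothing, but the other inline flags behave exactly as in plain Python `re`:

```python
import re

# (?s) dotall: '.' also matches newlines, so a pattern can span lines
assert re.search(r"(?s)start.*end", "start\nmiddle\nend")

# (?m) multiline: '^' and '$' match at each line boundary
assert re.search(r"(?m)^middle$", "start\nmiddle\nend")

# (?ms) combines both flags in one group
assert re.search(r"(?ms)^start.*end$", "start\nmiddle\nend")
```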
Example
from_csv_line
classmethod
¶
Create a RegexInSourcesEvaluator from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `bool` | Expected evaluation result | *required* |
| `tags` | `set[str]` | Comma-separated tags string | *required* |
| `check` | `str` | Regex pattern to search for in document contents | *required* |
| `**kwargs` | `Any` | Additional attributes for the evaluator | `{}` |
Source code in src/ragpill/evaluators.py
RegexInDocumentMetadataEvaluator¶
ragpill.evaluators.RegexInDocumentMetadataEvaluator
dataclass
¶
RegexInDocumentMetadataEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.', metadata_key, pattern)
Bases: SourcesBaseEvaluator
Evaluator to check if a regex pattern is found in a specific metadata field of any retrieved document.
The documents are retrieved from the mlflow trace and include documents from retriever, tool, and reranker spans.
Note: When created from CSV, 'check' must be a JSON string with 'pattern' and 'key' fields. The evaluator then checks whether any document among the used sources has metadata[key] matching the regex pattern.
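The check described in this note can be illustrated with a small sketch. Representing documents as plain dicts and skipping text normalization are simplifications; the real evaluator reads documents from the mlflow trace:

```python
import json
import re


def metadata_matches(check: str, documents: list[dict]) -> bool:
    """Sketch of the documented check: parse the JSON 'check' string and
    test whether any document's metadata[key] matches the regex pattern."""
    spec = json.loads(check)
    pattern, key = spec["pattern"], spec["key"]
    return any(
        re.search(pattern, str(doc.get("metadata", {}).get(key, "")))
        for doc in documents
    )


docs = [{"metadata": {"source": "annual_report.pdf"}}]
print(metadata_matches('{"pattern": "report", "key": "source"}', docs))  # True
```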
Both the pattern and metadata values are normalized before matching via `_normalize_text`, which applies case-folding (`str.casefold`), Unicode NFKC, whitespace collapsing, and quote normalization. Because text is already case-folded, the `(?i)` flag is redundant.
Inline regex flags still work:

- `(?s)pattern`: dotall mode (`.` matches newlines, useful for multi-line metadata values)
- `(?m)pattern`: multiline mode (`^` and `$` match line boundaries)
- `(?ms)pattern`: combine multiple flags
Example
from_csv_line
classmethod
¶
Create a RegexInDocumentMetadataEvaluator from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `bool` | Expected evaluation result | *required* |
| `tags` | `set[str]` | Comma-separated tags string | *required* |
| `check` | `str` | JSON with two keys, "pattern" and "key": the regex pattern to search for in the given document metadata key | *required* |
| `**kwargs` | `Any` | Additional attributes for the evaluator | `{}` |
Source code in src/ragpill/evaluators.py
RegexInOutputEvaluator¶
ragpill.evaluators.RegexInOutputEvaluator
dataclass
¶
RegexInOutputEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, pattern)
Bases: BaseEvaluator
Check whether a regex pattern matches the stringified output.
Both the pattern and the output are normalized before matching via
_normalize_text, which applies case-folding (str.casefold),
Unicode NFKC, whitespace collapsing, and quote normalization.
Because text is already case-folded, the (?i) flag is redundant.
CSV usage examples:

- `check="error|failure"`
- `check='{"pattern": "success"}'`
from_csv_line
classmethod
¶
Create a RegexInOutputEvaluator from a CSV line.
Source code in src/ragpill/evaluators.py
LiteralQuoteEvaluator¶
ragpill.evaluators.LiteralQuoteEvaluator
dataclass
¶
Bases: SourcesBaseEvaluator
Verify that all markdown quotes in the output appear literally in source documents.
This evaluator ensures citations are accurate by checking that any text quoted
in markdown blockquotes (lines starting with >) actually appears in the
retrieved source documents. This is particularly valuable for RAG systems where
accuracy of quoted material is critical.
The evaluator:

- Extracts all markdown blockquotes (lines starting with `>`) from the output
- Cleans quotes by removing quotation marks and normalizing whitespace
- Verifies each quote appears literally (ignoring whitespace) in source documents
- Reports any missing quotes with their referenced filenames when available
Only lines starting with > (after leading whitespace) are considered markdown
quotes. Regular quoted text like "this" or 'this' is ignored.
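A minimal sketch of the extraction and verification steps above, assuming each blockquote line is treated independently and skipping file-reference parsing (the real evaluator is more thorough):

```python
import re


def extract_quotes(output: str) -> list[str]:
    """Collect markdown blockquote lines (leading whitespace allowed) and
    strip surrounding quotation marks."""
    quotes = []
    for line in output.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            text = stripped[1:].strip().strip("\"'\u201c\u201d\u2018\u2019")
            if text:  # empty quotes (after cleaning) are skipped
                quotes.append(text)
    return quotes


def quote_in_sources(quote: str, sources: list[str]) -> bool:
    """Whitespace-insensitive literal containment check (no fuzzy matching)."""
    def collapse(s: str) -> str:
        return re.sub(r"\s+", " ", s)
    return any(collapse(quote) in collapse(src) for src in sources)


output = 'The report states:\n> "exact wording from the source"'
sources = ["... contains the exact wording from   the source in context ..."]
print(all(quote_in_sources(q, sources) for q in extract_quotes(output)))  # True
```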
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `expected` | `bool` | Expected evaluation result (default: True) | `True` |
| `tags` | `set[str] \| None` | Set of tags for categorizing this evaluator | `None` |
| `attributes` | `dict[str, Any] \| None` | Additional attributes for the evaluator | `None` |
Example

```python
from ragpill.evaluators import LiteralQuoteEvaluator

# Create evaluator
evaluator = LiteralQuoteEvaluator(
    expected=True,
    tags={"quotation", "accuracy"},
)

# Output with markdown quote
output = '''
The report states:
> "'no longer outstanding at this stage' does not mean 'resolved'."
(File: [report.txt](link), Paragraph: 38)
'''

# The evaluator will verify this quote exists in the source documents
```
Markdown Quote Format
The evaluator recognizes standard markdown blockquotes:
Note

- Whitespace differences between quotes and source text are ignored
- Quotation marks (`"`, `'`, `‘`, `’`, `“`, `”`) are stripped before comparison
- File references in the format `(File: [filename](...))` are extracted and included in error messages
- Empty quotes (after cleaning) are skipped
- Quotes must appear literally in source documents (no fuzzy matching)
See Also
SourcesBaseEvaluator:
Base class that retrieves source documents from MLflow traces
RegexInSourcesEvaluator:
Similar evaluator using regex patterns instead of literal quotes
Source code in src/ragpill/evaluators.py
from_csv_line
classmethod
¶
Create a LiteralQuoteEvaluator from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expected
|
bool
|
Expected evaluation result |
required |
tags
|
set[str]
|
Comma-separated tags string |
required |
check
|
str
|
Not used for this evaluator (can be empty) |
required |
**kwargs
|
Any
|
Additional attributes for the evaluator |
{}
|
Source code in src/ragpill/evaluators.py
run
async
¶
Override run to have access to both output and documents.
Source code in src/ragpill/evaluators.py
HasQuotesEvaluator¶
ragpill.evaluators.HasQuotesEvaluator
dataclass
¶
HasQuotesEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, min_quotes=1, max_quotes=-1)
Bases: BaseEvaluator
Check if the output contains a minimum (and optionally maximum) number of markdown quotes.
This evaluator verifies that the output includes at least a specified number
of markdown blockquotes (lines starting with >). Useful for ensuring responses
include citations, evidence, or quoted material.
Only lines starting with > (after leading whitespace) are considered markdown
quotes. Regular quoted text like "this" or 'this' is ignored.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `min_quotes` | `int` | Minimum number of quotes required | `1` |
| `max_quotes` | `int` | Maximum number of quotes allowed (-1 means no maximum) | `-1` |
| `expected` | `bool \| None` | Expected evaluation result | `None` |
| `tags` | `set[str]` | Set of tags for categorizing this evaluator | `set()` |
| `attributes` | `dict[str, Any]` | Additional attributes for the evaluator | `dict()` |
Example

```python
from ragpill.evaluators import HasQuotesEvaluator

# Require at least 2 quotes
evaluator = HasQuotesEvaluator(
    min_quotes=2,
    expected=True,
    tags={"quotation", "format"},
)

# Require between 2 and 5 quotes
evaluator = HasQuotesEvaluator(
    min_quotes=2,
    max_quotes=5,
    expected=True,
    tags={"quotation", "format"},
)

# This output has 2 quotes and will pass
output = '''
The report states two key points:
> "First important point."
And also:
> "Second important point."
'''
```
Note

- Multi-line quotes (consecutive lines with `>`) are counted as one quote
- Empty quotes (only whitespace after `>`) are not counted
- The evaluator passes if `min_quotes <= num_quotes <= max_quotes` (no maximum if `max_quotes=-1`)
- Set `expected=False` to verify that the number of quotes is NOT within the specified range
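The counting and pass rules in the note above can be sketched as follows (a simplified illustration, not the library implementation):

```python
def count_quote_blocks(output: str) -> int:
    """Count markdown blockquotes: consecutive '>' lines form one quote,
    and lines that are empty after the '>' end the current quote."""
    count = 0
    in_quote = False
    for line in output.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">") and stripped[1:].strip():
            if not in_quote:
                count += 1
            in_quote = True
        else:
            in_quote = False
    return count


def passes(num_quotes: int, min_quotes: int = 1, max_quotes: int = -1) -> bool:
    """Pass condition: min_quotes <= num_quotes <= max_quotes,
    with max_quotes=-1 meaning no upper bound."""
    return num_quotes >= min_quotes and (max_quotes == -1 or num_quotes <= max_quotes)


text = "> first quote line\n> continues here\n\nprose\n> second quote"
print(count_quote_blocks(text))  # 2
print(passes(2, min_quotes=2, max_quotes=5))  # True
```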
See Also
LiteralQuoteEvaluator:
Verifies quotes appear literally in source documents
BaseEvaluator:
Base class for all evaluators
from_csv_line
classmethod
¶
Create a HasQuotesEvaluator from a CSV line.
This method is used by the CSV testset loader to instantiate the evaluator.
See load_testset for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expected
|
bool
|
Expected evaluation result |
required |
tags
|
set[str]
|
Comma-separated tags string |
required |
check
|
str
|
Either an integer for min_quotes, or JSON with 'min_quotes' and optionally 'max_quotes'. If empty, defaults to min_quotes=1, max_quotes=-1. |
required |
**kwargs
|
Any
|
Additional attributes for the evaluator |
{}
|
Example
In CSV, use `check="3"` to require at least 3 quotes, or `check='{"min_quotes": 2, "max_quotes": 5}'` to require between 2 and 5 quotes.
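The check parsing described above can be sketched like this; `parse_quote_check` is a hypothetical helper, not the library's code:

```python
import json


def parse_quote_check(check: str) -> tuple[int, int]:
    """Sketch of the documented rule: an integer sets min_quotes; JSON may
    set min_quotes and optionally max_quotes; empty uses defaults (1, -1)."""
    if not check.strip():
        return 1, -1
    try:
        data = json.loads(check)
    except json.JSONDecodeError:
        return int(check), -1
    if isinstance(data, int):
        return data, -1
    return data.get("min_quotes", 1), data.get("max_quotes", -1)


print(parse_quote_check("3"))                                   # (3, -1)
print(parse_quote_check('{"min_quotes": 2, "max_quotes": 5}'))  # (2, 5)
```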
Source code in src/ragpill/evaluators.py
run
async
¶
Check if output contains the required number of quotes (within min/max bounds).
Source code in src/ragpill/evaluators.py
Base Evaluators¶
These are evaluators intended to be inherited from. See Create custom evaluators for a tutorial.
WrappedPydanticEvaluator¶
ragpill.evaluators.WrappedPydanticEvaluator
dataclass
¶
WrappedPydanticEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, pydantic_evaluator)
Bases: BaseEvaluator
Wrapper to use any pydantic-evals Evaluator as a ragpill BaseEvaluator. See https://ai.pydantic.dev/evals/evaluators/overview/ for a list. Limitation: span-based evaluators are not supported, as logfire is not yet supported in ragpill.
Note: If you want to use pydantic-evals evaluators in your csv-defined testsets, you need to define a subclass of this class that implements from_csv_line to create the specific pydantic evaluator.
Attributes:

| Name | Type | Description |
|---|---|---|
| `pydantic_evaluator` | `Evaluator` | The pydantic-evals Evaluator instance to wrap. |
Example

```python
from pydantic_evals.evaluators import SomePydanticEvaluator
from ragpill.base import WrappedPydanticEvaluator

ragpill_evaluator = WrappedPydanticEvaluator(
    pydantic_evaluator=SomePydanticEvaluator(...),
    expected=True,
    tags={"tag1", "tag2"},
    attributes={"attr1": "value1"},
)
```
SpanBaseEvaluator¶
ragpill.evaluators.SpanBaseEvaluator
dataclass
¶
SpanBaseEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key)
Bases: BaseEvaluator
This base class retrieves the spans from the mlflow trace, allowing subclasses to implement evaluation logic based on spans. Why is this useful? See https://ai.pydantic.dev/evals/evaluators/span-based/
Why Span-Based Evaluation?

Traditional evaluators assess task inputs and outputs. For simple tasks, this may be sufficient: if the output is correct, the task succeeded. But for complex multi-step agents, the process matters as much as the result:

- A correct answer reached incorrectly: an agent might produce the right output by accident (e.g. guessing, using cached data when it should have searched, or calling the wrong tools but getting lucky)
- Verification of required behaviors: you need to ensure specific tools were called, certain code paths executed, or particular patterns followed
- Performance and efficiency: the agent should reach the answer efficiently, without unnecessary tool calls, infinite loops, or excessive retries
- Safety and compliance: it is critical to verify that dangerous operations weren't attempted, sensitive data wasn't accessed inappropriately, and guardrails weren't bypassed

Real-World Scenarios

Span-based evaluation is particularly valuable for:

- RAG systems: verify documents were retrieved and reranked before generation, not just that the answer included citations
- Multi-agent coordination: ensure the orchestrator delegated to the right specialist agents in the correct order
- Tool-calling agents: confirm specific tools were used (or avoided), and in the expected sequence
- Debugging and regression testing: catch behavioral regressions where outputs remain correct but the internal logic deteriorates
- Production alignment: ensure your evaluation assertions operate on the same telemetry data captured in production, so eval insights directly translate to production monitoring
How It Works

When tracing the mlflow experiment, a hash of the input is stored as a span attribute (input_key). The evaluator uses this to find the trace for the given input of the running experiment.

Subclasses can then assert on span properties, for example:

- Which tools were called: `HasMatchingSpan(query={'name_contains': 'search_tool'})`
- Code paths executed: verify specific functions ran or particular branches were taken
- Timing characteristics: check that operations complete within SLA bounds
- Error conditions: detect retries, fallbacks, or specific failure modes
- Execution structure: verify parent-child relationships, delegation patterns, or execution order

This creates a fundamentally different evaluation paradigm: you're testing behavioral contracts, not just input-output relationships.
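The input-key mechanism described above can be sketched as follows. `input_to_key` is a hypothetical stand-in for `default_input_to_key`; the real function's serialization and hash choice may differ:

```python
import hashlib
import json


def input_to_key(inputs) -> str:
    """Hypothetical sketch: derive a stable key from the JSON-serialized
    task input, so the same input always maps to the same key."""
    payload = json.dumps(inputs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same input always yields the same key, so the evaluator can match
# a test-case input to the span recorded during the experiment run.
key = input_to_key({"question": "What is the capital of France?"})
assert key == input_to_key({"question": "What is the capital of France?"})
```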
SourcesBaseEvaluator¶
ragpill.evaluators.SourcesBaseEvaluator
dataclass
¶
SourcesBaseEvaluator(evaluation_name=uuid4(), expected=None, attributes=dict(), tags=set(), is_global=False, *, _mlflow_settings=None, _mlflow_experiment_id=None, _mlflow_run_id=None, inputs_to_key_function=default_input_to_key, evaluation_function, custom_reason_true='Evaluation function returned True.', custom_reason_false='Evaluation function returned False.')
Bases: SpanBaseEvaluator
This base class retrieves the source documents from the mlflow trace.
Note: only documents retrieved from a retriever, reranker, or tool span are considered sources.
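The source-collection rule above can be sketched as follows. The span dict shape and the span-type names here are assumptions for illustration; the real evaluator reads spans from the mlflow trace:

```python
def collect_sources(spans: list[dict]) -> list[dict]:
    """Sketch of the documented rule: keep documents only from retriever,
    reranker, and tool spans; documents from other spans are ignored."""
    allowed = {"RETRIEVER", "RERANKER", "TOOL"}
    return [
        doc
        for span in spans
        if span.get("span_type") in allowed
        for doc in span.get("documents", [])
    ]


spans = [
    {"span_type": "RETRIEVER", "documents": [{"content": "doc A"}]},
    {"span_type": "LLM", "documents": [{"content": "ignored"}]},
    {"span_type": "TOOL", "documents": [{"content": "doc B"}]},
]
print([d["content"] for d in collect_sources(spans)])  # ['doc A', 'doc B']
```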