Evaluators¶
What are Evaluators?¶
Evaluators are the core components that check whether an LLM output meets specified criteria (the checked output can include the sources used to create it, tool calls, etc.). Each evaluator performs a specific type of check and returns a pass/fail result with reasoning.
Built-in Evaluators¶
See Evaluators
Creating Custom Evaluators¶
Basic Pattern¶
All custom evaluators must:
1. Inherit from BaseEvaluator or one of the other provided base evaluator classes
2. Implement from_csv_line() class method with standard signature
3. Implement async run() method
from typing import Any

from ragpill.base import BaseEvaluator, EvaluatorMetadata
from pydantic_evals.evaluators import EvaluationReason
from pydantic_evals.evaluators.context import EvaluatorContext


class MyEvaluator(BaseEvaluator):
    """Description of what this evaluator checks."""

    @classmethod
    def from_csv_line(
        cls,
        expected: bool,
        tags: set[str],
        check: str,
        **kwargs: Any,
    ):
        """Create evaluator from CSV row data.

        This class method is required for CSV integration.
        The signature must be exactly this - do not add custom parameters here.
        Use the 'check' parameter for per-instance config (see examples below).
        """
        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,
        )

    async def run(
        self,
        ctx: EvaluatorContext[object, object, EvaluatorMetadata],
    ) -> EvaluationReason:
        # Your evaluation logic
        passed = self._check_condition(ctx.output)
        return EvaluationReason(
            value=passed,
            reason=f"Explanation of why it {'passed' if passed else 'failed'}",
        )

    def _check_condition(self, output: str) -> bool:
        # Helper method
        return True
Parameterization Patterns¶
There are two ways to parameterize custom evaluators:
Pattern 1: Environment Variables (for shared configuration)¶
Use this for configuration shared across all instances (API keys, global thresholds, etc.):
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import SecretStr


class LengthEvaluatorSettings(BaseSettings):
    """Settings loaded from environment variables."""

    model_config = SettingsConfigDict(env_prefix='LENGTH_EVAL_')

    api_key: SecretStr
    min_length: int = 10
    max_length: int = 1000


class LengthEvaluator(BaseEvaluator):
    """Checks if output length is within bounds from settings."""

    settings: LengthEvaluatorSettings

    @classmethod
    def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any):
        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,
            settings=LengthEvaluatorSettings(),
        )

    async def run(
        self,
        ctx: EvaluatorContext[object, object, EvaluatorMetadata],
    ) -> EvaluationReason:
        length = len(ctx.output)
        passed = self.settings.min_length <= length <= self.settings.max_length
        return EvaluationReason(
            value=passed,
            reason=f"Length {length} (range: {self.settings.min_length}-{self.settings.max_length})",
        )

# Set environment variables:
# export LENGTH_EVAL_MIN_LENGTH=50
# export LENGTH_EVAL_MAX_LENGTH=500
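The prefix-to-field mapping above can be mimicked with a stdlib-only sketch. Note that `load_length_settings` and its defaults are hypothetical; pydantic-settings does the real parsing and validation:

```python
# Hypothetical stand-in for LengthEvaluatorSettings: shows how the
# LENGTH_EVAL_ prefix maps environment variables onto the two length fields.
def load_length_settings(env: dict[str, str]) -> dict[str, int]:
    prefix = "LENGTH_EVAL_"
    defaults = {"min_length": 10, "max_length": 1000}
    return {
        key: int(env.get(prefix + key.upper(), default))
        for key, default in defaults.items()
    }

settings = load_length_settings({"LENGTH_EVAL_MIN_LENGTH": "50"})
# settings == {"min_length": 50, "max_length": 1000}
```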
Pattern 2: JSON in check Column (for per-instance configuration)¶
Use this for parameters that vary per test case (regex patterns, specific values, etc.):
import json
import re


class RegexEvaluator(BaseEvaluator):
    """Checks if output matches a regex pattern from check column."""

    pattern: str

    @classmethod
    def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any):
        """Parse pattern from check column (plain text or JSON)."""
        try:
            check_dict = json.loads(check)
            if isinstance(check_dict, dict):
                pattern = check_dict.get('pattern', check)
            else:
                pattern = check
        except json.JSONDecodeError:
            # Plain text - use as pattern
            pattern = check

        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,
            pattern=pattern,
        )

    async def run(
        self,
        ctx: EvaluatorContext[object, object, EvaluatorMetadata],
    ) -> EvaluationReason:
        regex = re.compile(self.pattern, re.IGNORECASE)
        match = regex.search(ctx.output)
        passed = match is not None
        return EvaluationReason(
            value=passed,
            reason=f"Pattern '{self.pattern}' {'found' if passed else 'not found'}",
        )
# CSV examples:
# Plain text pattern:
# Question,test_type,expected,tags,check
# What is Python?,RegexEvaluator,true,tech,programming language
#
# JSON pattern with additional config:
# What is Python?,RegexEvaluator,true,tech,"{\"pattern\": \".*programming.*\"}"
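The try/except fallback in from_csv_line can be exercised on its own. This standalone sketch (`parse_pattern` is a hypothetical helper) shows how both CSV forms above resolve to the same pattern value:

```python
import json

def parse_pattern(check: str) -> str:
    # Mirrors the fallback parsing in RegexEvaluator.from_csv_line.
    try:
        parsed = json.loads(check)
    except json.JSONDecodeError:
        return check  # plain text - use the whole cell as the pattern
    if isinstance(parsed, dict):
        return parsed.get('pattern', check)
    return check  # valid JSON but not an object (e.g. "true")

assert parse_pattern("programming language") == "programming language"
assert parse_pattern('{"pattern": ".*programming.*"}') == ".*programming.*"
```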
Real-World Example: Built-in Evaluator¶
See the built-in RegexInDocumentMetadataEvaluator for a complete example that uses JSON configuration:
# From evaluators.py
@classmethod
def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any):
    """Create evaluator from CSV with JSON in check column."""
    try:
        check_dict = json.loads(check)
        if isinstance(check_dict, dict):
            pattern = check_dict.get('pattern')
            metadata_key = check_dict.get('key')
        else:
            raise ValueError("check must be a JSON object")
    except json.JSONDecodeError:
        raise ValueError(
            f"RegexInDocumentMetadataEvaluator requires 'check' to be a JSON string "
            f"with 'pattern' and 'key'. Got: {check}"
        )
    return cls(
        expected=expected,
        tags=tags,
        attributes=kwargs,
        pattern=pattern,
        metadata_key=metadata_key,
    )
# CSV usage:
# Question,test_type,expected,tags,check
# Query docs,RegexInDocumentMetadata,true,retrieval,"{\"pattern\": \".*2024.*\", \"key\": \"date\"}"
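The stricter contract (JSON object required, no plain-text fallback) can be sketched in isolation. Here `parse_metadata_check` is a hypothetical helper; the real method builds the evaluator instead of returning a tuple:

```python
import json

def parse_metadata_check(check: str) -> tuple[str, str]:
    # The check cell must be a JSON object with 'pattern' and 'key'.
    try:
        parsed = json.loads(check)
    except json.JSONDecodeError:
        raise ValueError(f"check must be a JSON object, got: {check!r}")
    if not isinstance(parsed, dict):
        raise ValueError("check must be a JSON object")
    return parsed['pattern'], parsed['key']

# The un-escaped cell value from the CSV row above:
assert parse_metadata_check('{"pattern": ".*2024.*", "key": "date"}') == (".*2024.*", "date")
```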
Custom Attributes¶
You can add custom attributes to evaluators by adding columns to your CSV:
Question,test_type,expected,tags,check,priority,category
What is X?,LLMJudge,true,factual,answer_correctness,high,science
What is Y?,RegexEvaluator,false,format,email_format,low,validation
These custom columns (like priority and category) are automatically:
1. Passed to each evaluator's attributes dict via the **kwargs in from_csv_line()
2. Available in your evaluator through self.attributes
Important: If all evaluators for a given question have the same value for an attribute, that attribute becomes part of the Test Case metadata and will be visible in MLflow tracking.
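How the extra columns end up in attributes can be sketched with the stdlib csv module. The reserved column set below is an assumption inferred from the table above:

```python
import csv
import io

rows = (
    "Question,test_type,expected,tags,check,priority,category\n"
    "What is X?,LLMJudge,true,factual,answer_correctness,high,science\n"
)
row = next(csv.DictReader(io.StringIO(rows)))

# Columns consumed by from_csv_line itself; everything else flows into **kwargs.
reserved = {"Question", "test_type", "expected", "tags", "check"}
attributes = {k: v for k, v in row.items() if k not in reserved}
# attributes == {"priority": "high", "category": "science"}
```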
# In code - extend default evaluators with your custom class
from ragpill.csv.testset import load_testset, default_evaluator_classes

evaluator_classes = default_evaluator_classes | {
    'MyEvaluator': MyEvaluator,
}

dataset = load_testset(
    csv_path="testset.csv",
    evaluator_classes=evaluator_classes,
)
# Access attributes in your evaluator
class MyEvaluator(BaseEvaluator):
    @classmethod
    def from_csv_line(cls, expected: bool, tags: set[str], check: str, **kwargs: Any):
        return cls(
            expected=expected,
            tags=tags,
            attributes=kwargs,  # Contains {priority: "high", category: "science"}
        )

    async def run(self, ctx):
        priority = self.attributes.get('priority', 'medium')
        # Use the attribute in your logic
        ...
Multiple Evaluators¶
You can attach multiple evaluators to a single test case:
case = Case(
    inputs=BaseTestInput(metadata=metadata),
    evaluators=[
        LLMJudge(...),         # Check correctness
        LengthEvaluator(...),  # Check length
        RegexEvaluator(...),   # Check format
    ],
)
expected Inheritance¶
When constructing evaluators programmatically, expected defaults to None. At evaluation time it is resolved via merge_metadata:
- Non-global evaluators: the evaluator's value wins; if None, it falls back to the case metadata.
- Global evaluators: the case metadata wins; if None, it falls back to the evaluator's value.
- If both are None, the final default is True.
This lets you set a case-wide default and override it only where needed:
Case(
    inputs="How many r's are in 'strawberry'?",
    metadata=TestCaseMetadata(expected=False),  # default for all evaluators
    evaluators=[
        RegexInOutputEvaluator(pattern="1"),  # inherits expected=False
        RegexInOutputEvaluator(pattern="2"),  # inherits expected=False
        RegexInOutputEvaluator(pattern="3", expected=True),  # overrides to True
        RegexInOutputEvaluator(pattern="4"),  # inherits expected=False
    ],
)
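The precedence rules can be condensed into a small sketch. `resolve_expected` is a hypothetical helper; merge_metadata's real implementation may differ in detail:

```python
def resolve_expected(evaluator_expected, case_expected, is_global=False):
    # Non-global: evaluator wins, case is the fallback; global: the reverse.
    if is_global:
        value = case_expected if case_expected is not None else evaluator_expected
    else:
        value = evaluator_expected if evaluator_expected is not None else case_expected
    return True if value is None else value

assert resolve_expected(None, False) is False  # inherits the case default
assert resolve_expected(True, False) is True   # evaluator override wins
assert resolve_expected(None, None) is True    # final default
```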
Evaluator Results¶
EvaluationReason Structure¶
Each evaluator's run() method returns an EvaluationReason:
EvaluationReason(
    value=True,               # Whether the check passed
    reason="Explanation...",  # Human-readable explanation
)
Best Practices¶
1. Clear Naming¶
Use descriptive names for your evaluators:
# Good
class EmailFormatEvaluator(BaseEvaluator): ...
class SentimentPositivityEvaluator(BaseEvaluator): ...
# Avoid
class Evaluator1(BaseEvaluator): ...
class CheckThing(BaseEvaluator): ...
2. Informative Reasons¶
Provide helpful explanations in reason:
# Good
reason = f"Found 3 of 5 required keywords: {found_keywords}"
# Less helpful
reason = "Failed"
3. Deterministic When Possible¶
Prefer deterministic checks over LLM judges when possible:
- Regex for format validation
- Length for size constraints
- Keyword matching for required terms
Use LLM judges for:
- Semantic correctness
- Tone and style
4. Document Your Evaluators¶
Add docstrings explaining what and why:
class KeywordEvaluator(BaseEvaluator):
    """Checks if specific keywords are present in the output.

    Useful for ensuring important terms are mentioned without
    requiring exact phrasing. Can check for any keyword (OR logic)
    or all keywords (AND logic).

    Args:
        keywords: List of keywords to check for
        require_all: If True, all keywords must be present (AND).
            If False, any keyword is sufficient (OR).
    """
Handling of Evaluator Failures¶
By default, if an evaluator raises an exception during run(), it will be treated as a failure (i.e., value=False) and the exception message will be recorded in the reason. This ensures that unexpected errors in evaluators do not crash the entire evaluation process and are properly logged for debugging.
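That default can be pictured as a wrapper around run(). This is a sketch with a stand-in Reason class; the actual wrapping lives inside the evaluation harness:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Reason:  # stand-in for EvaluationReason in this sketch
    value: bool
    reason: str

async def safe_run(run_coro):
    # Exceptions from an evaluator become ordinary failures, not crashes.
    try:
        return await run_coro
    except Exception as exc:
        return Reason(value=False, reason=f"Evaluator raised: {exc!r}")

async def broken_run():
    raise RuntimeError("boom")

result = asyncio.run(safe_run(broken_run()))
# result.value is False and result.reason records the exception
```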