Skip to content

Full Tutorial

# For users following this tutorial:
# !pip install ragpill langchain-huggingface sentence-transformers ipywidgets
# Or if you're developing locally with this repo:
# !uv sync --group docs

Settings

Context: The settings are optional but an easy way to configure for standard use cases. If you need more control over the client that is passed to LLMJudge, you can also pass a pydantic-ai model directly.

To set up mlflow tracking and LLMJudge, you can create a .env file in the root of your project with the following content:

# --- MLflow ---
# URI of your MLflow tracking server (default: http://localhost:5000)
MLFLOW_RAGPILL_TRACKING_URI=http://localhost:5000

# Name of the MLflow experiment (default: ragpill_experiment)
MLFLOW_RAGPILL_EXPERIMENT_NAME=my_experiment

# Human-readable description attached to each MLflow run (default: RAGPill Evaluation Run)
MLFLOW_RAGPILL_RUN_DESCRIPTION=My Evaluation Run

# Optional: credentials for a secured MLflow server
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_password

# --- LLMJudge ---
# Model name passed to the LLM provider (default: gpt-4o)
RAGPILL_LLMJUDGE_MODEL_NAME=gpt-4o

# Sampling temperature for the judge model (default: 0.0)
RAGPILL_LLMJUDGE_TEMPERATURE=0.0

# Optional: base URL for a custom or self-hosted LLM API endpoint.
# Falls back to OPENAI_BASE_URL, then https://api.openai.com/v1
RAGPILL_LLMJUDGE_BASE_URL=https://your-llm-endpoint/v1

# Optional: API key for the LLM service.
# Falls back to OPENAI_API_KEY if not set.
RAGPILL_LLMJUDGE_API_KEY=sk-...

The cell below loads these settings and prints the resolved values so you can verify your configuration before running evaluations.

from ragpill.settings import MLFlowSettings, LLMJudgeSettings
from pprint import pprint
mlflow_settings = MLFlowSettings()
pprint(mlflow_settings)
llm_judge_settings = LLMJudgeSettings()
pprint(llm_judge_settings)
MLFlowSettings(ragpill_tracking_uri='http://127.0.0.1:5000', ragpill_experiment_name='ragpill_tutorial', tracking_username=None, tracking_password=None, ragpill_run_description='RAGPill Evaluation Run')
LLMJudgeSettings(model_name='openai/gpt-4o', temperature=0.0, base_url='https://openrouter.ai/api/v1', api_key=SecretStr('**********'))

Defining the Testset

For more details on best practices for creating a testset, see the Testsets guide.

Here we showcase the use of tags, attributes and "expected" attribute and how its propagated as default from Case to Evaluator.

from pydantic_evals import Dataset, Case
from ragpill import LLMJudge, TestCaseMetadata, RegexInOutputEvaluator
testset = Dataset(
    name="testset1",
    cases=[
        Case(
            inputs="What is the capital of France?",
            metadata=TestCaseMetadata(tags=["geography"], attributes={"author": "janine"}),
            evaluators=[
                LLMJudge(rubric="Paris is the capital of France"),
            ]
        ),
        Case(
            inputs="How many r`s are in 'strawberry'? State only the number",
            metadata=TestCaseMetadata(attributes={"author": "joel"}),
            evaluators=[
                RegexInOutputEvaluator(pattern="1", expected=False),
                RegexInOutputEvaluator(pattern="2", expected=False),
                RegexInOutputEvaluator(pattern="3", expected=True),
                RegexInOutputEvaluator(pattern="4", expected=False),
            ]
        )
    ]
)

Creating the Task

In the end we just need a function (or if async: a coroutine) that takes in an inputs-str (usually a question from the user) and returns a str (usually the answer). We will wrap a very simple pydantic-ai Agent

from openai import AsyncOpenAI
import os
from httpx import AsyncClient

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

api_key = os.environ["RAGPILL_LLMJUDGE_API_KEY"]
base_url = os.environ["RAGPILL_LLMJUDGE_BASE_URL"]
model_name = "qwen/qwen3.6-plus:free"
system_prompt = "You are a helpful assistant"
temperature = 0.2

http_client = AsyncClient()

openai_client = AsyncOpenAI(
    max_retries=3,
    base_url=base_url,
    api_key=api_key,
    http_client=http_client,
)
pyai_llm_model = OpenAIChatModel(
    model_name, provider=OpenAIProvider(openai_client=openai_client), settings={"temperature": temperature}
)

agent = Agent(model=pyai_llm_model, system_prompt=system_prompt)

Running the Evaluation

import mlflow
from mlflow import MlflowClient

tracking_uri = mlflow_settings.ragpill_tracking_uri
try:
    client = MlflowClient(tracking_uri=tracking_uri)
    client.search_experiments()
    print(f"MLflow server reachable at {tracking_uri}")
except Exception:
    print(
        f"โš ๏ธ  MLflow server not reachable at {tracking_uri}.\n"
        "To track task runs start the server first:\n\n"
        "    mlflow server --backend-store-uri sqlite:///mlflow.db\n\n"
        "Then re-run this cell before continuing."
    )
    raise
MLflow server reachable at http://127.0.0.1:5000
async def task(question: str) -> str:
    result = await agent.run(question)
    return result.output

result = await task("Whats the capital of France?")
result
'The capital of France is Paris.'
from ragpill import evaluate_testset_with_mlflow
from IPython.display import display

async def task(question: str) -> str:
    result = await agent.run(question)
    return result.output


results_df = await evaluate_testset_with_mlflow(
    testset=testset,
    task=task,
    model_params={
        "system_prompt": system_prompt,
        "temperature": str(temperature),
        "model_name": model_name,
    }
)
display(results_df)
Output()
/Users/janine/ragpill/.venv/lib/python3.11/site-packages/pydantic_ai/models/openai.py:363: 
UserWarning: Sampling parameters ['temperature'] are not supported when reasoning is enabled. These settings will 
be ignored.
  warnings.warn(


๐Ÿƒ View run shivering-eel-285 at: http://127.0.0.1:5000/#/experiments/1/runs/41a2f7621614499397858b8b710fd700
๐Ÿงช View experiment at: http://127.0.0.1:5000/#/experiments/1
inputs output evaluator_result evaluator_data evaluator_reason expected mandatory attributes tags task_duration evaluator_name case_name case_id source_type source_id input_key trace_id
0 How many r`s are in 'strawberry'? State only t... 3 True {'evaluation_name': UUID('862e3b43-0816-4914-8... Regex pattern "1" did not match output. False True b'{"author":"joel"}' {} 8.300706 RegexInOutputEvaluator Case 2 186316f6b7363af58747eb755446be42 CODE RegexInOutputEvaluator 186316f6b7363af58747eb755446be42 tr-31fbe3fa24195939e22b9cb3accd3309
1 How many r`s are in 'strawberry'? State only t... 3 True {'evaluation_name': UUID('c3bffbad-f529-48bd-9... Regex pattern "2" did not match output. False True b'{"author":"joel"}' {} 8.300706 RegexInOutputEvaluator_2 Case 2 186316f6b7363af58747eb755446be42 CODE RegexInOutputEvaluator 186316f6b7363af58747eb755446be42 tr-31fbe3fa24195939e22b9cb3accd3309
2 How many r`s are in 'strawberry'? State only t... 3 True {'evaluation_name': UUID('1841aa0b-c08b-4e22-8... Regex pattern "3" matched output. True True b'{"author":"joel"}' {} 8.300706 RegexInOutputEvaluator_3 Case 2 186316f6b7363af58747eb755446be42 CODE RegexInOutputEvaluator 186316f6b7363af58747eb755446be42 tr-31fbe3fa24195939e22b9cb3accd3309
3 How many r`s are in 'strawberry'? State only t... 3 True {'evaluation_name': UUID('8dba0bf5-a737-405b-a... Regex pattern "4" did not match output. False True b'{"author":"joel"}' {} 8.300706 RegexInOutputEvaluator_4 Case 2 186316f6b7363af58747eb755446be42 CODE RegexInOutputEvaluator 186316f6b7363af58747eb755446be42 tr-31fbe3fa24195939e22b9cb3accd3309
4 What is the capital of France? The capital of France is Paris. True Paris is the capital of France The output correctly states that the capital o... True True b'{"author":"janine"}' {geography} 6.488402 LLMJudge Case 1 cb0b4aaf80c43c9973aefeda1bd72890 LLM_JUDGE LLMJudge cb0b4aaf80c43c9973aefeda1bd72890 tr-9c4a0bcdd2ea67283d2a3c1d2a23ab49