Pydantic AI: Powerful Evals Guide (2025)
Systematic Testing with Pydantic Logfire Integration

Version: 1.0.0 (2025)
Framework: Pydantic AI Testing & Evaluation
Focus: Comprehensive agent evaluation, testing, and quality assurance
Table of Contents
- Overview
- Logfire Integration
- Evaluation Frameworks
- Unit Testing Agents
- Integration Testing
- Performance Benchmarking
- Quality Metrics
- Regression Testing
- Production Monitoring
Overview
What are Evals?

Evaluations (Evals) are systematic tests that measure AI agent quality across multiple dimensions; a minimal sketch follows the list below:
- Correctness: Does the agent produce accurate outputs?
- Consistency: Are outputs stable across similar inputs?
- Performance: Response time, token usage, cost
- Safety: Harmful content detection, bias testing
- Robustness: Error handling, edge cases
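Before any framework is involved, a single eval can be as small as a prompt, an expected answer, and a scoring rule. The sketch below illustrates that shape; the `openai:gpt-4o` model string and the naive substring check are placeholder assumptions that later sections replace with Logfire-instrumented suites.

```python
# Minimal eval sketch (assumptions: substring scoring, 'openai:gpt-4o' model).
import asyncio
from pydantic_ai import Agent

agent = Agent('openai:gpt-4o')

CASES = [
    {"input": "What is 2 + 2?", "expected": "4"},           # correctness
    {"input": "Capital of France?", "expected": "Paris"},   # correctness
]

async def main() -> None:
    passed = 0
    for case in CASES:
        result = await agent.run(case["input"])
        # Naive check: does the expected answer appear in the output?
        if case["expected"].lower() in result.output.lower():
            passed += 1
    print(f"pass rate: {passed / len(CASES):.0%}")

asyncio.run(main())
```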
Why Powerful Evals Matter
```python
# ❌ WITHOUT Evals
# Deploy agent → Hope it works → User complaints → Fix issues → Repeat

# ✅ WITH Evals
# Write evals → Test agent → Measure quality → Fix issues → Deploy confidently
```

Benefits:
- Catch issues before production
- Quantify agent improvements
- Prevent regressions
- Build confidence in changes
- Continuous quality monitoring
Logfire Integration
Automatic Instrumentation

```python
import logfire
from pydantic_ai import Agent
import asyncio

# Configure Logfire
logfire.configure(
    token='your-logfire-token',  # From logfire.pydantic.dev
    service_name='agent-evals',
    environment='testing'
)

# Instrument Pydantic AI automatically
logfire.instrument_pydantic_ai()

# All agent operations are now traced
agent = Agent('openai:gpt-4o')

async def run_eval():
    """Run evaluation with automatic tracing."""
    # Every run is traced to Logfire
    result = await agent.run("What is 2 + 2?")
    # View traces at logfire.pydantic.dev
    return result.output

asyncio.run(run_eval())
```
Custom Eval Spans

```python
from pydantic_ai import Agent
import logfire
from pydantic import BaseModel
from typing import List
import time

class EvalResult(BaseModel):
    """Evaluation result structure."""
    test_case: str
    expected: str
    actual: str
    passed: bool
    tokens_used: int
    latency_ms: float

@logfire.instrument("run_eval_suite")
async def run_eval_suite(
    agent: Agent,
    test_cases: List[dict]
) -> List[EvalResult]:
    """Run evaluation suite with Logfire tracing."""
    results = []

    for test_case in test_cases:
        # Create span for each test case
        with logfire.span(
            "eval_test_case",
            test_case=test_case['input']
        ):
            start = time.time()

            # Run agent
            result = await agent.run(test_case['input'])

            latency = (time.time() - start) * 1000

            # Check result
            passed = result.output.strip() == test_case['expected'].strip()

            # Log metrics
            logfire.info(
                "test_case_complete",
                passed=passed,
                tokens=result.usage().total_tokens,
                latency_ms=latency
            )

            # Store result
            eval_result = EvalResult(
                test_case=test_case['input'],
                expected=test_case['expected'],
                actual=result.output,
                passed=passed,
                tokens_used=result.usage().total_tokens,
                latency_ms=latency
            )

            results.append(eval_result)

    # Log overall metrics
    pass_rate = sum(1 for r in results if r.passed) / len(results)
    avg_latency = sum(r.latency_ms for r in results) / len(results)
    total_tokens = sum(r.tokens_used for r in results)

    logfire.info(
        "eval_suite_complete",
        total_tests=len(results),
        pass_rate=pass_rate,
        avg_latency_ms=avg_latency,
        total_tokens=total_tokens
    )

    return results

# Usage
agent = Agent('openai:gpt-4o')

test_cases = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Largest ocean?", "expected": "Pacific Ocean"}
]

results = await run_eval_suite(agent, test_cases)

# View detailed traces in Logfire dashboard
for result in results:
    status = "✅" if result.passed else "❌"
    print(f"{status} {result.test_case}: {result.actual}")
```
Logfire Datasets for Evals

```python
import logfire
from pydantic_ai import Agent
from pydantic import BaseModel
from typing import List

class EvalDataset(BaseModel):
    """Structured eval dataset."""
    name: str
    version: str
    test_cases: List[dict]

# Create eval dataset
math_dataset = EvalDataset(
    name="math_reasoning",
    version="1.0",
    test_cases=[
        {
            "input": "If John has 5 apples and gives 2 to Mary, how many does he have?",
            "expected": "3",
            "category": "arithmetic"
        },
        {
            "input": "What is 25% of 80?",
            "expected": "20",
            "category": "percentage"
        }
    ]
)

# Log dataset to Logfire
logfire.info(
    "eval_dataset_created",
    dataset_name=math_dataset.name,
    version=math_dataset.version,
    num_cases=len(math_dataset.test_cases)
)

@logfire.instrument("run_dataset_eval")
async def evaluate_on_dataset(
    agent: Agent,
    dataset: EvalDataset
) -> dict:
    """Evaluate agent on entire dataset."""
    results_by_category = {}

    for test_case in dataset.test_cases:
        category = test_case.get('category', 'general')

        if category not in results_by_category:
            results_by_category[category] = []

        result = await agent.run(test_case['input'])

        passed = result.output.strip() == test_case['expected']

        results_by_category[category].append({
            'input': test_case['input'],
            'expected': test_case['expected'],
            'actual': result.output,
            'passed': passed
        })

    # Calculate metrics per category
    category_metrics = {}

    for category, results in results_by_category.items():
        pass_rate = sum(1 for r in results if r['passed']) / len(results)

        category_metrics[category] = {
            'pass_rate': pass_rate,
            'total_tests': len(results)
        }

        logfire.info(
            f"category_{category}_results",
            pass_rate=pass_rate,
            total_tests=len(results)
        )

    return category_metrics

# Run evaluation
agent = Agent('openai:gpt-4o')
metrics = await evaluate_on_dataset(agent, math_dataset)

# View in Logfire with filtering by category
print(metrics)
```
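Because `EvalDataset` is a plain Pydantic model, versioned datasets can also be persisted and reloaded as JSON so each dataset version lives alongside the code it evaluates. The sketch below uses standard Pydantic v2 round-trip methods; the file location is an assumption about your project layout.

```python
# Sketch: persist the eval dataset defined above (assumed path, Pydantic v2 API).
from pathlib import Path

dataset_path = Path("datasets/math_reasoning_v1.json")  # assumed location
dataset_path.parent.mkdir(parents=True, exist_ok=True)

# Write the dataset out as JSON
dataset_path.write_text(math_dataset.model_dump_json(indent=2))

# Later (e.g. in CI), reload it before calling evaluate_on_dataset
loaded = EvalDataset.model_validate_json(dataset_path.read_text())
assert loaded.version == math_dataset.version
```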
Evaluation Frameworks

Pytest Integration

```python
import pytest
import pytest_asyncio
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
import logfire

# Configure Logfire for tests
logfire.configure(environment='test')
logfire.instrument_pydantic_ai()

@pytest_asyncio.fixture
async def agent():
    """Create agent for testing."""
    return Agent('openai:gpt-4o')

@pytest_asyncio.fixture
async def test_agent():
    """Create test agent (no API calls)."""
    return Agent(TestModel())

@pytest.mark.asyncio
async def test_simple_query(agent):
    """Test agent with simple query."""
    with logfire.span("test_simple_query"):
        result = await agent.run("What is 2+2?")

    assert "4" in result.output
    assert result.usage().total_tokens > 0

@pytest.mark.asyncio
async def test_structured_output(agent):
    """Test structured output validation."""
    from pydantic import BaseModel

    class MathResult(BaseModel):
        answer: int
        explanation: str

    structured_agent = Agent(
        'openai:gpt-4o',
        output_type=MathResult
    )

    result = await structured_agent.run("What is 10 + 15?")

    assert result.output.answer == 25
    assert len(result.output.explanation) > 0

@pytest.mark.asyncio
async def test_with_mock(test_agent):
    """Test with mocked model (no API calls)."""
    # TestModel returns predefined responses
    result = await test_agent.run("Any query")

    assert result.output is not None
    assert result.usage().total_tokens == 0  # No actual API call

@pytest.mark.parametrize("input,expected", [
    ("2+2", "4"),
    ("5*3", "15"),
    ("10-7", "3"),
])
@pytest.mark.asyncio
async def test_math_operations(agent, input, expected):
    """Parametrized math tests."""
    result = await agent.run(f"What is {input}?")

    assert expected in result.output

# Run tests with: pytest test_agent.py -v
# View test traces in Logfire dashboard
```
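The tests above assume pytest-asyncio is installed, and later sections select tests with custom markers such as `integration` and `slow`. One way to register those markers so `pytest -m integration` runs without warnings is a small `conftest.py` hook; the marker descriptions below are assumptions about this guide's test layout, not Pydantic AI requirements.

```python
# conftest.py (sketch): register the custom markers used later in this guide.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "integration: tests that hit real external services"
    )
    config.addinivalue_line(
        "markers", "slow: long-running tests, excluded from quick runs"
    )
```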
Custom Eval Framework

```python
from pydantic_ai import Agent
from pydantic import BaseModel
from typing import List, Callable, Optional
from enum import Enum
import asyncio
import logfire

class EvalStatus(str, Enum):
    PASSED = "passed"
    FAILED = "failed"
    ERROR = "error"

class EvalCase(BaseModel):
    """Single evaluation case."""
    name: str
    input: str
    expected: Optional[str] = None
    validator: Optional[str] = None  # Function name for custom validation
    tags: List[str] = []

class EvalSuite(BaseModel):
    """Collection of eval cases."""
    name: str
    description: str
    cases: List[EvalCase]
    validators: dict[str, Callable] = {}

class EvalRunner:
    """Run evaluation suites."""

    def __init__(self, agent: Agent):
        self.agent = agent
        logfire.instrument_pydantic_ai()

    async def run_case(
        self,
        case: EvalCase,
        validators: dict[str, Callable]
    ) -> dict:
        """Run single eval case."""
        with logfire.span("eval_case", case_name=case.name):
            try:
                # Run agent
                result = await self.agent.run(case.input)

                # Validate result
                if case.expected:
                    # Simple string match
                    passed = case.expected.lower() in result.output.lower()
                elif case.validator and case.validator in validators:
                    # Custom validator
                    validator_func = validators[case.validator]
                    passed = await validator_func(case.input, result.output)
                else:
                    # No validation - just check for output
                    passed = len(result.output) > 0

                status = EvalStatus.PASSED if passed else EvalStatus.FAILED

                logfire.info(
                    "eval_case_result",
                    status=status.value,
                    tokens=result.usage().total_tokens
                )

                return {
                    "name": case.name,
                    "status": status,
                    "input": case.input,
                    "output": result.output,
                    "expected": case.expected,
                    "tokens": result.usage().total_tokens
                }

            except Exception as e:
                logfire.error("eval_case_error", error=str(e))

                return {
                    "name": case.name,
                    "status": EvalStatus.ERROR,
                    "error": str(e)
                }

    async def run_suite(self, suite: EvalSuite) -> dict:
        """Run entire eval suite."""
        with logfire.span("eval_suite", suite_name=suite.name):
            logfire.info(
                "eval_suite_start",
                suite_name=suite.name,
                total_cases=len(suite.cases)
            )

            # Run all cases
            results = await asyncio.gather(*[
                self.run_case(case, suite.validators)
                for case in suite.cases
            ])

            # Calculate metrics
            passed = sum(1 for r in results if r.get('status') == EvalStatus.PASSED)
            failed = sum(1 for r in results if r.get('status') == EvalStatus.FAILED)
            errors = sum(1 for r in results if r.get('status') == EvalStatus.ERROR)

            pass_rate = passed / len(results) if results else 0

            summary = {
                "suite_name": suite.name,
                "total_cases": len(results),
                "passed": passed,
                "failed": failed,
                "errors": errors,
                "pass_rate": pass_rate,
                "results": results
            }

            logfire.info(
                "eval_suite_complete",
                **summary
            )

            return summary

# Usage
agent = Agent('openai:gpt-4o')

# Define custom validator
async def validate_code_output(input: str, output: str) -> bool:
    """Validate that output contains valid Python code."""
    return "def " in output or "class " in output

# Create eval suite
suite = EvalSuite(
    name="coding_agent_eval",
    description="Test coding agent capabilities",
    cases=[
        EvalCase(
            name="simple_function",
            input="Write a function to add two numbers",
            validator="validate_code_output",
            tags=["coding", "easy"]
        ),
        EvalCase(
            name="class_definition",
            input="Create a User class with name and email",
            validator="validate_code_output",
            tags=["coding", "medium"]
        ),
        EvalCase(
            name="algorithm",
            input="Implement binary search",
            validator="validate_code_output",
            tags=["coding", "hard"]
        )
    ],
    validators={"validate_code_output": validate_code_output}
)

# Run evaluation
runner = EvalRunner(agent)
results = await runner.run_suite(suite)

print(f"Pass rate: {results['pass_rate']:.1%}")
print(f"Passed: {results['passed']}/{results['total_cases']}")
```
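`EvalCase` carries a `tags` list that the runner above never uses. A small, hypothetical helper like the one below shows one way to run only a tagged subset, for example just the "easy" cases in a fast pre-merge job; it reuses only the `EvalSuite`, `EvalRunner`, `suite`, and `runner` objects already defined.

```python
# Hypothetical helper: run only the cases in a suite that carry a given tag.
def filter_suite_by_tag(suite: EvalSuite, tag: str) -> EvalSuite:
    """Return a copy of the suite containing only cases tagged with `tag`."""
    return EvalSuite(
        name=f"{suite.name}[{tag}]",
        description=suite.description,
        cases=[case for case in suite.cases if tag in case.tags],
        validators=suite.validators,
    )

# Usage: quick pre-merge run over the easy cases only
easy_suite = filter_suite_by_tag(suite, "easy")
easy_results = await runner.run_suite(easy_suite)
print(f"Easy-case pass rate: {easy_results['pass_rate']:.1%}")
```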
Unit Testing Agents

TestModel for Unit Tests

```python
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart
from pydantic import BaseModel
import pytest

# 1. TestModel - Returns simple predefined responses
@pytest.mark.asyncio
async def test_with_test_model():
    """Test agent logic without API calls."""
    agent = Agent(TestModel())

    result = await agent.run("Any query")

    # TestModel returns structured test data
    assert result.output is not None
    assert result.usage().total_tokens == 0  # No actual tokens

# 2. FunctionModel - Custom response logic
@pytest.mark.asyncio
async def test_with_function_model():
    """Test with custom response function."""

    def custom_response(
        messages: list[ModelMessage], info: AgentInfo
    ) -> ModelResponse:
        """Generate custom response based on input."""
        last_message = messages[-1].parts[-1].content

        if "math" in last_message.lower():
            return ModelResponse(parts=[TextPart("42")])
        elif "code" in last_message.lower():
            return ModelResponse(parts=[TextPart("def hello(): print('world')")])
        else:
            return ModelResponse(parts=[TextPart("Default response")])

    agent = Agent(FunctionModel(custom_response))

    math_result = await agent.run("Solve this math problem")
    assert "42" in math_result.output

    code_result = await agent.run("Write some code")
    assert "def hello" in code_result.output

# 3. Testing Tools
@pytest.mark.asyncio
async def test_agent_tools():
    """Test agent with tools using TestModel."""
    agent = Agent(TestModel())

    @agent.tool_plain
    async def calculate(a: int, b: int) -> int:
        """Test tool."""
        return a + b

    # Tool logic is tested independently: the decorator returns the
    # original function, so it can be called directly without a RunContext
    result = await calculate(2, 3)
    assert result == 5

# 4. Testing Validators
@pytest.mark.asyncio
async def test_output_validator():
    """Test output validation logic."""
    from pydantic_ai import ModelRetry, RunContext

    class Result(BaseModel):
        value: int

    agent = Agent(TestModel(), output_type=Result)

    @agent.output_validator
    async def validate_positive(ctx: RunContext, output: Result) -> Result:
        """Ensure value is positive."""
        if output.value < 0:
            raise ModelRetry("Value must be positive")
        return output

    # Validator is tested with mock data
    # In production, if model returns negative, it retries
```
Snapshot Testing

```python
from pydantic_ai import Agent
import pytest
import json
from pathlib import Path

class SnapshotTester:
    """Snapshot testing for agent outputs."""

    def __init__(self, snapshot_dir: Path):
        self.snapshot_dir = snapshot_dir
        self.snapshot_dir.mkdir(exist_ok=True)

    def get_snapshot_path(self, test_name: str) -> Path:
        """Get path for snapshot file."""
        return self.snapshot_dir / f"{test_name}.json"

    async def assert_matches_snapshot(
        self,
        test_name: str,
        actual: str,
        update: bool = False
    ):
        """Assert output matches snapshot."""
        snapshot_path = self.get_snapshot_path(test_name)

        if update or not snapshot_path.exists():
            # Create/update snapshot
            snapshot_path.write_text(json.dumps({"output": actual}, indent=2))
            return

        # Load snapshot
        snapshot = json.loads(snapshot_path.read_text())
        expected = snapshot["output"]

        # Compare
        assert actual == expected, f"Output doesn't match snapshot: {test_name}"

# Usage
@pytest.fixture
def snapshot_tester():
    return SnapshotTester(Path(__file__).parent / "snapshots")

@pytest.mark.asyncio
async def test_consistent_output(snapshot_tester):
    """Test that agent output is consistent."""
    agent = Agent('openai:gpt-4o', instructions="Always respond with 'Hello'")

    result = await agent.run("Say hello")

    # Check against snapshot
    await snapshot_tester.assert_matches_snapshot(
        "test_consistent_output",
        result.output,
        update=False  # Set to True to update snapshots
    )

# Run with: pytest --update-snapshots (custom flag)
```
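The `--update-snapshots` flag mentioned above is not built into pytest. One way to wire it up, sketched below under the assumption that the fixtures live in `conftest.py`, is to register the option with `pytest_addoption` and expose it as a fixture; the test above would then pass `update=update_snapshots` instead of `update=False`.

```python
# conftest.py (sketch): register a custom --update-snapshots flag so tests can
# regenerate snapshot files on demand.
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--update-snapshots",
        action="store_true",
        default=False,
        help="Rewrite snapshot files instead of comparing against them",
    )

@pytest.fixture
def update_snapshots(request) -> bool:
    """Expose the flag; pass it as `update=` to assert_matches_snapshot."""
    return request.config.getoption("--update-snapshots")
```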
Integration Testing

End-to-End Tests

```python
from pydantic_ai import Agent
from pydantic_ai.durable.prefect import PrefectDurableAgent
from pydantic_ai.mcp import MCPClient
import pytest
import logfire

@pytest.mark.integration
@pytest.mark.asyncio
async def test_agent_with_mcp_integration():
    """Integration test with MCP server."""
    # Start MCP server
    mcp_client = MCPClient()
    await mcp_client.connect_stdio(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "/tmp/test"]
    )

    # Create agent
    agent = Agent('openai:gpt-4o')

    # Register MCP tools
    tools = await mcp_client.list_tools()
    # ... register tools

    # Test full workflow
    with logfire.span("integration_test_mcp"):
        result = await agent.run("Read file test.txt and summarize")

    assert result.output is not None
    assert len(result.output) > 0

    # Cleanup
    await mcp_client.disconnect()

@pytest.mark.integration
@pytest.mark.asyncio
async def test_durable_agent_integration():
    """Integration test for durable execution."""
    from pydantic_ai.durable import PostgreSQLDurableBackend

    # Initialize backend
    backend = PostgreSQLDurableBackend("postgresql://localhost/test_db")
    await backend.initialize()

    # Create durable agent
    agent = PrefectDurableAgent(
        'openai:gpt-4o',
        backend=backend,
        checkpoint_every_n_steps=1
    )

    workflow_id = "test-workflow-123"

    # Execute workflow
    result = await agent.run(
        "Multi-step task",
        workflow_id=workflow_id
    )

    # Verify checkpoints were created
    checkpoints = await backend.list_checkpoints(workflow_id)
    assert len(checkpoints) > 0

    # Cleanup
    await backend.delete_checkpoints(workflow_id)

@pytest.mark.integration
@pytest.mark.slow
@pytest.mark.asyncio
async def test_multi_agent_coordination():
    """Integration test for multi-agent workflow."""
    from pydantic_ai.a2a import A2AAgentRegistry

    registry = A2AAgentRegistry()

    # Create agents
    researcher = Agent('openai:gpt-4o')
    analyst = Agent('anthropic:claude-3-5-sonnet-latest')
    writer = Agent('openai:gpt-4o')

    # Register agents
    await registry.register_agent(
        "researcher-001",
        ["research"],
        "http://localhost:8001"
    )

    # Execute coordinated workflow
    research_result = await researcher.run("Research AI safety")
    analysis_result = await analyst.run(f"Analyze: {research_result.output}")
    final_report = await writer.run(f"Write report: {analysis_result.output}")

    # Verify workflow completion
    assert len(final_report.output) > 100

# Run with: pytest -m integration
```
Performance Benchmarking

Latency Benchmarks

```python
from pydantic_ai import Agent
import asyncio
import time
import statistics
import logfire
from typing import List

class LatencyBenchmark:
    """Benchmark agent latency."""

    def __init__(self, agent: Agent):
        self.agent = agent

    async def measure_latency(
        self,
        query: str,
        iterations: int = 10
    ) -> dict:
        """Measure latency over multiple iterations."""
        latencies = []

        with logfire.span("latency_benchmark", iterations=iterations):
            for i in range(iterations):
                start = time.perf_counter()

                result = await self.agent.run(query)

                latency = (time.perf_counter() - start) * 1000  # ms

                latencies.append(latency)

                logfire.info(
                    f"iteration_{i}",
                    latency_ms=latency,
                    tokens=result.usage().total_tokens
                )

        # Calculate statistics
        stats = {
            "mean_ms": statistics.mean(latencies),
            "median_ms": statistics.median(latencies),
            "min_ms": min(latencies),
            "max_ms": max(latencies),
            "stddev_ms": statistics.stdev(latencies) if len(latencies) > 1 else 0,
            "p95_ms": sorted(latencies)[int(len(latencies) * 0.95)],
            "p99_ms": sorted(latencies)[int(len(latencies) * 0.99)]
        }

        logfire.info("benchmark_results", **stats)

        return stats

# Usage
agent = Agent('openai:gpt-4o')
benchmark = LatencyBenchmark(agent)

results = await benchmark.measure_latency("What is 2+2?", iterations=100)

print(f"Mean latency: {results['mean_ms']:.2f}ms")
print(f"P95 latency: {results['p95_ms']:.2f}ms")
```
Token Usage Benchmarks

```python
from pydantic_ai import Agent
from typing import List
import logfire

class TokenBenchmark:
    """Benchmark token usage."""

    def __init__(self, agent: Agent):
        self.agent = agent

    async def measure_tokens(
        self,
        queries: List[str]
    ) -> dict:
        """Measure token usage across queries."""
        total_input = 0
        total_output = 0
        total_cost = 0.0

        # Token costs (example rates for GPT-4o)
        input_cost_per_1k = 0.005
        output_cost_per_1k = 0.015

        with logfire.span("token_benchmark"):
            for query in queries:
                result = await self.agent.run(query)
                usage = result.usage()

                total_input += usage.input_tokens
                total_output += usage.output_tokens

                cost = (
                    (usage.input_tokens / 1000) * input_cost_per_1k +
                    (usage.output_tokens / 1000) * output_cost_per_1k
                )
                total_cost += cost

        avg_input = total_input / len(queries)
        avg_output = total_output / len(queries)
        avg_cost = total_cost / len(queries)

        stats = {
            "total_input_tokens": total_input,
            "total_output_tokens": total_output,
            "total_tokens": total_input + total_output,
            "avg_input_tokens": avg_input,
            "avg_output_tokens": avg_output,
            "total_cost_usd": total_cost,
            "avg_cost_usd": avg_cost
        }

        logfire.info("token_benchmark_results", **stats)

        return stats

# Usage
queries = [
    "Summarize: " + "long text" * 100,
    "Translate: " + "text" * 50,
    "Analyze: " + "data" * 75
]

benchmark = TokenBenchmark(agent)
results = await benchmark.measure_tokens(queries)

print(f"Total cost: ${results['total_cost_usd']:.4f}")
print(f"Avg cost per query: ${results['avg_cost_usd']:.4f}")
```
Quality Metrics

Semantic Similarity Eval

```python
from pydantic_ai import Agent
from sentence_transformers import SentenceTransformer
from typing import List
import numpy as np

class SemanticEvaluator:
    """Evaluate semantic similarity of outputs."""

    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def cosine_similarity(
        self,
        text1: str,
        text2: str
    ) -> float:
        """Calculate cosine similarity between texts."""
        embeddings = self.model.encode([text1, text2])

        similarity = np.dot(embeddings[0], embeddings[1]) / (
            np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
        )

        return float(similarity)

    async def evaluate_similarity(
        self,
        agent: Agent,
        test_cases: List[dict],
        threshold: float = 0.8
    ) -> dict:
        """Evaluate if agent outputs are semantically similar to expected."""
        results = []

        for case in test_cases:
            result = await agent.run(case['input'])

            similarity = self.cosine_similarity(
                result.output,
                case['expected']
            )

            passed = similarity >= threshold

            results.append({
                'input': case['input'],
                'similarity': similarity,
                'passed': passed
            })

        pass_rate = sum(1 for r in results if r['passed']) / len(results)

        return {
            'pass_rate': pass_rate,
            'avg_similarity': np.mean([r['similarity'] for r in results]),
            'results': results
        }

# Usage
evaluator = SemanticEvaluator()

test_cases = [
    {
        "input": "What's the capital of France?",
        "expected": "The capital of France is Paris."
    }
]

results = await evaluator.evaluate_similarity(
    agent,
    test_cases,
    threshold=0.8
)
```
Regression Testing

Regression Test Suite

```python
from pydantic_ai import Agent
from typing import List
import logfire
import json
from pathlib import Path
from datetime import datetime

class RegressionTester:
    """Detect regressions in agent quality."""

    def __init__(self, baseline_path: Path):
        self.baseline_path = baseline_path

    async def run_regression_tests(
        self,
        agent: Agent,
        test_cases: List[dict]
    ) -> dict:
        """Run tests and compare to baseline."""
        results = []

        for case in test_cases:
            result = await agent.run(case['input'])

            results.append({
                'input': case['input'],
                'output': result.output,
                'tokens': result.usage().total_tokens
            })

        # Load baseline
        if self.baseline_path.exists():
            baseline = json.loads(self.baseline_path.read_text())

            regressions = self._detect_regressions(results, baseline)

            if regressions:
                logfire.error(
                    "regressions_detected",
                    count=len(regressions),
                    regressions=regressions
                )

                return {
                    'status': 'regression',
                    'regressions': regressions
                }

        # Update baseline
        self.baseline_path.write_text(json.dumps({
            'timestamp': datetime.now().isoformat(),
            'results': results
        }, indent=2))

        logfire.info("regression_tests_passed")

        return {'status': 'passed'}

    def _detect_regressions(
        self,
        current: List[dict],
        baseline: dict
    ) -> List[dict]:
        """Compare current results to baseline."""
        regressions = []
        baseline_results = baseline['results']

        for i, (curr, base) in enumerate(zip(current, baseline_results)):
            # Check if output significantly different
            if curr['output'] != base['output']:
                # Could use semantic similarity here
                regressions.append({
                    'test_index': i,
                    'baseline_output': base['output'],
                    'current_output': curr['output']
                })

        return regressions
```
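Exact string comparison flags a "regression" whenever wording changes at all. A gentler alternative, sketched below, reuses the `SemanticEvaluator` from the Quality Metrics section and only reports pairs whose similarity drops below a threshold; the 0.85 value is an arbitrary assumption to tune for your own outputs.

```python
# Sketch: semantic regression check, assuming SemanticEvaluator from the
# Quality Metrics section is importable in this module.
from typing import List

def detect_semantic_regressions(
    current: List[dict],
    baseline_results: List[dict],
    evaluator: "SemanticEvaluator",
    threshold: float = 0.85,  # assumed threshold
) -> List[dict]:
    """Flag cases whose output drifted semantically from the baseline."""
    regressions = []
    for i, (curr, base) in enumerate(zip(current, baseline_results)):
        similarity = evaluator.cosine_similarity(curr['output'], base['output'])
        if similarity < threshold:
            regressions.append({
                'test_index': i,
                'similarity': similarity,
                'baseline_output': base['output'],
                'current_output': curr['output'],
            })
    return regressions
```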
Production Monitoring

Continuous Evaluation

```python
from pydantic_ai import Agent
import logfire
from fastapi import FastAPI, BackgroundTasks
import asyncio

app = FastAPI()
agent = Agent('openai:gpt-4o')

class ProductionEvaluator:
    """Continuous evaluation in production."""

    def __init__(self, agent: Agent):
        self.agent = agent
        self.eval_queue = asyncio.Queue()

    async def log_production_request(
        self,
        user_query: str,
        agent_response: str,
        metadata: dict
    ):
        """Log production request for eval."""
        # Add to eval queue
        await self.eval_queue.put({
            'query': user_query,
            'response': agent_response,
            'metadata': metadata
        })

    async def run_continuous_eval(self):
        """Continuously evaluate production requests."""
        while True:
            # Get batch of requests
            batch = []
            for _ in range(10):  # Batch size
                if not self.eval_queue.empty():
                    batch.append(await self.eval_queue.get())

            if batch:
                # Run evals on batch
                with logfire.span("production_eval_batch"):
                    for item in batch:
                        # Example: Check for quality issues
                        has_error = "error" in item['response'].lower()
                        too_short = len(item['response']) < 10

                        if has_error or too_short:
                            logfire.warn(
                                "quality_issue_detected",
                                query=item['query'],
                                issue="error" if has_error else "too_short"
                            )

            await asyncio.sleep(60)  # Eval every minute

evaluator = ProductionEvaluator(agent)

@app.on_event("startup")
async def startup():
    """Start continuous evaluation."""
    asyncio.create_task(evaluator.run_continuous_eval())

@app.post("/api/query")
async def query_agent(
    query: str,
    background_tasks: BackgroundTasks
):
    """Handle query with background evaluation."""
    result = await agent.run(query)

    # Log for evaluation in background
    background_tasks.add_task(
        evaluator.log_production_request,
        query,
        result.output,
        {'tokens': result.usage().total_tokens}
    )

    return {"response": result.output}
```
Summary

Powerful Evals Enable:
- ✅ Systematic agent quality testing
- ✅ Logfire integration for observability
- ✅ Regression detection
- ✅ Performance benchmarking
- ✅ Production monitoring
- ✅ Continuous improvement
Best Practices:
- Write evals before deploying
- Use Logfire for comprehensive tracing
- Test across multiple dimensions (correctness, latency, cost)
- Maintain baseline results for regression detection
- Monitor production continuously
- Update evals as requirements evolve
Next Steps:
- Set up Logfire account and instrumentation
- Create baseline eval suite
- Run benchmarks for latency and cost
- Implement regression testing
- Add production monitoring
- Iterate based on eval results
See Also:
- pydantic_ai_comprehensive_guide.md - Core concepts
- pydantic_ai_production_guide.md - Production deployment
- pydantic_ai_durable_execution.md - Fault tolerance