
Evaluation

Verified against google-adk==2.3.0 (google/adk/evaluation/).

ADK ships a first-class evaluation framework built around three concepts: EvalCase (a single conversation to run), EvalSet (a collection of cases), and AgentEvaluator (the engine that runs cases against a live agent and scores the results). The framework integrates with pytest and supports custom metrics.

import asyncio
import pytest
from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.eval_case import EvalCase, Invocation, SessionInput
from google.adk.evaluation.eval_set import EvalSet
from google.adk.evaluation.eval_metrics import PrebuiltMetrics
from google.adk.evaluation.eval_config import EvalConfig
from google.genai import types
# Define a single-turn eval case
case = EvalCase(
    eval_id="add_two_numbers",
    conversation=[
        Invocation(
            user_content=types.Content(
                role="user",
                parts=[types.Part(text="What is 15 + 27?")],
            ),
            final_response=types.Content(
                role="model",
                parts=[types.Part(text="42")],
            ),
        )
    ],
)

eval_set = EvalSet(
    eval_set_id="arithmetic_suite",
    eval_cases=[case],
)

eval_config = EvalConfig(
    criteria={
        PrebuiltMetrics.RESPONSE_MATCH_SCORE.value: 0.8,
    }
)


# Run: agent_module must expose `root_agent` or `get_agent_async`
@pytest.mark.asyncio
async def test_arithmetic():
    await AgentEvaluator.evaluate_eval_set(
        agent_module="my_package.agent",
        eval_set=eval_set,
        eval_config=eval_config,
        num_runs=1,
    )

EvalCase is the atomic unit: a single conversation with its expected outcomes. Defined in evaluation/eval_case.py.

from google.adk.evaluation.eval_case import (
EvalCase, Invocation, SessionInput, IntermediateData
)
from google.genai import types
case = EvalCase(
eval_id="weather_lookup", # unique within an EvalSet
session_input=SessionInput( # optional initial state
app_name="weather_app",
user_id="test_user",
state={"preferred_units": "metric"},
),
conversation=[
Invocation(
user_content=types.Content(
role="user",
parts=[types.Part(text="What's the weather in London?")],
),
final_response=types.Content(
role="model",
parts=[types.Part(text="It's currently 18°C and partly cloudy.")],
),
intermediate_data=IntermediateData(
tool_uses=[
types.FunctionCall(name="get_weather", args={"city": "London"}),
],
),
),
],
final_session_state={"last_city": "London"}, # optional; asserted after the run
)

# Multi-turn case: one Invocation per conversation turn
case = EvalCase(
eval_id="two_turn_booking",
conversation=[
Invocation(
user_content=types.Content(
role="user",
parts=[types.Part(text="Book a table for 2 at 7pm.")],
),
final_response=types.Content(
role="model",
parts=[types.Part(text="Which restaurant?")],
),
),
Invocation(
user_content=types.Content(
role="user",
parts=[types.Part(text="La Trattoria.")],
),
final_response=types.Content(
role="model",
parts=[types.Part(text="Done! Table booked at La Trattoria for 2 at 7pm.")],
),
intermediate_data=IntermediateData(
tool_uses=[
types.FunctionCall(
name="book_table",
args={"restaurant": "La Trattoria", "covers": 2, "time": "19:00"},
),
],
),
),
],
)

Invocation fields:

| Field | Type | Purpose |
| --- | --- | --- |
| user_content | types.Content | The user message for this turn |
| final_response | types.Content \| None | Expected final agent response (used by response metrics) |
| intermediate_data | IntermediateData \| None | Expected tool calls + responses (used by trajectory metrics) |
| rubrics | list[Rubric] \| None | Per-invocation rubrics (used by rubric_based_* metrics) |
| app_details | AppDetails \| None | Override app name / user id for this invocation |

EvalSet is a collection of EvalCase objects. Defined in evaluation/eval_set.py.

from google.adk.evaluation.eval_set import EvalSet
eval_set = EvalSet(
eval_set_id="full_regression",
name="Full regression suite",
description="Tests the booking and weather sub-agents.",
eval_cases=[case], # replace with your list of EvalCase objects
)

# Serialise to JSON file for reuse
with open("eval_data/full_regression.evalset.json", "w") as f:
    f.write(eval_set.model_dump_json(indent=2))

# Load from JSON file
with open("eval_data/full_regression.evalset.json") as f:
    eval_set = EvalSet.model_validate_json(f.read())

EvalConfig maps metric names to thresholds or criterion objects. Defined in evaluation/eval_config.py.

from google.adk.evaluation.eval_config import EvalConfig
from google.adk.evaluation.eval_metrics import (
PrebuiltMetrics,
BaseCriterion,
LlmAsAJudgeCriterion,
ToolTrajectoryCriterion,
JudgeModelOptions,
)
config = EvalConfig(
criteria={
# Simple threshold — the metric must score >= value to pass
PrebuiltMetrics.RESPONSE_MATCH_SCORE.value: 0.7,
# LLM-as-judge with custom model and sampling
PrebuiltMetrics.RESPONSE_EVALUATION_SCORE.value: LlmAsAJudgeCriterion(
threshold=0.8,
judge_model_options=JudgeModelOptions(
judge_model="gemini-2.5-pro",
num_samples=3,
),
),
# Tool trajectory with ordered matching
PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE.value: ToolTrajectoryCriterion(
threshold=1.0,
match_type=ToolTrajectoryCriterion.MatchType.IN_ORDER,
),
}
)

The prebuilt metrics are all defined in evaluation/eval_metrics.py as PrebuiltMetrics:

| Metric key | Class | What it measures |
| --- | --- | --- |
| tool_trajectory_avg_score | ToolTrajectoryCriterion | Whether the agent called the expected tools (EXACT / IN_ORDER / ANY_ORDER) |
| response_match_score | BaseCriterion | Lexical similarity between actual and expected final response |
| response_evaluation_score | LlmAsAJudgeCriterion | LLM judge rating of response quality |
| final_response_match_v2 | LlmAsAJudgeCriterion | Semantic match using an LLM judge (v2, more robust) |
| safety_v1 | BaseCriterion | Safety / toxicity score |
| hallucinations_v1 | HallucinationsCriterion | Detects factual hallucinations |
| rubric_based_final_response_quality_v1 | RubricsBasedCriterion | Rubric-scored response quality |
| rubric_based_tool_use_quality_v1 | RubricsBasedCriterion | Rubric-scored tool selection |
| multi_turn_task_success_v1 | | Whether a multi-turn task succeeded end-to-end |
| multi_turn_trajectory_quality_v1 | | Quality of the full multi-turn trajectory |
| multi_turn_tool_use_quality_v1 | | Tool use quality across all turns |

from google.adk.evaluation.eval_metrics import ToolTrajectoryCriterion
# EXACT — actual calls must match expected calls precisely
ToolTrajectoryCriterion(threshold=1.0, match_type=ToolTrajectoryCriterion.MatchType.EXACT)
# IN_ORDER — expected calls must appear in the actual trajectory in order
# (extra calls are allowed between them)
ToolTrajectoryCriterion(threshold=1.0, match_type=ToolTrajectoryCriterion.MatchType.IN_ORDER)
# ANY_ORDER — all expected calls must appear, order doesn't matter
ToolTrajectoryCriterion(threshold=1.0, match_type=ToolTrajectoryCriterion.MatchType.ANY_ORDER)
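
The three match types can be easier to reason about with a small stand-alone sketch. The code below is not ADK's matcher: it models each tool call as a hypothetical `(name, args)` tuple instead of `types.FunctionCall`, and it assumes that ANY_ORDER tolerates extra calls, which the description above does not state explicitly.

```python
from typing import Any

# Hypothetical stand-in for types.FunctionCall: (tool name, args)
Call = tuple[str, dict[str, Any]]


def matches_exact(actual: list[Call], expected: list[Call]) -> bool:
    """Same calls, same order, nothing extra."""
    return actual == expected


def matches_in_order(actual: list[Call], expected: list[Call]) -> bool:
    """Expected calls appear as a subsequence of the actual calls."""
    remaining = iter(actual)
    # `call in remaining` consumes the iterator up to the first match,
    # so each expected call must be found after the previous one.
    return all(call in remaining for call in expected)


def matches_any_order(actual: list[Call], expected: list[Call]) -> bool:
    """Every expected call appears somewhere; order is ignored.

    Assumption: extra actual calls are tolerated.
    """
    unmatched = list(actual)
    for call in expected:
        if call not in unmatched:
            return False
        unmatched.remove(call)
    return True


actual_calls = [
    ("search_flights", {"destination": "Paris"}),
    ("log_event", {}),
    ("book_flight", {"destination": "Paris"}),
]
expected_calls = [
    ("search_flights", {"destination": "Paris"}),
    ("book_flight", {"destination": "Paris"}),
]
reversed_expected = list(reversed(expected_calls))
```

With these inputs, the extra `log_event` call breaks the exact comparison but not the in-order one, and reversing the expected list breaks the in-order comparison but not the any-order one.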

AgentEvaluator is the engine. All methods are @staticmethod. Defined in evaluation/agent_evaluator.py (source-verified for google-adk==2.3.0).

evaluate_eval_set — programmatic, in-memory

from google.adk.evaluation.agent_evaluator import AgentEvaluator
await AgentEvaluator.evaluate_eval_set(
agent_module="my_package.agent", # must expose root_agent or get_agent_async
eval_set=eval_set,
eval_config=eval_config,
num_runs=2, # run each case twice; results are averaged
agent_name=None, # None → root_agent; set to sub-agent name if needed
print_detailed_results=True, # print per-metric breakdown to stdout
)

num_runs=2 (the default) runs each case twice and averages the scores, improving reliability for non-deterministic models. Increase to 5 for stability-sensitive metrics.

How evaluate_eval_set works internally (source-verified):

  1. Loads the agent via _get_agent_for_eval — imports the module and looks for get_agent_async first, then root_agent.
  2. Creates an InMemoryEvalSetsManager and stores the eval set.
  3. Runs each EvalCase num_runs times — calls the agent via a Runner for each turn in the conversation.
  4. After all runs, scores each metric using the registered MetricEvaluatorRegistry.
  5. Averages scores across runs with statistics.mean().
  6. Asserts each metric against its threshold — raises AssertionError if any fails.
  7. Optionally prints a tabular report via pandas/tabulate.
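
Steps 5 and 6 above can be illustrated with a short sketch. This is not the library's code; `check_thresholds` and its dictionary inputs are hypothetical names chosen for the illustration, and the pass rule (`score >= threshold`) follows the criteria comment shown earlier.

```python
import statistics


def check_thresholds(
    scores_per_run: dict[str, list[float]],
    thresholds: dict[str, float],
) -> dict[str, float]:
    """Average each metric across runs and assert it meets its threshold."""
    averaged = {
        metric: statistics.mean(scores)
        for metric, scores in scores_per_run.items()
    }
    failures = {
        metric: (averaged[metric], threshold)
        for metric, threshold in thresholds.items()
        if averaged[metric] < threshold
    }
    # Mirrors the "raise AssertionError if any metric fails" behaviour
    assert not failures, f"Below threshold (score, threshold): {failures}"
    return averaged


averaged = check_thresholds(
    scores_per_run={
        "tool_trajectory_avg_score": [1.0, 1.0],
        "response_match_score": [0.5, 1.0],  # one weak run, one strong run
    },
    thresholds={
        "tool_trajectory_avg_score": 1.0,
        "response_match_score": 0.7,
    },
)
```

Note that a single weak run (0.5) still passes here because the mean is compared against the threshold, not each individual run.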

For file-based evals, call evaluate with a dataset file or directory:

await AgentEvaluator.evaluate(
    agent_module="my_package.agent",
    eval_dataset_file_path_or_dir="tests/eval_data/",  # .test.json or directory
    num_runs=2,
    initial_session_file="tests/initial_session.json",
)

eval_dataset_file_path_or_dir can be:

  • A path to a single .test.json file (old format) or .evalset.json file (new EvalSet format).
  • A directory — ADK recursively finds all *.test.json files. Note: directory scanning uses the old .test.json suffix only; pass individual .evalset.json paths explicitly.
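
The path-resolution rule above can be sketched with `pathlib`. `find_eval_files` is a hypothetical helper written for this illustration, not an ADK function; it only mirrors the behaviour described in the two bullets.

```python
from pathlib import Path


def find_eval_files(path_or_dir: str) -> list[Path]:
    """Return the eval files a path resolves to under the rule described above."""
    path = Path(path_or_dir)
    if path.is_file():
        # A single file is used as-is, whatever its suffix
        return [path]
    # Directories are scanned recursively, but only for the old suffix
    return sorted(path.rglob("*.test.json"))
```

Under this rule an `.evalset.json` file sitting in a scanned directory is silently skipped, which is why such files need to be passed by explicit path.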

AgentEvaluator loads the module and looks for (in order):

  1. get_agent_async — an async factory () -> BaseAgent. Checked first.
  2. root_agent — a module-level BaseAgent instance.

# my_package/agent.py
from google.adk.agents import LlmAgent
from google.adk.tools import google_search
root_agent = LlmAgent(
name="research_bot",
model="gemini-2.5-flash",
instruction="Answer questions using web search.",
tools=[google_search],
)

Or with factory for dependency injection:

async def get_agent_async():
    # Can connect to real DBs, inject credentials, etc.
    # create_db_pool and make_db_tool stand in for your own helpers.
    db = await create_db_pool()
    return LlmAgent(
        name="db_agent",
        tools=[make_db_tool(db)],
    )
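
The lookup order described above (factory first, then module-level instance) can be sketched as follows. This is an illustration only: `resolve_agent` is a hypothetical function, plain namespaces stand in for imported modules, and strings stand in for agents.

```python
import asyncio
from types import SimpleNamespace


async def resolve_agent(module):
    """Prefer the async factory; fall back to the module-level instance."""
    if hasattr(module, "get_agent_async"):
        return await module.get_agent_async()
    if hasattr(module, "root_agent"):
        return module.root_agent
    raise ValueError("Module exposes neither get_agent_async nor root_agent")


async def _factory():
    return "agent-from-factory"


# Stand-in "modules": plain namespaces instead of real imports
both = SimpleNamespace(get_agent_async=_factory, root_agent="agent-from-attribute")
only_root = SimpleNamespace(root_agent="agent-from-attribute")

factory_result = asyncio.run(resolve_agent(both))
root_result = asyncio.run(resolve_agent(only_root))
```

A module that defines both will therefore be evaluated through its factory, which matters if the two construct differently configured agents.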

find_config_for_test_file — auto-load eval config


When running file-based evals, AgentEvaluator can auto-discover a test_config.json file in the same folder:

Directory structure:

tests/eval_data/
    test_config.json        ← auto-discovered
    my_suite.evalset.json

tests/eval_data/test_config.json:

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.7
  }
}

To load it manually:

config = AgentEvaluator.find_config_for_test_file("tests/eval_data/my_suite.evalset.json")
print(config.criteria)  # {"tool_trajectory_avg_score": 1.0, ...}
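
The discovery rule (look for `test_config.json` beside the eval file) can be sketched with the standard library. `load_sibling_config` is a hypothetical helper, and returning `None` when no config exists is an assumption made for the sketch; the library's actual fallback may differ.

```python
import json
from pathlib import Path
from typing import Optional


def load_sibling_config(test_file: str) -> Optional[dict]:
    """Look for test_config.json next to test_file and return its criteria."""
    config_path = Path(test_file).parent / "test_config.json"
    if not config_path.exists():
        return None  # assumption: the real fallback behaviour may differ
    return json.loads(config_path.read_text())["criteria"]
```

Because the lookup is by folder, every eval file in the same directory shares one set of thresholds; suites that need different thresholds belong in separate directories.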

Full end-to-end example with result capture

import asyncio
from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.eval_case import EvalCase, Invocation, IntermediateData
from google.adk.evaluation.eval_set import EvalSet
from google.adk.evaluation.eval_config import EvalConfig
from google.adk.evaluation.eval_metrics import PrebuiltMetrics, ToolTrajectoryCriterion
from google.adk.evaluation.local_eval_set_results_manager import LocalEvalSetResultsManager
from google.genai import types
# --- Build eval cases --------------------------------------------------------
cases = [
EvalCase(
eval_id="unit_conversion",
conversation=[
Invocation(
user_content=types.Content(
role="user",
parts=[types.Part(text="Convert 100 Fahrenheit to Celsius.")],
),
final_response=types.Content(
role="model",
parts=[types.Part(text="37.78")],
),
intermediate_data=IntermediateData(
tool_uses=[
types.FunctionCall(
name="convert_temperature",
args={"value": 100, "from_unit": "F", "to_unit": "C"},
)
]
),
)
],
),
EvalCase(
eval_id="multi_turn_booking",
conversation=[
Invocation(
user_content=types.Content(
role="user",
parts=[types.Part(text="Book a flight to Paris.")],
),
final_response=types.Content(
role="model",
parts=[types.Part(text="Which date would you like to travel?")],
),
),
Invocation(
user_content=types.Content(
role="user",
parts=[types.Part(text="June 15th.")],
),
final_response=types.Content(
role="model",
parts=[types.Part(text="Done! Flight booked for June 15th to Paris.")],
),
intermediate_data=IntermediateData(
tool_uses=[
types.FunctionCall(
name="book_flight",
args={"destination": "Paris", "date": "2026-06-15"},
)
]
),
),
],
),
]
eval_set = EvalSet(eval_set_id="travel_agent_suite", eval_cases=cases)
eval_config = EvalConfig(
criteria={
PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE.value: ToolTrajectoryCriterion(
threshold=1.0,
match_type=ToolTrajectoryCriterion.MatchType.IN_ORDER,
),
PrebuiltMetrics.RESPONSE_MATCH_SCORE.value: 0.5,
}
)
# --- Run evaluation ----------------------------------------------------------
async def main():
    await AgentEvaluator.evaluate_eval_set(
        agent_module="my_package.travel_agent",
        eval_set=eval_set,
        eval_config=eval_config,
        num_runs=2,
        print_detailed_results=True,
    )


asyncio.run(main())
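
# Save results locally after a run
from google.adk.evaluation.local_eval_set_results_manager import (
    LocalEvalSetResultsManager,
)

results_manager = LocalEvalSetResultsManager(results_dir="./eval_results")


async def run_and_save():
    # `await` is only valid inside an async function.
    # Pass the same arguments to evaluate_eval_set as in main().
    result = await AgentEvaluator.evaluate_eval_set(...)
    # After evaluate_eval_set completes, save results
    await results_manager.save_eval_set_result(result)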

Or use GcsEvalSetResultsManager to persist to Cloud Storage:

from google.adk.evaluation.gcs_eval_set_results_manager import GcsEvalSetResultsManager
results_manager = GcsEvalSetResultsManager(
bucket_name="my-eval-results",
eval_storage_dir="runs/",
)

# my_package/metrics.py: implement the metric function in an importable module
def my_length_metric(
    actual_invocation,
    expected_invocation,
    criterion,
) -> float:
    """Returns 1.0 if the response is ≤ 100 chars, else 0.0."""
    if not actual_invocation.final_response:
        return 0.0
    text = "".join(
        p.text or ""
        for p in actual_invocation.final_response.parts or []
    )
    return 1.0 if len(text) <= 100 else 0.0


# Eval setup: register the function by its dotted path
from google.adk.evaluation.eval_config import EvalConfig, CustomMetricConfig
from google.adk.agents.common_configs import CodeConfig

config = EvalConfig(
    criteria={
        "response_brevity": 1.0,  # threshold to pass
    },
    custom_metrics={
        "response_brevity": CustomMetricConfig(
            code_config=CodeConfig(name="my_package.metrics.my_length_metric"),
        ),
    },
)
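
A custom metric like this is plain Python, so it can be unit-tested without running an agent. The sketch below repeats the metric function and feeds it stand-in objects; `fake_invocation` is a hypothetical helper that provides only the attributes the metric reads (`final_response.parts[].text`), not a real Invocation.

```python
from types import SimpleNamespace


def my_length_metric(actual_invocation, expected_invocation, criterion) -> float:
    """Returns 1.0 if the response is at most 100 chars, else 0.0."""
    if not actual_invocation.final_response:
        return 0.0
    text = "".join(
        p.text or ""
        for p in actual_invocation.final_response.parts or []
    )
    return 1.0 if len(text) <= 100 else 0.0


def fake_invocation(text):
    """Stand-in exposing only final_response.parts[].text."""
    if text is None:
        return SimpleNamespace(final_response=None)
    part = SimpleNamespace(text=text)
    return SimpleNamespace(final_response=SimpleNamespace(parts=[part]))


short_score = my_length_metric(fake_invocation("42"), None, None)
boundary_score = my_length_metric(fake_invocation("x" * 100), None, None)
long_score = my_length_metric(fake_invocation("x" * 101), None, None)
missing_score = my_length_metric(fake_invocation(None), None, None)
```

Checking the boundary (exactly 100 characters) and the missing-response case up front avoids discovering an off-by-one only after a full eval run.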

Rubrics let you score responses against structured criteria instead of a single binary pass/fail:

from google.adk.evaluation.eval_rubrics import Rubric, RubricScore
from google.adk.evaluation.eval_metrics import RubricsBasedCriterion
rubrics = [
Rubric(
criterion="The response must cite at least one source URL.",
points=1,
),
Rubric(
criterion="The response must be written in plain English, no jargon.",
points=1,
),
Rubric(
criterion="The response must be under 200 words.",
points=1,
),
]
config = EvalConfig(
criteria={
"rubric_based_final_response_quality_v1": RubricsBasedCriterion(
threshold=0.8, # fraction of total rubric points required
rubrics=rubrics,
),
}
)

# tests/test_agent.py
from pathlib import Path

import pytest
from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.eval_set import EvalSet
from google.adk.evaluation.eval_config import EvalConfig
from google.adk.evaluation.eval_metrics import PrebuiltMetrics

# Read the eval set once at import time; read_text() closes the file afterwards
EVAL_SET = EvalSet.model_validate_json(
    Path("tests/eval_data/regression.evalset.json").read_text()
)

EVAL_CONFIG = EvalConfig(
    criteria={
        PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE.value: 1.0,
        PrebuiltMetrics.RESPONSE_MATCH_SCORE.value: 0.7,
    }
)


@pytest.mark.asyncio
async def test_agent_regression():
    await AgentEvaluator.evaluate_eval_set(
        agent_module="my_package.agent",
        eval_set=EVAL_SET,
        eval_config=EVAL_CONFIG,
        num_runs=2,
    )

Run with:

pytest tests/test_agent.py -v

For tooling compatibility, save eval cases as JSON. The recommended format (new schema) is an EvalSet JSON:

{
"evalSetId": "arithmetic_suite",
"evalCases": [
{
"evalId": "add_two_numbers",
"conversation": [
{
"userContent": {
"role": "user",
"parts": [{ "text": "What is 15 + 27?" }]
},
"finalResponse": {
"role": "model",
"parts": [{ "text": "42" }]
},
"intermediateData": {
"toolUses": []
}
}
]
}
]
}

Save as tests/eval_data/arithmetic_suite.evalset.json. The old .test.json format is still accepted but will emit a migration warning — use AgentEvaluator.migrate_eval_data_to_new_schema() to convert.

AgentEvaluator.migrate_eval_data_to_new_schema(
old_eval_data_file="tests/eval_data/old_tests.test.json",
new_eval_data_file="tests/eval_data/old_tests.evalset.json",
initial_session_file="tests/initial_session.json",
)
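
Before committing a hand-written EvalSet JSON file, a quick structural check can catch common mistakes. The sketch below uses only the standard library and the key names from the JSON example above; `check_eval_set` is a hypothetical helper, and it checks just two things stated in this document: eval IDs are unique within a set, and every invocation has user content.

```python
import json


def check_eval_set(raw: str) -> list[str]:
    """Return a list of problems found in an EvalSet JSON document."""
    data = json.loads(raw)
    problems = []
    if not data.get("evalSetId"):
        problems.append("missing evalSetId")
    seen_ids = set()
    for index, case in enumerate(data.get("evalCases", [])):
        eval_id = case.get("evalId")
        if not eval_id:
            problems.append(f"case {index}: missing evalId")
        elif eval_id in seen_ids:
            problems.append(f"case {index}: duplicate evalId {eval_id!r}")
        seen_ids.add(eval_id)
        for turn, invocation in enumerate(case.get("conversation", [])):
            if "userContent" not in invocation:
                problems.append(f"case {index}, turn {turn}: missing userContent")
    return problems
```

This does not replace schema validation by `EvalSet.model_validate_json`; it is a cheap pre-check that produces readable messages.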

Record expected tool calls from a golden run. In CI, ToolTrajectoryCriterion(match_type=IN_ORDER, threshold=1.0) fails the build if the agent forgets a required tool or calls them out of order.

Use RESPONSE_EVALUATION_SCORE or FINAL_RESPONSE_MATCH_V2 with num_samples=5 to get stable scores. Reserve expensive judge metrics for nightly runs; use RESPONSE_MATCH_SCORE (lexical) in fast PR checks.
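
To build intuition for what a lexical metric rewards, here is a minimal unigram-overlap F1. It is an illustrative stand-in for "lexical similarity", not necessarily the formula ADK uses for RESPONSE_MATCH_SCORE, and the whitespace tokenisation is a simplifying assumption.

```python
from collections import Counter


def unigram_f1(actual: str, expected: str) -> float:
    """F1 over lowercased, whitespace-separated tokens."""
    actual_tokens = Counter(actual.lower().split())
    expected_tokens = Counter(expected.lower().split())
    # Multiset intersection: each token counts at most as often as in both texts
    overlap = sum((actual_tokens & expected_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(actual_tokens.values())
    recall = overlap / sum(expected_tokens.values())
    return 2 * precision * recall / (precision + recall)


terse = unigram_f1("42", "42")
verbose = unigram_f1("The answer is 42", "42")
```

Under this measure a correct but wordy answer scores well below a terse one, so expected responses should be phrased the way the agent tends to answer, or the threshold set with that gap in mind.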

With three rubrics worth 1 point each, a threshold of 0.66 passes responses that satisfy at least 2 of the 3 criteria (2/3 ≈ 0.667, which falls just short of 0.67). Run with num_runs=3 to smooth out judge variance.
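
Thresholds near 2/3 are sensitive to rounding, which a few lines of arithmetic make concrete. The sketch assumes the score is the fraction of total rubric points awarded and that a metric passes when `score >= threshold`; `rubric_score` is a hypothetical helper.

```python
def rubric_score(points_awarded: list[int], points_possible: list[int]) -> float:
    """Fraction of total rubric points awarded."""
    return sum(points_awarded) / sum(points_possible)


two_of_three = rubric_score([1, 1, 0], [1, 1, 1])  # roughly 0.667

passes_at_066 = two_of_three >= 0.66
passes_at_067 = two_of_three >= 0.67
passes_at_080 = two_of_three >= 0.80
```

Only the 0.66 threshold accepts a 2-of-3 result; 0.67 rejects it by a margin of about 0.003, and 0.8 effectively requires all three rubrics to pass.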

Set agent_name="specialist_bot" on evaluate_eval_set to evaluate a sub-agent in isolation, bypassing the root agent’s routing.

Populate final_session_state={"order_confirmed": True} in the EvalCase. ADK asserts the session state matches after the conversation completes. Combine with tool trajectory to verify both the path and the outcome.

  • agent_module must be an importable dotted path (e.g. "my_package.agent"), not a file path. The module must be on sys.path.
  • num_runs=1 can produce flaky results for non-deterministic models. Use num_runs=2 (the default) or higher for metrics that use LLM judges.
  • The criteria dict key must exactly match the PrebuiltMetrics.value string (e.g. "tool_trajectory_avg_score") or a custom metric name registered in custom_metrics.
  • RESPONSE_EVALUATION_SCORE is inherently unstable — the docstring in source says “this evaluation is not very stable”. Treat it as a soft signal, not a hard gate.
  • Old .test.json files are accepted but emit a deprecation warning. Migrate to EvalSet JSON to suppress the warning.
  • SessionInput.state sets the initial session state before the first turn. Mutations during the conversation are not reflected back to session_input.
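
The first point above (dotted path, importable from sys.path) can be checked before an eval run with the standard library. `is_importable` is a hypothetical helper for this illustration; note that `importlib.util.find_spec` imports parent packages as a side effect when given a dotted name.

```python
import importlib.util


def is_importable(dotted_path: str) -> bool:
    """True if dotted_path names a module Python can locate on sys.path."""
    if "/" in dotted_path or "\\" in dotted_path:
        return False  # looks like a file path, not a dotted module path
    try:
        # For "pkg.mod" this imports "pkg" first, then looks up "mod"
        return importlib.util.find_spec(dotted_path) is not None
    except (ImportError, ValueError):
        # Missing parent package, or an empty/invalid module name
        return False
```

For example, `is_importable("my_package/agent.py")` is rejected immediately, while `is_importable("my_package.agent")` depends on whether the package root is on sys.path in the environment where pytest runs.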