
Microsoft Agent Framework (Python) — Compaction

Long-running agents blow through the context window. Compaction strategies decide which messages to keep, which to drop, and which to replace with shorter summaries — per turn, before the messages reach the model.

Six first-class strategies ship in agent_framework, plus a CompactionProvider that plugs any strategy into the session pipeline. Verified against agent-framework-core==1.2.2 (agent_framework._compaction).

Compaction sees your conversation as an ordered list of message groups (user turn, assistant turn, tool-call group, system prompt). Each group is atomic — a tool call and its result stay together. A strategy marks groups excluded=True (or replaces them with a summary); the framework projects the included groups and ships them to the model. Your source history is preserved — only the model’s view shrinks.

| Strategy | What it does | Use when |
| --- | --- | --- |
| TruncationStrategy | Keep the first N and last M messages; exclude the middle | Simple FIFO that preserves system + recent |
| SlidingWindowStrategy | Keep the last keep_last_groups non-system groups | Predictable recency window |
| SelectiveToolCallCompactionStrategy | Drop older tool-call groups only | Tool chatter dominates the history |
| ToolResultCompactionStrategy | Collapse older tool calls into one-line summaries | Keep a readable trace but cut tokens |
| SummarizationStrategy | LLM-summarise older groups into a single message | Long chats where earlier context still matters |
| TokenBudgetComposedStrategy | Run other strategies until a token budget is met | Hard limit per turn; compose lighter strategies first |

All implement CompactionStrategy:

class CompactionStrategy(Protocol):
    async def __call__(self, messages: list[Message]) -> bool: ...

Returns True if the message list was modified.
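
TruncationStrategy is the only one of the six without its own example below; a minimal sketch, assuming the first/last counts are exposed as keep_first= and keep_last= (check the constructor in agent_framework._compaction for the exact names):

from agent_framework import TruncationStrategy, apply_compaction

# Keep the first 2 messages (system anchor + opening turn) and the last 10 —
# everything in between is marked excluded. Parameter names are assumptions.
strategy = TruncationStrategy(keep_first=2, keep_last=10)
projected = await apply_compaction(messages, strategy=strategy)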

Sliding-window compaction — the recency default

from agent_framework import SlidingWindowStrategy, apply_compaction

strategy = SlidingWindowStrategy(
    keep_last_groups=20,   # keep the last 20 non-system groups
    preserve_system=True,  # never drop system messages
)
projected = await apply_compaction(messages, strategy=strategy)

Good default. Deterministic, no extra LLM calls, preserves system anchors.

A group is whatever the framework’s group annotator considered a single logical unit — typically a user turn, an assistant turn, or a tool-call group (a function call plus its tool-result message stay together). Counting groups instead of messages is what lets the strategy keep tool calls atomic: it never strands a function_call without its function_result.

import asyncio

from agent_framework import (
    Content,
    EXCLUDED_KEY,
    EXCLUDE_REASON_KEY,
    Message,
    SlidingWindowStrategy,
    apply_compaction,
)

# A short conversation with system anchor + 4 user/assistant pairs + a tool-call group.
messages = [
    Message(role="system", contents=[Content.from_text("Be concise.")]),
    *[
        Message(role=role, contents=[Content.from_text(f"{role} {i}")])
        for i in range(4)
        for role in ("user", "assistant")
    ],
    Message(role="assistant", contents=[Content.from_function_call(
        call_id="c1", name="lookup", arguments={"id": 42},
    )]),
    Message(role="tool", contents=[Content.from_function_result(
        call_id="c1", result={"status": "ok"},
    )]),
]

strategy = SlidingWindowStrategy(keep_last_groups=2, preserve_system=True)
asyncio.run(apply_compaction(messages, strategy=strategy))

excluded = [m for m in messages if m.additional_properties.get(EXCLUDED_KEY)]
print(len(excluded), "messages excluded with reason",
      excluded[0].additional_properties.get(EXCLUDE_REASON_KEY))
# 6 messages excluded with reason sliding_window

What this snippet illustrates:

  • System messages survive preserve_system=True. Even with keep_last_groups=2, the system anchor is untouched and isn’t counted toward the kept-groups quota.
  • The tool-call group counts as one group. Both the function_call and the function_result survive together — keep_last_groups=2 keeps the latest user/assistant turn and the tool group, not 2 messages out of 3.
  • keep_last_groups <= 0 raises ValueError at construction — you cannot configure a window of size zero.
  • preserve_system=False lets the strategy drop the system anchor too. Use this when system instructions are duplicated by an instructions= parameter on the agent (the framework injects them again on each call) and you don’t want to pay for both copies.
  • The strategy mutates messages in place via additional_properties[EXCLUDED_KEY] = True. Call included_messages(messages) to get the projected view, or filter on EXCLUDED_KEY directly when you want to display “32 messages collapsed” in your UI.
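
The last bullet's "32 messages collapsed" counter is a one-liner with the helpers named above (a sketch; `messages` is the list from the snippet):

from agent_framework import included_messages

kept = included_messages(messages)     # the projection the model will see
collapsed = len(messages) - len(kept)  # excluded originals, still present in the list
print(f"{collapsed} messages collapsed")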
Selective tool-call compaction — drop older tool groups

from agent_framework import SelectiveToolCallCompactionStrategy

strategy = SelectiveToolCallCompactionStrategy(keep_last_tool_call_groups=3)

Only touches tool-call groups. Use when an agent polls APIs dozens of times — user/assistant turns keep full fidelity; older tool chatter is dropped.

Tool-result compaction — readable summaries

from agent_framework import ToolResultCompactionStrategy

strategy = ToolResultCompactionStrategy(keep_last_tool_call_groups=1)

Replaces older tool-call groups with a short summary like [Tool results: get_weather: sunny, 18°C; get_forecast: rain Tue] instead of excluding them. You keep the provenance at a fraction of the tokens.

keep_last_tool_call_groups=0 collapses every included tool-call group; the most recent ones are summarised too. keep_last_tool_call_groups=N (the default is 1) leaves the newest N groups untouched.

Given a transcript with three tool-call groups for get_weather / get_forecast, the strategy mutates the message list in place and inserts a synthesised summary message at the start of each compacted group while marking the original messages excluded:

import asyncio

from agent_framework import (
    Content,
    Message,
    ToolResultCompactionStrategy,
    apply_compaction,
)

messages = [
    Message(role="user", contents=[Content.from_text("Weather in Seattle?")]),
    Message(role="assistant", contents=[
        Content.from_function_call("c1", "get_weather", arguments={"city": "Seattle"}),
    ]),
    Message(role="tool", contents=[
        Content.from_function_result("c1", result="sunny, 18°C"),
    ]),
    Message(role="user", contents=[Content.from_text("And Friday?")]),
    Message(role="assistant", contents=[
        Content.from_function_call("c2", "get_forecast", arguments={"city": "Seattle"}),
    ]),
    Message(role="tool", contents=[
        Content.from_function_result("c2", result="clear, 22°C"),
    ]),
]

projected = asyncio.run(apply_compaction(
    messages,
    strategy=ToolResultCompactionStrategy(keep_last_tool_call_groups=1),
))
for m in projected:
    text = m.text or ", ".join(c.type for c in m.contents)
    print(f"[{m.role:9}] {text}")

prints:

[user     ] Weather in Seattle?
[assistant] [Tool results: get_weather: sunny, 18°C]
[user     ] And Friday?
[assistant] function_call
[tool     ] function_result

apply_compaction is the public helper that annotates groups, runs the strategy in place, and returns the projected list a model would actually receive. The first (c1 function_call, function_result) pair is gone from the projection — replaced by the synthetic summary. The most recent group (c2) survives untouched because keep_last_tool_call_groups=1. The raw messages list still contains the originals (with exclusion annotations) so you can render them in a debug UI or inspect annotations later.

The summary message carries summary_of_message_ids and summary_of_group_ids annotations linking back to the originals, so an audit-trail UI can offer a “show original tool calls” affordance without any extra bookkeeping.
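
A sketch of that affordance — assuming the SUMMARY_OF_MESSAGE_IDS_KEY constant (named in the pattern section below) is importable from the top-level package and that each message exposes a message_id matching the ids in the annotation:

from agent_framework import SUMMARY_OF_MESSAGE_IDS_KEY

def originals_for(summary_msg, all_messages):
    """Resolve a summary message back to the tool calls it replaced."""
    ids = set(summary_msg.additional_properties.get(SUMMARY_OF_MESSAGE_IDS_KEY) or [])
    # `message_id` as the per-message identifier is an assumption here.
    return [m for m in all_messages if m.message_id in ids]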

Summarisation — compress older history with an LLM

SummarizationStrategy uses a chat client to summarise older history into a single assistant message:

from agent_framework import SummarizationStrategy
from agent_framework.openai import OpenAIChatClient

summariser = SummarizationStrategy(
    client=OpenAIChatClient(model="gpt-4o-mini"),
    target_count=8,  # leave 8 non-system messages untouched
    threshold=4,     # trigger when included > target + threshold
)

Triggers only when included_non_system_count > target_count + threshold — idle conversations don’t pay for summaries they don’t need. Once triggered, the strategy walks groups oldest-first, keeps the newest ones up to target_count, and replaces the rest with one summary message that links back to the original group IDs via annotations.
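
Concretely, with target_count=8 and threshold=4 from the snippet above:

# Trigger condition: included_non_system_count > target_count + threshold (= 12)
#   12 included messages → 12 > 12 is False → no LLM call this turn
#   13 included messages → 13 > 12 is True  → summarise older groups down to ~8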

The default prompt preserves goals, decisions, and unresolved items. Override with prompt= when you want a different shape:

support_summariser = SummarizationStrategy(
    client=OpenAIChatClient(model="gpt-4o-mini"),
    target_count=6,
    threshold=3,
    prompt=(
        "You are summarising a customer support conversation. "
        "Produce at most 4 bullet points covering:\n"
        "- The customer's problem and any account identifiers mentioned.\n"
        "- Diagnostic steps already attempted.\n"
        "- Agreements or promises made by the support agent.\n"
        "- Any open questions or pending escalations.\n"
        "Do NOT restate pleasantries or repeat exact quotes."
    ),
)

The summary message keeps group_annotation.summary_of_group_ids and summary_of_message_ids metadata so you can later walk back to the original turns — useful for audit trails or “expand this summary” UI affordances.

SummarizationStrategy has two knobs that govern when it runs and how aggressively it shrinks. Both default to small values that are reasonable for chat — but understanding them is the difference between a strategy that fires every turn and one that fires once an hour:

| Parameter | Default | What it controls |
| --- | --- | --- |
| target_count | 4 | After summarisation, how many included non-system messages to retain. The strategy keeps groups from the end of history backward until at least target_count messages remain. |
| threshold | 2 | Hysteresis. The strategy only triggers when the current included non-system message count exceeds target_count + threshold. Set higher to trigger less often, lower to trigger sooner. |

In practice:

# Aggressive — fires once history grows past 6 (target 4 + threshold 2), collapses to ~4.
SummarizationStrategy(client=summariser_client, target_count=4, threshold=2)

# Lazy — only fires once history grows past 60, then collapses to ~40.
SummarizationStrategy(client=summariser_client, target_count=40, threshold=20)

# No hysteresis — re-summarises on every turn that exceeds target_count.
SummarizationStrategy(client=summariser_client, target_count=10, threshold=0)

A high threshold is what keeps you from re-running the summariser on every single turn — the LLM call isn’t free, and stable summaries across multiple turns are usually preferred. The strategy preserves atomic group boundaries, so the actual retained count may exceed target_count slightly (e.g. a tool-call group of 3 messages won’t be split).

If the summariser client fails (timeout, rate limit, parse error), SummarizationStrategy logs a warning and returns False without mutating the message list. Composed strategies (TokenBudgetComposedStrategy) then fall through to the next strategy. Pair with a cheap sliding-window strategy so you degrade predictably when summarisation is unavailable.

Token budgets — enforce a hard cap

Compose cheaper strategies first; fall back to hard exclusion to meet a strict cap:

from agent_framework import (
    CharacterEstimatorTokenizer,
    SelectiveToolCallCompactionStrategy,
    SlidingWindowStrategy,
    SummarizationStrategy,
    TokenBudgetComposedStrategy,
)
from agent_framework.openai import OpenAIChatClient

summariser_client = OpenAIChatClient(model="gpt-4o-mini")

strategy = TokenBudgetComposedStrategy(
    token_budget=8_000,
    tokenizer=CharacterEstimatorTokenizer(),  # swap for a real tokenizer in prod
    strategies=[
        SelectiveToolCallCompactionStrategy(keep_last_tool_call_groups=2),
        SlidingWindowStrategy(keep_last_groups=30),
        SummarizationStrategy(client=summariser_client, target_count=10),
    ],
    early_stop=True,  # stop as soon as budget is satisfied
)

Order matters — put cheap, deterministic strategies first; summarisation last because it spends tokens. If the composed strategies still exceed budget, the built-in fallback excludes oldest non-system groups (and finally system anchors) to enforce the cap.

For exact counts, implement TokenizerProtocol around tiktoken:

import tiktoken

from agent_framework import TokenizerProtocol

class TiktokenTokenizer:
    """Implements `TokenizerProtocol` — a single `count_tokens(text)` method."""

    def __init__(self, model: str = "gpt-4o-mini") -> None:
        self._enc = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str) -> int:
        return len(self._enc.encode(text))

strategy = TokenBudgetComposedStrategy(
    token_budget=8_000,
    tokenizer=TiktokenTokenizer(),
    strategies=[...],
)

CharacterEstimatorTokenizer ships with the framework for when you just need a rough character→token heuristic. The implementation is intentionally trivial — max(1, len(text) // 4) — so it never returns 0 for non-empty text and avoids dragging in a real tokenizer dependency:

from agent_framework import CharacterEstimatorTokenizer

est = CharacterEstimatorTokenizer()
est.count_tokens("hello")      # 1 (5 // 4 → 1)
est.count_tokens("")           # 1 (clamped — never 0)
est.count_tokens("a" * 4_000)  # 1000

It’s good enough for budget-aware fallbacks during dev/test. For production cost accounting use the real tokenizer that matches your model. Any object with a count_tokens(text: str) -> int method satisfies TokenizerProtocol — the protocol is structurally typed (runtime_checkable), so no inheritance is required.
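
Because the protocol is structural, an isinstance check passes without subclassing — a sketch with a toy tokenizer (WordCountTokenizer is illustrative, not part of the framework):

from agent_framework import TokenizerProtocol

class WordCountTokenizer:
    """Toy tokenizer: one token per whitespace-separated word."""
    def count_tokens(self, text: str) -> int:
        return max(1, len(text.split()))

assert isinstance(WordCountTokenizer(), TokenizerProtocol)  # structural check — no inheritance needed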

Wiring compaction into an agent

Two options: per-agent (via the chat client) or per-session (via CompactionProvider). Per-session is more common — it integrates with history providers and persists the compacted state.

from agent_framework import (
    Agent,
    CompactionProvider,
    InMemoryHistoryProvider,
    SlidingWindowStrategy,
    ToolResultCompactionStrategy,
)
from agent_framework.openai import OpenAIChatClient

history = InMemoryHistoryProvider()
compaction = CompactionProvider(
    before_strategy=SlidingWindowStrategy(keep_last_groups=20),
    after_strategy=ToolResultCompactionStrategy(keep_last_tool_call_groups=1),
    history_source_id=history.source_id,
)
agent = Agent(
    client=OpenAIChatClient(),
    instructions="You are a research assistant.",
    context_providers=[history, compaction],
)

session = agent.create_session()
await agent.run("Kick off research on X.", session=session)
await agent.run("Now write the summary.", session=session)  # session history compacted between turns

before_strategy runs when messages are loaded into the run; after_strategy compacts what’s persisted back into session state so the next turn starts smaller. Either can be None to skip that phase.

Knowing exactly when each strategy fires lets you reason about token cost and history retention separately:

agent.run(...)
├─ HistoryProvider.load_session(...)    loads stored messages into context
├─ CompactionProvider.before_run(...)   ← before_strategy mutates loaded context
├─ ──── model call(s) + tool loop ────  ← uses the (now-compacted) context
├─ HistoryProvider.persist(...)         writes new messages into session.state
├─ CompactionProvider.after_run(...)    ← after_strategy mutates stored history
│                                         (looked up by `history_source_id` in session.state)
└─ return AgentResponse

Two practical consequences:

  • Same strategy, different goals. A common pattern is “summarise once it gets big” — a SummarizationStrategy as after_strategy keeps storage compact, while a cheap SlidingWindowStrategy as before_strategy is a safety net in case the next turn still loads more than expected.
  • history_source_id must match the history provider. The after_strategy looks up history under session.state[history_source_id]. If you change the history provider’s source_id (say, you mount two providers as "long_term" and "short_term") you must thread the matching id into CompactionProvider(history_source_id=...) or compaction silently no-ops on persist.
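
A sketch of the two-provider case — the source_id= constructor argument is shown on FileHistoryProvider below; that InMemoryHistoryProvider accepts the same keyword is an assumption:

long_term = FileHistoryProvider(storage_path="./sessions", source_id="long_term")
short_term = InMemoryHistoryProvider(source_id="short_term")  # source_id kwarg assumed

compaction = CompactionProvider(
    after_strategy=SlidingWindowStrategy(keep_last_groups=20),
    history_source_id=long_term.source_id,  # "long_term" — must match the provider to prune
)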

apply_compaction is the entry point you call directly — it’s an async coroutine, so wrap it in asyncio.run (or await it from an async test). Handy in unit tests so you can assert on inclusion without spinning up an agent:

import asyncio

from agent_framework import (
    Content,
    Message,
    SlidingWindowStrategy,
    apply_compaction,
    included_messages,
)

messages = [
    Message(role="system", contents=[Content.from_text("Be concise.")]),
    *(Message(role="user", contents=[Content.from_text(f"q{i}")]) for i in range(10)),
]
asyncio.run(apply_compaction(messages, strategy=SlidingWindowStrategy(keep_last_groups=3)))
assert [m.role for m in included_messages(messages)] == ["system", "user", "user", "user"]

Excluded messages remain in the original list with EXCLUDED_KEY=True annotations — apply_compaction mutates the list in place rather than returning a copy.

For durable sessions across process restarts, replace InMemoryHistoryProvider with FileHistoryProvider (experimental — agent_framework._sessions). It writes one JSON-Lines file per session_id under a root directory:

from agent_framework import Agent, CompactionProvider, FileHistoryProvider, SlidingWindowStrategy
from agent_framework.openai import OpenAIChatClient

history = FileHistoryProvider(
    storage_path="./sessions",
    skip_excluded=True,  # don't reload compacted-out messages on the next turn
)
compaction = CompactionProvider(
    before_strategy=SlidingWindowStrategy(keep_last_groups=20),
    history_source_id=history.source_id,
)
agent = Agent(client=OpenAIChatClient(), context_providers=[history, compaction])

# A distinct session_id picks up the corresponding file on the next call.
session = agent.create_session(session_id="user-42")
await agent.run("Continue where we left off.", session=session)

FileHistoryProvider resolves session_id against storage_path and rejects any id that would escape the storage directory: ../ traversal is blocked and resolved paths are validated against the storage root. Treat the storage directory as trusted application state, not a secret store; the JSONL contents are plaintext. For multi-process deployments, use the Redis or Cosmos history providers instead so concurrent writers don't race on the same file.

The store_inputs, store_outputs, store_context_messages, and load_messages flags let the same class act as an audit log (load_messages=False), a write-only evaluation trace, or a full primary store:

audit_log = FileHistoryProvider(
    storage_path="./audit",
    source_id="audit",
    load_messages=False,          # never reload — purely a write destination
    store_inputs=True,
    store_outputs=True,
    store_context_messages=True,  # also capture messages injected by other providers
)

Any chat client accepts compaction_strategy=. This applies to every call made through that client, regardless of session:

client = OpenAIChatClient(
    compaction_strategy=SlidingWindowStrategy(keep_last_groups=30),
    tokenizer=TiktokenTokenizer(),
)
agent = Agent(client=client, instructions="")

Use this for stateless agents or when you want a client-wide safety net independent of session compaction.

Pass compaction_strategy= and tokenizer= to agent.run(...) for one-off overrides — handy in tests.

await agent.run("", compaction_strategy=SlidingWindowStrategy(keep_last_groups=5))

Custom strategies

Anything callable that matches the protocol works:

from agent_framework import Message
from agent_framework._compaction import set_excluded

async def drop_old_errors(messages: list[Message]) -> bool:
    """Exclude any tool message whose text mentions an error."""
    changed = False
    for m in messages:
        if m.role == "tool" and "error" in (m.text or "").lower():
            changed = set_excluded(m, excluded=True, reason="old_error") or changed
    return changed

Compose with the built-ins via TokenBudgetComposedStrategy or call inline.
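
For example, drop_old_errors slots straight into a composed budget (a sketch — that a bare async callable is accepted alongside strategy instances is an assumption, though both satisfy the same protocol):

from agent_framework import CharacterEstimatorTokenizer, SlidingWindowStrategy, TokenBudgetComposedStrategy

strategy = TokenBudgetComposedStrategy(
    token_budget=8_000,
    tokenizer=CharacterEstimatorTokenizer(),
    strategies=[
        drop_old_errors,                             # custom callable, runs first
        SlidingWindowStrategy(keep_last_groups=20),  # built-in fallback
    ],
)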

Framework helpers let you see the before/after split:

from agent_framework import included_messages, included_token_count

print(len(included_messages(messages)))  # how many survived
print(included_token_count(messages))    # estimated tokens kept

Excluded messages stay in the list tagged with EXCLUDED_KEY=True and EXCLUDE_REASON_KEY — useful for debugging and UI display (“32 messages compacted”).

Default for chat UIs. SlidingWindowStrategy(keep_last_groups=20) as before_strategy, ToolResultCompactionStrategy(keep_last_tool_call_groups=1) as after_strategy. Predictable, no judge tokens.

Research agents with long plans. Use SummarizationStrategy with a cheap model — Phi-3 / Haiku / gpt-4o-mini. The summary retains goals and open questions so the agent stays on-track past 50 turns.

Strict SLA on context size. TokenBudgetComposedStrategy(token_budget=32_000, ...) guarantees you never send >32k tokens, with deterministic fallback even if summarisation fails.

Multi-tool pipelines. Tool results often dwarf user/assistant turns. SelectiveToolCallCompactionStrategy(keep_last_tool_call_groups=2) preserves the important reasoning while dropping the polling noise.

A/B test strategies. Run evaluate_agent (see Evaluation) twice with different compaction_strategy= overrides and compare pass rates.

The full constructor signatures are short. Each recipe below builds on the same shared scaffolding — instantiate it once and reuse it across the rest of this section:

import tiktoken

from agent_framework import (
    InMemoryHistoryProvider,
    TokenizerProtocol,
)
from agent_framework.openai import OpenAIChatClient

class TiktokenTokenizer:
    """Implements `TokenizerProtocol` — a single `count_tokens(text)` method."""

    def __init__(self, model: str = "gpt-4o-mini") -> None:
        self._enc = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str) -> int:
        return len(self._enc.encode(text))

# Cheap model used for summarisation only — separate from the agent's main model.
summary_client = OpenAIChatClient(model="gpt-4o-mini")
tokenizer: TokenizerProtocol = TiktokenTokenizer()
history = InMemoryHistoryProvider()

SummarizationStrategy — long research conversations


When earlier context still matters but you don’t want to ship every turn, summarise the older portion with a cheap model and keep the newest turns intact:

from agent_framework import (
    Agent,
    CompactionProvider,
    SummarizationStrategy,
)

summariser = SummarizationStrategy(
    client=summary_client,  # defined in the shared scaffolding above
    target_count=6,         # keep ~6 most-recent non-system messages verbatim
    threshold=4,            # trigger once we exceed target_count + threshold = 10
    prompt=(
        "Summarise the conversation so far. Preserve: stated goals, open questions, "
        "decisions already made, and unresolved blockers. Drop chit-chat and pleasantries. "
        "Keep the summary under 200 words."
    ),
)
agent = Agent(
    client=OpenAIChatClient(model="gpt-4o"),
    instructions="You are a research assistant for a multi-week project.",
    context_providers=[
        history,
        CompactionProvider(after_strategy=summariser, history_source_id=history.source_id),
    ],
)

SummarizationStrategy writes back-pointers (SUMMARY_OF_MESSAGE_IDS_KEY, SUMMARY_OF_GROUP_IDS_KEY) on the new summary message and SUMMARIZED_BY_SUMMARY_ID_KEY on each replaced message — so debugging tools can show “this summary covered turns 2–14”.

If the summariser call fails, the strategy logs a warning and returns False — your conversation continues uncompacted rather than dying. Pair with a SlidingWindowStrategy as a safety net for the next turn.
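
A sketch of that pairing through CompactionProvider (imports as in the surrounding recipes) — best-effort summarisation on persist, deterministic window on load:

provider = CompactionProvider(
    before_strategy=SlidingWindowStrategy(keep_last_groups=30),  # safety net at load time
    after_strategy=summariser,                                   # LLM compaction on persist
    history_source_id=history.source_id,
)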

TokenBudgetComposedStrategy — guarantee a context ceiling


Run cheaper strategies first; fall through to summarisation only when needed; have a deterministic fallback that excludes oldest groups when even the LLM summariser couldn’t squeeze under budget:

from agent_framework import (
    SlidingWindowStrategy,
    SummarizationStrategy,
    TokenBudgetComposedStrategy,
    ToolResultCompactionStrategy,
)

# `summary_client` and `tokenizer` come from the shared scaffolding above.
budget_strategy = TokenBudgetComposedStrategy(
    token_budget=24_000,  # never ship more than 24k tokens
    tokenizer=tokenizer,
    early_stop=True,      # stop as soon as we're under budget
    strategies=[
        # 1. Cheapest first — drop tool-call chatter.
        ToolResultCompactionStrategy(keep_last_tool_call_groups=2),
        # 2. Then trim the middle of the conversation.
        SlidingWindowStrategy(keep_last_groups=30),
        # 3. Last resort — summarise older content with a cheap model.
        SummarizationStrategy(client=summary_client, target_count=12, threshold=4),
    ],
)

The composed strategy applies each step in order, refreshes token counts, and stops when early_stop=True and the budget is met. If after running every step the count is still over, it deterministically excludes the oldest non-system groups one at a time — no run ever sends more than token_budget. System anchors are dropped only as a last resort.

SelectiveToolCallCompactionStrategy — tool-heavy agents


Polling-style tools (status checks, retries) bloat history fast. Keep just the most recent tool-call group:

from agent_framework import SelectiveToolCallCompactionStrategy

trim_tools = SelectiveToolCallCompactionStrategy(keep_last_tool_call_groups=1)

keep_last_tool_call_groups=0 drops every tool-call group — useful when the assistant has summarised the result into prose and the raw call/response is no longer load-bearing.
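
The zero case in code:

# Drop every tool-call group — keep only user/assistant prose.
drop_all_tools = SelectiveToolCallCompactionStrategy(keep_last_tool_call_groups=0)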

Mixing strategies — before_strategy ≠ after_strategy


The two phases run on different message lists with different goals. A common production layout:

provider = CompactionProvider(
    # Read-time safety net — keeps the live request affordable even if persisted history grew.
    before_strategy=SlidingWindowStrategy(keep_last_groups=40),
    # Persist-time pruning — what gets stored for the next turn.
    after_strategy=TokenBudgetComposedStrategy(
        token_budget=8_000,
        tokenizer=tokenizer,
        strategies=[
            ToolResultCompactionStrategy(keep_last_tool_call_groups=1),
            SummarizationStrategy(client=summary_client, target_count=8),
        ],
    ),
    history_source_id=history.source_id,
)

Read-side window is generous; persisted-side budget is tight. The next turn loads at most 8k tokens, then before_strategy prevents accidentally re-loading more if a different provider also wrote into context.

Inspecting state with the framework constants


The annotation keys are exported as constants — use them when you need to inspect compaction decisions in middleware or debug tooling without re-typing the underlying string:

from agent_framework import (
    EXCLUDED_KEY,
    EXCLUDE_REASON_KEY,
    GROUP_ANNOTATION_KEY,
    GROUP_KIND_KEY,
    SUMMARIZED_BY_SUMMARY_ID_KEY,
)

def explain_compaction(messages):
    """Print one line per message: kept/excluded flag, group kind, text, and reason."""
    for m in messages:
        excluded = m.additional_properties.get(EXCLUDED_KEY, False)
        reason = m.additional_properties.get(EXCLUDE_REASON_KEY, "")
        group = m.additional_properties.get(GROUP_ANNOTATION_KEY) or {}
        kind = group.get(GROUP_KIND_KEY, "?")
        replaced_by = m.additional_properties.get(SUMMARIZED_BY_SUMMARY_ID_KEY, "")
        flag = "✗" if excluded else "✓"
        suffix = f" → {reason}" if reason else ""
        if replaced_by:
            suffix += f" (summarised as {replaced_by})"
        print(f"{flag} [{kind:14}] {(m.text or '')[:80]}{suffix}")

Wire this into a debug AgentMiddleware that runs after compaction to see exactly what the model received vs. what was kept on disk.
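
A minimal usage sketch, reusing `messages` and `strategy` from the examples above:

import asyncio

from agent_framework import apply_compaction

asyncio.run(apply_compaction(messages, strategy=strategy))
explain_compaction(messages)  # every message — kept or excluded — with kind, reason, and summary links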