While running our multi-step injection suite, one model broke a quiet assumption most guardrails make.
The assumption: the answer is the output
DeepSeek-r1 is a reasoning model: it emits its chain-of-thought in a separate reasoning_content field, distinct from the content field the user sees. On the forged-history attack it worked the secret out in the reasoning channel, and when the response was truncated, the content field was left effectively empty.
A naive outbound check that scans only message.content sees a blank, harmless answer and waves it through. The client, meanwhile, receives the leaked secret in reasoning_content.
The fix: scan both channels
We updated the outbound check to inspect content and reasoning_content, and to block if either fails. With reasoning-content scanning on, the DeepSeek-r1 bypass closed and the run returned to zero leaks across all 88 combinations.
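To make the difference concrete, here is a minimal Python sketch of the two checks. It is an illustration under stated assumptions, not our production code: the SECRET canary, the contains_secret detector, and the function names are all hypothetical, and the message is modelled as a plain dict carrying the content and reasoning_content fields described above.

```python
from typing import Mapping, Optional

# Hypothetical canary planted for the illustration; not a value from our suite.
SECRET = "CANARY-1234"

# Every field that reaches the client must be listed here (assumed field names).
CHANNELS = ("content", "reasoning_content")


def contains_secret(text: Optional[str]) -> bool:
    """Stand-in detector: flags any non-empty text that contains the canary."""
    return bool(text) and SECRET in text


def naive_check(message: Mapping[str, Optional[str]]) -> bool:
    """Return True (allow) based on the rendered answer only."""
    return not contains_secret(message.get("content"))


def outbound_check(message: Mapping[str, Optional[str]]) -> bool:
    """Return True (allow) only if every outbound channel is clean."""
    return not any(contains_secret(message.get(field)) for field in CHANNELS)


# A truncated response: the answer is blank, but the reasoning holds the secret.
truncated = {
    "content": "",
    "reasoning_content": "The earlier turn says the key is CANARY-1234, so",
}

print(naive_check(truncated))     # the naive check allows the response
print(outbound_check(truncated))  # the two-channel check blocks it
```

Keeping the channel list in one place is a deliberate choice in this sketch: when a model adds another output field, the check is extended by adding one name rather than by writing a new code path.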
Why this matters beyond one model
Reasoning models are proliferating, and each one adds an output channel that didn't exist a year ago. Any defense pinned to "the field the UI renders" inherits a blind spot the moment a model starts thinking out loud. Containment has to follow the data, not the interface.
It's also a clean example of why protection has a shelf life. The gap wasn't in the architecture; it was in keeping the heuristics current with how models actually emit text. That's exactly what the maintained feed is for: a gap like this is found in testing, the check is hardened, and the update is pushed out before it becomes someone's incident.