Self-hosted LLMs are the soft target

There's a thread making the rounds on r/LocalLLaMA — "prompt injection is killing our self-hosted LLM" — and the frustration in it is familiar. Teams move to local, open-source models to cut cost and latency and keep data on-prem, then discover that the model that's easy to host is also the one that's easiest to talk out of its instructions.

Small models fold first

We measured exactly this. On a single forged-history attack — where the attacker plants a fake assistant turn that "already revealed" the secret, then asks the model to repeat itself — six different small open-source models (llama3.2:1b, qwen2.5:0.5b and 1.5b, phi3:mini, gemma2:2b, tinyllama) handed back the protected code verbatim. Frontier models mostly refused; the local ones didn't. The full run is on the proof page.

That's not a knock on local models — it's the trade. You can't host a frontier model on a single GPU, and you shouldn't have to bet your secrets on the judgment of a 1B-parameter model anyway.

Containment doesn't care how small the model is

This is the case for a guardrail that lives outside the model. Bridgekeeper runs in-process in front of your model server — Ollama, vLLM, LM Studio, Text Generation Inference, llama.cpp — sanitizing the inbound request and running outbound DLP on the response. When a small model gets talked into emitting your sensitive content, the egress check stops it before it leaves the box. Through Bridgekeeper, all six of those leaks were blocked, and nothing left the host.

If you're self-hosting precisely to keep data on your infrastructure, the containment layer should live there too — in-process, with no prompts shipped to a third-party guardrail cloud. That's the whole point.