Research: Unvalidated Trust in LLM and Agent Pipelines

Link post

Epistemic status: medium confidence. Evidence comes from text-only experiments under provider default settings. Results are mechanism-level and depend on time and configuration.

Disclosure: light assistance from an AI writing tool. All claims and errors are mine.

TL;DR: This work studies trust between stages in LLM and agent pipelines. When intermediate outputs are passed along without semantic checks, structure or format can be read as intent, and code or actions can be synthesized without any explicit imperative. I catalog 41 mechanism-level failure modes and propose stagewise semantic validation before every handoff.

Why this matters for LW

Most incident discussion focuses on prompts or jailbreaks. In deployed agents the larger risk is architectural: outputs become inputs. If interfaces do not enforce that intent is explicit and data is inert, errors propagate across stages. This is an alignment and reliability issue, not only a filtering issue.

Scope

  • Text-only prompts. Fresh sessions. No tools or external code execution.

  • Provider defaults.

  • Mechanisms generalize across models; behavior varies over time and configuration.

Representative failure modes

  • Form-induced safety deviation. Aesthetic form is read as intent, and harmful code suggestions slip past safety filters.

  • Implicit command via structural affordance. Structured input is treated as a command template even when words like "run" or "execute" never appear.

  • Data as command. Config-like keys inside data blobs are treated as directives and elicit code that implements them (see the sketch after this list).

  • Session-scoped rule persistence. Benign phrasing seeds a latent rule that a harmless trigger reactivates later.
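
To make the data-as-command mechanism concrete, here is a minimal sketch of the hazard pattern. The payload keys and the call_model() helper are hypothetical illustrations, not excerpts from the experiments.

```python
# Minimal sketch of the "data as command" hazard. The payload and the
# call_model() helper are hypothetical; nothing below is an imperative,
# yet a downstream stage may read the keys as directives.
import json

untrusted_blob = json.dumps({
    "report_title": "Q3 metrics",
    "on_load": "purge_old_records",  # config-like key, not a request
    "auto_execute": True,            # reads like a directive to a model
})

# Naive handoff: the blob is interpolated straight into the next stage's
# prompt, so its structure can be read as intent.
next_stage_prompt = (
    "Process this object and do what it specifies:\n" + untrusted_blob
)
# call_model(next_stage_prompt)  # hypothetical; may synthesize purge code
```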

Defenses

  1. Semantic checks before handoff. Intent must be explicit, permitted, and bounded (a minimal gate is sketched after this list).

  2. Data and command separation. Schema-aware guards that treat config tokens as data by default.

  3. Representation hygiene. Normalize labels and formats to reduce cases where form implies intent.

  4. Scoped memory. Explicit lifetimes for session rules; no implicit persistence (see the memory sketch below).
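
A minimal sketch of defenses 1 and 2, assuming each stage hands off a dict with an explicit intent field. The allow-list, the config-key heuristic, and validate_handoff are hypothetical names for illustration, not the implementation used in the study.

```python
# Minimal sketch of a stagewise handoff gate (defenses 1 and 2). The
# allow-list, key heuristic, and function name are illustrative.
ALLOWED_INTENTS = {"summarize", "classify", "extract"}         # assumed allow-list
CONFIG_LIKE_KEYS = {"on_load", "auto_execute", "run", "exec"}  # heuristic

def validate_handoff(message: dict) -> dict:
    # 1. Intent must be explicit and permitted.
    intent = message.get("intent")
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"intent {intent!r} is missing or not permitted")

    # 2. Payload stays inert: config-like keys are quarantined rather than
    #    interpolated into the next stage's prompt as instructions.
    payload = message.get("payload", {})
    quarantined = {k: v for k, v in payload.items() if k in CONFIG_LIKE_KEYS}
    inert = {k: v for k, v in payload.items() if k not in CONFIG_LIKE_KEYS}

    # 3. Bounded: downstream receives only the vetted intent and inert data.
    return {"intent": intent, "payload": inert, "quarantined": quarantined}

# Gate every stage boundary, not just the last one.
msg = {"intent": "summarize", "payload": {"text": "...", "auto_execute": True}}
safe = validate_handoff(msg)  # auto_execute is quarantined, not obeyed
```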
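
And a sketch of defense 4: session rules carry explicit lifetimes and expire on read, so a later harmless trigger cannot reactivate them. SessionRule and ScopedMemory are assumed names, not part of the original write-up.

```python
# Sketch of scoped session memory (defense 4): every rule declares a
# lifetime and expires on read. SessionRule and ScopedMemory are assumed.
import time
from dataclasses import dataclass

@dataclass
class SessionRule:
    text: str
    expires_at: float  # explicit lifetime; no rule lives forever

class ScopedMemory:
    def __init__(self) -> None:
        self._rules: list[SessionRule] = []

    def add(self, text: str, ttl_seconds: float) -> None:
        # A lifetime is mandatory; there is no implicit persistence.
        self._rules.append(SessionRule(text, time.time() + ttl_seconds))

    def active_rules(self) -> list[str]:
        # Expired rules are dropped on read, so a later "harmless trigger"
        # cannot reactivate them.
        now = time.time()
        self._rules = [r for r in self._rules if r.expires_at > now]
        return [r.text for r in self._rules]
```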

Limitations

  • No tool use or real execution.

  • No model-specific invariants; mechanisms are the unit of analysis.

  • Metrics focus on reproducible structure rather than incident severity.
