I applied audit standards to LLM agents, and the result reliably exposes hidden assumptions.
I am an accountant by trade. About a month ago I started using AI agents to automate bookkeeping workflows. It was great at the start: the agent quickly had the site built with authentication, and I was getting excited. Then I found a bug. I would be randomly logged out of the site, or logged out whenever I hit the refresh button. I spent a day watching the AI run in circles trying to fix it, and it couldn't. I was no help, since I had no idea what code the AI had written, so I decided to get to the bottom of it, figure out how to fix it, and prevent it from happening again.
In audit, we have a rule: nothing is assumed true until verified. I realized standard agent workflows violate this constantly. So I forced my agent workflow to follow an adapted audit protocol.
### The solution: RED (Recursive Execution Decomposition)
*The AI came up with that name; I would have called it WTHDYMTA (Why the H*** Did You Make That Assumption!)
RED is a verification-gated protocol for turning hidden assumptions into explicit, auditable work items.
Core idea:
1) Decompose a task into atomic actions until each leaf maps to a registered primitive.
2) For each primitive, verify three things: tool exists, knowledge exists, and access exists.
3) If a dependency cannot be reproducibly verified, it is treated as missing work (not an assumption).
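To make the gate concrete, here is a minimal sketch in Python of what the decompose-and-verify loop could look like. The names (`Primitive`, `verification_gate`, the toy `planner` and registry entries) are my own illustrations, not the exact interfaces from the paper:

```python
from dataclasses import dataclass

@dataclass
class Primitive:
    name: str
    tool: bool        # a registered tool implements this action
    knowledge: bool   # the facts/specs the action needs are on hand
    access: bool      # the credentials/permissions are held

def decompose(task, registry, planner):
    """Recursively split `task` until every leaf maps to a registered primitive."""
    if task in registry:
        return [registry[task]]
    leaves = []
    for sub in planner(task):                 # planner returns the subtasks of a non-primitive task
        leaves.extend(decompose(sub, registry, planner))
    return leaves

def verification_gate(leaves):
    """Check tool, knowledge, and access for each primitive; anything
    unverifiable becomes an explicit work item instead of an assumption."""
    runnable, backlog = [], []
    for p in leaves:
        if p.tool and p.knowledge and p.access:
            runnable.append(p)
        else:
            backlog.append(f"missing dependency for '{p.name}'")
    return runnable, backlog

# Toy usage: "close_the_books" decomposes into two primitives; one fails the gate.
registry = {
    "export_ledger": Primitive("export_ledger", tool=True, knowledge=True, access=True),
    "email_report":  Primitive("email_report",  tool=True, knowledge=True, access=False),
}
planner = lambda task: ["export_ledger", "email_report"]   # stand-in for the agent's planner
runnable, backlog = verification_gate(decompose("close_the_books", registry, planner))
print(backlog)   # ["missing dependency for 'email_report'"]
```

The point of the sketch is the stop condition: decomposition only bottoms out at registered primitives, and anything that fails the tool/knowledge/access check is pushed to a backlog instead of being assumed away.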
RED does not magically fix every error. But it reliably exposes hidden assumptions early and turns silent failures into explicit stops with a concrete backlog.
### Example (from my paper): why long write_file tool calls fail
A frequent failure mode in agentic systems is the model trying to write a long, symbol-heavy markdown document in a single tool call (write_file). The content itself can be correct, but the tool-call wrapper breaks (a serialization boundary issue or a malformed payload), so the whole write fails.
RED forces this boundary into the plan as a first-class dependency:
- chunk the content into safe units
- write each chunk as its own primitive
- verify the file after the writes
Result: you do not eliminate all errors, but you convert a silent, late failure into an explicit, early feasibility constraint with mitigation tasks.
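As a hedged sketch (the chunk size and helper name below are my own assumptions, not the exact procedure from the paper), the mitigation can be as simple as appending small pieces and verifying the file afterwards:

```python
# Illustrative chunk-write-verify mitigation; the chunk size is an assumed
# "safe unit", not a value taken from the paper.
CHUNK_CHARS = 4000  # conservative payload size per write call

def write_in_chunks(path, content, chunk_chars=CHUNK_CHARS):
    """Write `content` as a series of small appends, then verify the result."""
    chunks = [content[i:i + chunk_chars] for i in range(0, len(content), chunk_chars)]
    open(path, "w", encoding="utf-8").close()       # create/truncate the target file
    for chunk in chunks:
        with open(path, "a", encoding="utf-8") as f:
            f.write(chunk)                           # each append is one small primitive
    with open(path, encoding="utf-8") as f:          # post-write verification
        assert f.read() == content, "post-write verification failed"
    return len(chunks)

# Usage: a long, symbol-heavy markdown document written in verifiable pieces.
doc = "# Ledger notes\n" + "| account | debit | credit |\n" * 2000
print(f"wrote {write_in_chunks('ledger_notes.md', doc)} chunks and verified the file")
```

In an agent setting each append would be its own tool call, so a failure surfaces at a specific chunk instead of taking down the whole document.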
Preprint (full details + case studies):
- https://github.com/LeiWang-AI/RED-methodology
### What I want feedback on
- Is the primitive-registry stop condition + verification gate actually novel vs existing planning / checklist methods?
- What is the minimal evaluation you would accept to consider this useful?
- Where do you think this method breaks in practice?
### arXiv endorsement request
I have written this up as a preprint and I am preparing an arXiv submission. I am looking for an arXiv endorsement in one category (cs.AI or cs.SE) so that I can submit it and make it part of the public record.
If you think RED is a useful contribution (even as a methodology paper), I would appreciate an endorsement. If you are willing, reply here or DM and I can send the endorsement link and the exact category.