Please Measure Verification Burden
I appreciate Theodore Ehrenborg’s and Max von Hippel’s comments.
Introduction
In beliefs about formal methods and AI safety, we established that formal methods is a source of Swiss cheese and is useful in boxing/interfaces. A core premise of the AI control literature is that the blue team is computationally poorer than the red team[1]. There wouldn’t be much of an interesting problem if you could spend as much on audits (whether that’s $/mtoken at inference time, raw token count at inference time, or any other notion of compute) as the untrusted model spends.
As a proof engineer and FMxAI/GSAI movement strategist, it’s important to me that formal methodsicians are presenting their work in a way that makes it most useful for various AI Security schemes it might plug into as crunch time continues to crunch on. If you’re shipping research or products in FMxAI, you can do me a huge favor pretty cheaply that I think will help your work take flight in the high-stakes scenario of advanced AI deployment.
Unfortunately for me, and possibly humanity, proving a program correct is more expensive than writing the program in the first place. So if proofs are in your blue team’s arsenal, and the compliance burden falls on the blue team[2], then you lose the premise that the blue team is computationally poorer than the red team. I would like for AI security people to have available to them principled and reliable data about the degree of the budget asymmetry. I.e., if you’re using AI to prove programs correct, please publish verification burden.
Verification burden
Verification burden is a ratio measuring the cost of proving a program correct relative to the cost of implementing that program up to least-effort QA. So in a Hoare logic setting, under known precondition-postcondition pairs, if it costs 100 tokens for a language model agent loop to implement a program that completes the Hoare triple and it costs 1000 tokens to prove the Hoare triple correct, then we can report a verification burden of 10[3]. Note that you could instantiate cost as tokens, dollars spent on inference (which captures tokens but is sensitive to having multiple model sizes/classes in the system), number of toolcalls or conversation length, a more vibey unit of intelligence or cognitive effort, FLOPS or clocktime, walltime relative to a fixed CPU, LoC, or something else. I’ll use the specific case of tokens as a proxy for cognition for most of what I’m interested in, but the metric should make sense under an arbitrary/abstract notion of “cost” or “compute”.
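To make the two costs concrete, here is a toy sketch in Lean 4 (my own illustration, not drawn from any benchmark below): the implementation tokens pay for filling in the function body, and the proof tokens pay for discharging the postcondition.

```lean
-- Toy precondition/postcondition pair (no precondition needed here).
-- Implementation task: fill in `double`.
def double (n : Nat) : Nat :=
  n + n

-- Proof task: show the postcondition holds.
-- `omega` is a linear-arithmetic tactic; any proof the checker accepts would
-- count, and its cost is what the numerator of the burden ratio measures.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```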
Verification burden in humans
Project managers at proof shops like Galois or SRI have been keenly interested in verification burden in humans for decades. Unprincipled anecdotes and massive confidence intervals on napkins are the law of the land[4]. In humans, a project manager might guess “if it took the client 2 SWE-months to write this program, and we assume spec elicitation is sufficiently straightforward, we should be able to prove it correct in 2-3 postdoc-years”, and then be wrong. An organization’s hard-won intuition is the only way to assess a verification burden estimate; there is nothing scientific about this.
Verification burden in agents
Token measurements (such as the token_counter function provided by the LiteLLM API aggregator) are a great opportunity to measure verification burden in a principled way for the first time. Tokens are much less noisy than the abstract “postdoc-year”, which treats the variance across the skill and grit levels of different postdocs as a nuisance; we can now fix a language model and study its capacity to write a proof and how much effort (i.e., how many loops through the toolcall-inference cycle) it takes to do it.
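As a minimal sketch of what that looks like, assuming you log the full message history of each agent loop (the model name is just a placeholder, and the ratio is exactly the one defined above):

```python
# Per-stage token accounting from logged agent transcripts.
# Assumes each stage's message history (prompts + completions) is saved as a
# list of {"role": ..., "content": ...} dicts.
import litellm


def stage_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Tokens consumed by one agent loop, under the given model's tokenizer."""
    return litellm.token_counter(model=model, messages=messages)


def verification_burden(impl_messages: list[dict],
                        proof_messages: list[dict],
                        model: str = "gpt-4o") -> float:
    """Proof-stage tokens divided by implementation-stage tokens."""
    return stage_tokens(proof_messages, model) / stage_tokens(impl_messages, model)
```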
I expect[5], based on vibey/anecdotal intuition as a proof engineer and Lean agent/synthetic data expert, given that I’ve only measured this for real on problems that are too toy to count:
Verification burden varies by about 0.5-1 OOM across different problems from the same application area.
Verification burden varies by about 1-2 OOMs on the average problem across different application areas.
AI security scheme designers need to be able to evaluate the budget-realism of formal methods
“Just verify it” is not a free action. When you have the spec, program, and proof, running the proof checker should[6] be O(laptop), but the question is the cognitive investment in shipping the proof.
If verification burden is low enough, an AI security plan that leans on FM could be the cheapest reliable way to buy safety. If verification burden is high enough, then FM becomes a luxury good we can only afford on the narrowest, most safety-critical surfaces (boxing boundaries, tool interfaces, small kernels). Without principled burden measurements, scheme designers are forced into one of two failure modes:
Over-reliance: assuming verification is cheap and scaling it across large surface areas, yielding a scheme that looks crisp on paper but collapses in the real world because nobody can pay the proof bill.
Under-reliance: assuming verification is impossibly expensive and excluding it from designs where it would actually be a uniquely good deal, yielding a scheme that misses a high-leverage safety tool.
So: verification burden data is not just nice engineering metadata. It determines which AI security schemes are even coherent under realistic asymmetry assumptions.
Upfront vs ongoing cost
This is mostly to be explored in a later post[7]. But briefly, this is the question of whether the prover tokens are spent once (like proving a module of an operating system) or on an ongoing basis (like repairing proofs as that module gets updated). Under a fixed enough problem domain, it will always be easy to drive verification burden down through one-time investment into boutique custom tactics that don’t generalize out of that problem domain. Substantial upfront cost for major runtime savings. In a kinda Big O sense, this one-time investment plays the role of a constant factor, so it may or may not be strategically important for the application you care about. To be clear, right now I think what’s important about verification burden is mostly this Big O view.
Please report verification burden
So if you’re shipping program/proof synthesis results, please report verification burden. It’s a cheap add-on to your metrics. From the perspective of AI security under finite budgets, the key question is not “can we verify it in principle?” it’s “can we afford to verify it faster than the red team can break it?”. Note that you may adjust the basic vibe of a verification burden to your use case. Let’s create a science of verification burden!
Appendix: stop reading now if this is not your actual field
What should FMxAI authors report?
If you’re shipping FMxAI research or a verification-heavy product, here’s the cheap favor: publish a verification burden table alongside your results.
At minimum, please report (a schema sketch follows this list):
Implementation cost: tokens (or cash) for an LM agent to implement the artifact up to least-effort QA, i.e. passing tests/basic sanity checks, not full proofs.
Proof cost: tokens for the LM agent to produce a correct proof (including repair loops).
Burden ratio: proof cost / implementation cost.
Confidence interval / variance across tasks: burden is spiky; averages alone mislead.
Task metadata so others can compare: language, spec style, proof assistant, automation level, model used, tool stack, stopping criteria.
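For concreteness, a minimal reporting schema might look like the sketch below; the field names are mine, not a standard, and you should extend them with whatever your pipeline already logs.

```python
# Sketch of a minimum verification-burden report row; all field names are
# illustrative, not an established schema.
from dataclasses import dataclass


@dataclass
class BurdenRow:
    task_id: str
    impl_tokens: int         # implementation up to least-effort QA (tests/sanity checks)
    proof_tokens: int        # proof, including repair loops
    language: str            # e.g. "Lean 4"
    spec_style: str          # e.g. "pre/post conditions"
    proof_assistant: str     # name and version
    model: str               # same fixed model for both stages
    automation: str          # tactic libraries, hammers, tool stack
    stopping_criterion: str  # e.g. "proof accepted or budget exhausted"

    @property
    def burden(self) -> float:
        return self.proof_tokens / self.impl_tokens
```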
How to measure verification burden in agents
A workable experimental recipe (a harness sketch follows this list):
Fix the model and toolchain. Same model class, same temperature, same retrieval, same proof assistant version, same tactic libraries.
Run an implementation agent.
Inputs: spec and any allowed context.
Goal: produce working code with minimal QA (tests, lint, typecheck).
Record:
total prompt and completion tokens
number of tool calls / iterations
walltime if you can
Run a proof agent.
Inputs: the same spec, the produced code, and the proof environment.
Goal: proved theorem / verified artifact.
Record the same measures.
Compute burden = proof cost / implementation cost.
Repeat across a task suite (even small suites help) and report the distribution.
If you want a single scalar headline, use the median burden and also show IQR / min-max.
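Here is a minimal harness sketch of that recipe, with the agents abstracted as callables so the snippet stays self-contained; every name is illustrative rather than an established API, and you would plug in your own implementation and proof loops.

```python
# Sketch of a verification-burden measurement harness. The two "agents" are
# stand-ins for your implementation loop and proof loop under a fixed model
# and toolchain; they return per-stage measurements.
import statistics
from typing import Callable, NamedTuple


class StageResult(NamedTuple):
    success: bool
    tokens: int        # total prompt + completion tokens for the stage
    tool_calls: int
    walltime_s: float


# An "agent" maps a task id to a StageResult.
Agent = Callable[[str], StageResult]


def measure_suite(tasks: list[str], impl_agent: Agent, proof_agent: Agent) -> list[float]:
    """Run both stages per task; return burden ratios where both stages succeed."""
    burdens = []
    for task in tasks:
        impl = impl_agent(task)
        proof = proof_agent(task)
        if impl.success and proof.success and impl.tokens > 0:
            burdens.append(proof.tokens / impl.tokens)
    return burdens


def headline(burdens: list[float]) -> dict:
    """Median burden plus IQR and min-max, as suggested above (needs >= 2 tasks)."""
    q1, _, q3 = statistics.quantiles(burdens, n=4)
    return {
        "n": len(burdens),
        "median": statistics.median(burdens),
        "iqr": (q1, q3),
        "range": (min(burdens), max(burdens)),
    }
```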
Papers that could’ve measured verification burden but didn’t
FVAPPS: Proving the Coding Interview
The task is, given specs, unit tests, a function signature, and a natural language description, to implement the function and the proof. If I recall correctly (epistemic status: an author), when we did the baselines we had the implementation of the function under test and the proof in different loops. It would’ve been trivial for us to write down verification burden, at least in terms of conversation length (I checked our logs for ways to reverse engineer some proxy, but didn’t see any, because of things we failed to log).
VERINA
They chain their codegen task to their proofgen task, so they’d have the separation of concerns required to measure the tokens required for codegen and those required for proofgen. It would’ve been easy for them to add verification burden! I’m surprised I can’t reverse engineer some proxy from the data they reported, though I might be able to if I emailed the authors for full access to their logs.
CLEVER
CLEVER also infers specs at benchmark creation time (like FVAPPS), but in a much more principled way: they target a reference spec and require an isomorphism proof. Verification burden is almost already there. Their pipeline already isolates the parts you need:
Implementation cost proxy: tokens/time spent in GenerateImpl retries until pi compiles (they even separately report implementation compilation rates).
Proof cost proxy: tokens/time spent in ProveCorrectness retries until Lean accepts the proof (they separately report implementation certification / prove@k).
Spec-side burden (optional): same story for GenerateSpec vs ProveEquivalence.
Because the evaluation algorithm is staged and each stage is retried independently with a shared timeout budget, logging tokens per stage would give burden for free. They don’t log tokens, but they do give coarse difficulty indicators:
Proof lengths and times for successful proofs (Table 2). Spec-equivalence proofs are consistently longer than correctness proofs, and take longer to find when they exist.
Success rates per stage (Table 1 / Figure 6): compilation is high, proving is the bottleneck, especially for spec equivalence.
Human formalization effort: writing each non-computable spec averaged ~25 minutes plus ~15 minutes peer review; manual correctness proofs for a sample ranged from 10 to 225 lines. That’s not agent-token burden, but it’s a human burden datapoint they did measure.
Again, if I emailed the authors for complete logs, I could probably reconstruct a pretty satisfying verification burden.
Verification burden in humans
HACMS has not made it easy to estimate the human verification burden of its projects. seL4, on the other hand, very much has! It’s widely cited that the proofs in seL4 were 20 person-years, and in the paper you can see that the C microkernel implementation was about 4 person-years. So you get a verification burden of 20⁄4 = 5. Space and aeronautics would also be a source of data or intuition.
Since the first draft of this post, I’ve updated toward thinking this premise is actually nonessential, but I will address my fully updated view in the next post. ↩︎
This is a big if. You could easily imagine the compliance burden falling on the red team, because it’s the red team who wants their code executed in the world. An intuition pump for that world: when you provide software to federally insured banks, you have to pay for the SOC 2 auditors as a cost of doing business. ↩︎
This is the setup in my preliminary/prototype whitepaper. ↩︎
But it’s unclear how architectural innovations change the game. It’s kinda primitive to be approaching proofs as a text-to-text problem, and we won’t be stuck with it for much longer if people keep paying attention to FMxAI. For example, if Higher Order Company’s thesis pays out for them. Watch this space. ↩︎
There, we will work with Manheim and Homewood’s distinction, roughly that the difference between control and oversight is the difference between upfront and runtime. ↩︎
My goal with Hypothesis is to undermine this whole approach, by making “use PBT” the max-productivity path and setting as difficult a baseline as possible. So far it’s going pretty well!
Agree that PBT is the control group! Agree that PBT is the 80⁄20 for most people most of the time.
And if you combine Hypothesis with fractional proofs, you can 80⁄20 the difference between just Hypothesis, and proofs!
The baseline for proof burden is just lines of proof / lines of code. For production-grade software verification projects this is 10×–100×.
Models that are bad at verification will do worse.
On ambitious projects (e.g., AlphaProof when it came out), verification might increase capabilities, leading to a verification burden < 1.