Speedrunning 4 mistakes you make when your alignment strategy is based on formal proof

Lots of people think formal (i.e. computer-verified) proofs are part of viable alignment strategies. This trick never works. If you think it might work, I can save you time in four different ways.

I’ve written about this before, but people are still making these mistakes, and this version is clearer and shorter. Recall the deep wisdom that nobody wants to read your shit.

Epistemic status: four years deep into proof assistants. Once, a customer paid me a lot of money to let formal verification rip. I’ve tried and failed to solve alignment four or five times since 2021, each attempt lasting at least a couple of months.

Formal verification is a software strategy based on knowing a specification (spec) a priori, then showing that your implementation satisfies it (is correct). The evergreen comment is: why not just capture important alignment properties in these “specs” and endeavor to implement the system “correctly”? The idea made me optimistic, and if you’re not careful, it’ll make you optimistic too.
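
If you’ve never touched a proof assistant, here is a minimal sketch of the workflow in Coq, with list reversal standing in for the system you care about. The spec is written down first; the theorem certifies that the implementation (the standard library’s rev) meets it.

```coq
Require Import List Lia.
Import ListNotations.

(* The spec, stated a priori: reversing a list preserves its length. *)
Definition spec : Prop :=
  forall (A : Type) (l : list A), length (rev l) = length l.

(* "Correctness": the implementation (stdlib's rev) satisfies the spec. *)
Theorem rev_meets_spec : spec.
Proof.
  unfold spec. intros A l. induction l as [| x xs IH].
  - reflexivity.
  - simpl. rewrite app_length. simpl. rewrite IH. lia.
Qed.
```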

Mistake 1: The gap between the world and the spec hurts you more than the gap between the spec and the implementation

Formal verification lures you into a false sense of security. Your spec is the map, not the territory. Implementing the map correctly is worth nothing if the map is wrong about the territory. This is reminiscent of bog-standard hardware security wisdom: attacks come in below the level of abstraction your model describes.
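
Here is the classic toy version of this failure, as a Coq sketch. The proof genuinely goes through; the problem is everything the spec fails to say.

```coq
Require Import List Sorted.
Import ListNotations.

(* A "verified sort" that throws your data away. *)
Definition bogus_sort (l : list nat) : list nat := [].

(* The map says only "the output is sorted". The territory also wanted
   "the output is a permutation of the input", but the spec never says
   so, and Coq cheerfully certifies the map. *)
Theorem bogus_sort_correct : forall l, Sorted le (bogus_sort l).
Proof. intro l. unfold bogus_sort. constructor. Qed.
```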

Mistake 2: Proofs are Bayesian evidence

A proof you don’t understand does not obligate you to believe anything; it is Bayesian evidence like anything else. If an alien sends you a 1GB Coq file Riemann.v, checking it on your computer does not obligate you to believe that the Riemann hypothesis is true. If you’re ever in that situation, do not let anyone tell you that Coq is so awesome that you don’t roll to disbelieve. 1GB of plaintext is too much; you’ll get exhausted before you understand anything. Do not ask the LLM to summarize the proof.
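
A toy sketch of the point: the proof checker certifies that the proof matches the statement, not that the statement means what its name claims. The file below compiles cleanly and proves nothing about the Riemann hypothesis; deciding whether a real Riemann.v actually states RH is the exhausting human part.

```coq
(* The name is a lie: this "riemann_hypothesis" is just arithmetic. *)
Definition riemann_hypothesis : Prop := 2 + 2 = 4.

(* Compiles. Certifies nothing you care about. *)
Theorem riemann_hypothesis_holds : riemann_hypothesis.
Proof. reflexivity. Qed.
```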

Mistake 3: The proof is not the product

Proofs, users; pick one. A shop that cares about proofs attracts staff who care about proofs and turns off staff who care about users. The only extinction-level threat models I entertain are those deployed by shops that make something people want. If you want an intuition pump, go look at the price of Ethereum (move fast and break things) versus the price of Cardano (formal verification)[1]. Go ahead and write a proof that you’re right about alignment; it literally won’t affect the real world.

Corollary: Proof obligations are slow and expensive

Suppose you solved the social problem and hired both folks excited about proofs and folks excited about users, and they get along. Your proof team is still too slow for your product team, everyone starts resenting each other, and you face investor pressure to fire the proof engineers. Maybe the FDA-But-For-Software makes it illegal to fire them. That buys you real assurance in some threat models.

But in others, humanity has to be lucky every time; paperclips only have to be lucky once. It doesn’t matter that the FDA-But-For-Software would have punished you if literally everyone is dead.

Mistake 4: What you care about and what the spec language lets you express are not friends

You can’t actually specify what you care about (in a competitive specification language that has a viable tooling ecosystem attached to it).

I spent almost a week thinking about circuits and polytopes. I thought that if we could verify across trustless channels that semantic units had been learned, we’d be setting a precedent for verifying things that matter, such as non-deception (in the Hubinger sense). It’s bad enough to pin down alignment properties to the standards of mathematicians; doing it to the standards of computers is worse. If I tried really hard, I think I could build a specification language expressive enough to keep up with what mechinterp was doing in 2020. I’m not betting on myself to build one expressive enough to catch up to mechinterp if mechinterp makes progress on what actually matters in time.
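
To show where this stalls, here is a hedged Coq sketch of what a “non-deception” spec might look like. Every name below (Model, believes, asserts) is a hypothetical placeholder I’m introducing for illustration; writing the definition down is trivial, and giving the placeholders formal meaning for an actual network is the entire open problem.

```coq
(* Hypothetical placeholders: none of these have an agreed formal
   definition for a neural network. That is the whole point. *)
Parameter Model : Type.
Parameter believes : Model -> Prop -> Prop.
Parameter asserts  : Model -> Prop -> Prop.

(* Non-deception, roughly in the Hubinger sense: the model never
   asserts what it does not believe. Easy to write; connecting
   `believes` and `asserts` to weights and activations is not. *)
Definition non_deceptive (m : Model) : Prop :=
  forall p : Prop, asserts m p -> believes m p.
```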

Conclusion

Formal verification for alignment is the trick that never works. But if you’re sure you’re not making these mistakes, then go for it. We’re counting on you.


  1. To be fair, this example is most likely washed out by first-mover advantage. But I have an anecdotal sense from my profession that the Ethereum ecosystem is more enthusiastic and less dysfunctional than the Cardano ecosystem, in spite of the attack surface and the scandals, which is why I include it as an example anyway. ↩︎