Peter, thanks again for starting this discussion! Just a few caveats on your summary. We don’t depend on trustable AIs! One of the absolutely amazing and critical characteristics of mathematical proof is that anybody, trusted or not, can create proofs, and we can check them without any need to trust the creator. For example, MetaMath https://us.metamath.org/ defines an especially simple system for which somebody has written a 350-line Python proof checker that can check all 40,000 MetaMath theorems in a few seconds. We need to make sure that that small Python program is correct, but beyond that we don’t need to trust any AIs! It certainly would be nice to have trustable AIs, but I think it is a mistake to depend on them. We are already seeing a wide range of open-source AIs created by different groups with different intentions. And there are many groups working to remove the safety features of commercial Large Language Models and text-to-image models.
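To make the prover/checker separation concrete, here is a toy sketch in Python (not MetaMath itself; the function names and the factoring task are made up purely for illustration): an untrusted routine produces a certificate, and a few trusted lines verify it without trusting the producer.

```python
# Toy illustration of the "untrusted prover, trusted checker" pattern.
# An untrusted party (here, a brute-force search standing in for an AI)
# produces a certificate; a tiny trusted checker verifies it.

def untrusted_factor(n: int) -> tuple[int, int]:
    """Untrusted: claims to return a nontrivial factorization of n."""
    for p in range(2, n):
        if n % p == 0:
            return p, n // p
    raise ValueError("no nontrivial factors")

def trusted_check(n: int, cert: tuple[int, int]) -> bool:
    """Trusted: the only code we must audit; needs no trust in the prover."""
    p, q = cert
    return 1 < p < n and 1 < q < n and p * q == n

cert = untrusted_factor(91)
assert trusted_check(91, cert)       # a valid certificate verifies
assert not trusted_check(91, (2, 45))  # a bogus certificate is rejected
```

The point is the asymmetry: the checker is tiny and auditable, while the prover can be arbitrarily large, opaque, or untrustworthy.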
The core of our proposal is to put provable guardrails around dangerous technologies that AGI might seek to control. As many commenters have pointed out, we need to figure out what those are! Some current examples include DNA synthesis machines, nukes, military hardware of all kinds, drones capable of dispersing dangerous payloads, etc. And we need to create rules for which actions with these systems are safe! But we already do that today for human use: the CDC has detailed rules for biohazard labs. The initial implementation of our proposal will most likely be digitally checkable versions of those existing human rules, but we will certainly need to extend them to the new environment as AIs become more powerful. Any action remotely capable of causing human extinction should become a primary focus. I believe establishing the rules for high-risk actions should be an immediate priority for humanity, and extending those rules to other systems to enable human flourishing should be a close second.
We can use untrusted AIs for much of this work! We simply require them to accompany any designs or software they create with formal proofs of the required properties. These proofs can then be easily and rigorously checked, and the artifacts can be used without concern for the properties of the AI which generated them.
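As a tiny illustration of what "rigorously checked" means in practice, here is a trivial machine-checkable proof in Lean 4: the kernel accepts the theorem only if the supplied proof term is actually valid, no matter who or what produced it.

```lean
-- The Lean kernel re-checks this proof term from scratch; a wrong or
-- fabricated proof is simply rejected, so the prover needs no trust.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The same principle scales from this one-liner up to proofs about real software: checking is cheap and trustworthy even when proof generation is done by an opaque AI.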
Fortunately, for coarse “guardrails” the specs are pretty simple and can often be reused in many contexts. For example, all software we want to run should have proofs that: 1) there aren’t memory leaks, 2) there aren’t out-of-bounds memory accesses, 3) there aren’t race conditions, 4) there aren’t type violations, 5) there aren’t buffer overflows, 6) private information is not exposed by the program, 7) there aren’t infinite loops, etc. There should be a widely used “metaspec” for those criteria which most program-synthesis AIs will have to prove their generated code satisfies. Similarly, there are common constraints for many physical systems: e.g. robots, cars, planes, boats, etc. shouldn’t crash into things or harm humans. The more refined the rules are, the more subtle they become. To prevent existentially bad outcomes, I believe coarse constraints suffice. But we certainly eventually want much more refined models of the world and of the outcomes we seek.

I’m a fan of “Digital Twins” of physical systems, which allow rules and constraints to be run in simulation and can help in choosing specifications. We certainly want those simulations to be trusted, which can be achieved by proving that the code actually simulates the systems it claims to.

Eventually it would be great to have fully trusted AIs as well! Mechanistic Interpretability should be great for that, and I’m just now reading Anthropic’s recent nice advances in that area. If that work continues to make progress, it will make our lives much easier, but it doesn’t eliminate the need to ensure that misaligned AGI and malicious AGI don’t cause harm. The big win with the proof checking and the cryptographic hardware we propose is that we can ensure that even powerful systems will obey rules that humanity selects. If we don’t implement that kind of system (or something functionally equivalent), then there will be dangerous pathways which malicious AGI can exploit to cause great harm to humans.
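The “metaspec” idea above can be sketched very roughly in Python. This is only an illustration under assumptions: the property names and the `check_certificate` stub are hypothetical stand-ins for a real proof checker re-verifying supplied proof objects.

```python
# Sketch of a "metaspec" gate: AI-generated code is admitted for execution
# only if it carries a checked certificate for every required property.
# Property names and the certificate format are hypothetical.

METASPEC = [
    "no_out_of_bounds",
    "no_race_conditions",
    "no_buffer_overflows",
    "terminates",
]

def check_certificate(artifact: dict, prop: str) -> bool:
    # Stand-in for a real proof checker; here we just look the property
    # up in the artifact's set of certified properties.
    return prop in artifact.get("certified", set())

def admit(artifact: dict) -> bool:
    """Admit generated code only if every metaspec property is certified."""
    return all(check_certificate(artifact, p) for p in METASPEC)

good = {"code": "example", "certified": {"no_out_of_bounds",
        "no_race_conditions", "no_buffer_overflows", "terminates"}}
bad = {"code": "example", "certified": {"no_out_of_bounds"}}
assert admit(good)
assert not admit(bad)
```

The design point is that the gate itself stays small and fixed while the set of certified artifacts grows, mirroring the small-trusted-checker pattern from the MetaMath example.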