ryan_greenblatt comments on Davidad’s Bold Plan for Alignment: An In-Depth Explanation

ryan_greenblatt 13 Jan 2025 23:48 UTC
LW: 33 AF: 20
18
AF
At the time when I first heard this agenda proposed, I was skeptical. I remain skeptical, especially about the technical work that has been done thus far on the agenda^[1].

I think this post does a reasonable job of laying out the agenda and the key difficulties. However, when talking to Davidad in person, I’ve found that he often has more specific tricks and proposals than what was laid out in this post. I didn’t find these tricks moved me very far, but I think they were helpful for understanding what is going on.

This post and Davidad’s agenda overall would benefit from having concrete examples of how the approach might work in various cases, or more discussion of what would be out of scope (and why this could be acceptable). For instance, how would you make a superhumanly efficient (ASI-designed) factory that produces robots while proving safety? How would you allow for AIs piloting household robots to do chores (or is this out of scope)? How would you allow for the AIs to produce software that people run on their computers or to design physical objects that get manufactured? Given that this proposal doesn’t allow for safely automating safety research, my understanding is that it is supposed to be a stable end state. Correspondingly, it is important to know what Davidad thinks can and can’t be done with this approach.

My core disagreements are on the “Scientific Sufficiency Hypothesis” (particularly when considering computational constraints), “Model-Checking Feasibility Hypothesis” (and more generally on proving the relevant properties), and on the political feasibility of paying the needed tax even if the other components work out. It seems very implausible to me that making a sufficiently good simulation is as easy as building the Large Hadron Collider. I think the objection in this comment holds up (my understanding is Davidad would require that we formally verify everything on the computer).^[2]

As a concrete example, I found it quite implausible that you could construct and run a robot factory that is provably safe using the approach outlined in this proposal, and this sort of thing seems like a minimal thing you’d need to be able to do with AIs to make them useful.
1. ↩︎
  My understanding is that most technical work has been on improving mathematical fundamentals (e.g. funding logicians and category theorists to work on various things). I think it would make more sense to try to demonstrate overall viability with minimal prototypes that address key cruxes. I expect this to fail and thus it would be better to do this earlier. If you expect these prototypes would work, then it seems interesting to demonstrate this as many people would find this very surprising.
2. ↩︎
  This is mostly unrelated, but when talking with Davidad, I’ve found that a potential disagreement is that he’s substantially more optimistic about using elicitation to make systems that currently seem quite incapable (e.g., GPT-4) very useful. As a concrete example, I think we disagreed about the viability of running a fully autonomous Tesla factory for 1 year at greater than one-tenth productivity using just AI systems created prior to halfway through 2024. (I was very skeptical.) It’s not exactly clear to me how this is a crux for the overall plan (beyond getting a non-sabotaged implementation of simulations) given that we still are aiming to prove safety either way, and proving properties of GPT-4 is not clearly much easier than proving properties of much smarter AIs. (Apologies if I’ve just forgotten.)
What links here?
- Shallow review of technical AI safety, 2025 by technicalities (17 Dec 2025 18:18 UTC; 195 points)
- Shallow review of technical AI safety, 2025 by technicalities (16 Dec 2025 10:42 UTC; 6 points)