Joe Rogero comments on The title is reasonable

Joe Rogero 22 Sep 2025 15:41 UTC
3 points
By “defense-in-depth” I mean at every step, make decisions and design choices that increase the likelihood of the model “wanting” (in the book sense) to not harm (or kill) humans (or to circumvent our safeguards).
Can you give an example or three of such a decision or design choice?
In my model of the situation, the field of AI research mostly does not know how specific decisions and design choices affect the inner drives of AIs. External behavior in specific environments can be nudged around, but inner motivations largely remain a mystery. So I’m highly skeptical that researchers can exert much deliberate causal influence on those motivations.
A related possible crux is that, while
An AI having and optimizing various real-world preferences is a good map for predicting its behavior in many cases.
...I don’t think it’s a good map for predicting behavior in the cases that matter most, in part because those cases tend to occur at extremes. And even if it were, to the extent that current AIs seem to be optimizing for real-world preferences at all, those preferences don’t seem to be very nice ones; see for example the tendency to feed the delusions of psychotic people while elsewhere claiming that’s a bad thing to do.