Thanks for clarifying. It still seems that we'd encounter the same sort of problem even in the short term, though? Take the case of a programmer hijacking the input medium. Does the AI care? It's still getting instructions to follow. To what extent is it modeling the real humans on the other end? You touch on this in Defining the Principal(s) and jailbreaking, but it seems like it should be even more of a problem for the approach. Like, an AI that can robustly navigate that challenge, to the point of being more or less immune to intercepts, seems hard to distinguish from one that is (a) long-term aligned as well, and (b) possessed of deadly competence at world-modeling even if not long-term aligned. An AI that can't handle this problem... well, is it really intent-aligned? Where else does its understanding of the developers break down?