Thanks for citing my work! I feel compelled to respond because I think you’re misunderstanding me a little.
I agree that long-term intent alignment is pretty much incoherent, because people don't have much in the way of long-term intentions. I guess the exception would be to collapse it to following intent only when it exists, that is, when someone does form a specific intent.
In my work, by intent alignment I mean personal short-term intent, which is pretty much following instructions as they were intended. That seems coherent (although not without problems).
I use it that way because others seem to as well. Perhaps that’s because the broader use is incoherent. It usually seems to mean “does what some person or limited group wants it to do” (with “in the short term” often implied).
The original definition of intent alignment is the broadest I know of, more-or-less doing something people want for any reason. Evan Hubinger defined it that way, although I haven’t seen that definition get much use.
For all of this, see “Conflating value alignment and intent alignment is causing confusion.” I might not have been clear enough in stressing that I drop the “personal short-term” qualifier but still mean it when saying intent alignment. I’m definitely not always clear enough.
Thanks for clarifying. It still seems that we’d encounter the same sort of problem even in the short term, though? Take the case of a programmer hijacking the input medium. Does the AI care? It’s still getting instructions to follow. To what extent is it modeling the real humans on the other end? You touch on this in Defining the Principal(s) and jailbreaking, but it seems like it should be even more of a problem for the approach. An AI that can robustly navigate that challenge, to the point of being more or less immune to intercepts, seems hard to distinguish from one that is (a) long-term aligned as well, and (b) if not long-term aligned, possessed of deadly competence at world-modeling. An AI that can’t handle this problem... well, is it really intent-aligned? Where else does its understanding of the developers break down?