Amusingly, just after I posted this, Anthropic released “Teaching Claude why,” which has a bunch of information on how they behaviorally align their AIs (or at least on how they iterate on particular behaviors/properties). (Though this doesn’t seem close to sufficient for a reasonably complete understanding.)
LLM-agents are sorta good at tasks when given a lot of examples of those tasks, including on some related out-of-distribution (OOD) tasks. Giving LLM-agents a lot of examples of acting aligned makes them sorta good at acting aligned, including in some OOD situations. My takeaway is that treating alignment as a task is partially effective. Unfortunately, I don’t think that yields any more mechanistic explanation of why LLM-agents are sorta good at tasks, or of whether alignment is a safe or suitable task. Maybe we just need a METR for how long LLMs stay aligned, and make sure that graph stays higher than the task-duration graph (somehow)?
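To make that comparison concrete, here is a minimal sketch of what the check might look like. Everything in it is hypothetical: the model names, the numbers, and the idea of an "aligned horizon" measured in hours are placeholders for whatever an actual METR-style evaluation would produce. The only point is the comparison itself: flag any release where the alignment horizon no longer exceeds the task-duration horizon.

```python
# Illustrative sketch only: all numbers and names below are made up,
# not real METR data. The idea is to track two time-horizon curves per
# model release and flag any release where the "stays aligned" horizon
# drops to or below the "can complete tasks" horizon.

# Hypothetical horizons in hours, keyed by model release.
task_horizon = {"model-v1": 0.5, "model-v2": 2.0, "model-v3": 8.0}
aligned_horizon = {"model-v1": 4.0, "model-v2": 6.0, "model-v3": 7.0}

def alignment_margin(release: str) -> float:
    """Hours of headroom the alignment horizon has over the task horizon."""
    return aligned_horizon[release] - task_horizon[release]

for release in task_horizon:
    margin = alignment_margin(release)
    status = "OK" if margin > 0 else "WARNING: task horizon exceeds alignment horizon"
    print(f"{release}: margin = {margin:+.1f} h ({status})")
```

On these made-up numbers the third release would trip the warning, which is exactly the crossover the "keep that graph higher" suggestion is meant to rule out.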
That sounds appealing if almost nothing ever changes, but we should expect discontinuities from any of the following:
The emergence of a new architecture / training regime
Gaining the ability to self-modify
The results of self-modification
Gaining the ability to permanently elude oversight
The consequences of acting at length with no oversight
Gaining the ability to disempower humanity
Achieving a deep and unrecognizable understanding of any one of many concepts important to humans (ontological shift)
We just don’t understand any of this stuff. Even a perfectly benevolent superintelligent mind may be one digital prion away from wiping us out, and it would be great not to get into that situation in the first place.