We don’t know how AIs are aligned.
A somewhat crazy aspect of the current situation is that we have very little confirmed public information about why frontier AIs end up being apparently behaviorally aligned. More generally, we don’t know which factors in training matter most for (behavioral) alignment. Which training interventions result in Anthropic AIs following the constitution, or make OpenAI AIs follow the spec? Which factors tend to make them follow the constitution/spec less (or cause various specific misaligned behaviors)? It’s not that hard to get an OK sense of roughly what is going on from speculation, rumors, and non-public info, but the result is a much worse public understanding of alignment. I think AI companies should be much more transparent about this. (It’s presumably not in their commercial interest to do this unilaterally, and it’s not obvious that even an idealized altruistic AI company should unilaterally release this information if it couldn’t get other AI companies to do the same.)
Amusingly, just after I posted this, Anthropic released “Teaching Claude why”, which has a bunch of information on how they behaviorally align their AIs (or at least how they iterate on particular behaviors/properties). (Though this doesn’t seem close to sufficient for a reasonably complete understanding.)
LLM-agents are sorta good at tasks when given a lot of examples of those tasks, including some related out-of-distribution (OOD) tasks. Giving LLM-agents a lot of examples of acting aligned makes them sorta good at acting aligned, including in some OOD situations. My takeaway is that treating alignment as a task is partially effective. Unfortunately, I don’t think that yields any more mechanistic explanation of why LLM-agents are sorta good at tasks, or tells us whether alignment is a safe or suitable task. Maybe we just need a METR-style measure of how long LLMs stay aligned, and to make sure that graph stays higher than the task-duration graph (somehow)?
That sounds appealing if almost nothing ever changes, but we should expect discontinuities from any or all of the following:
The emergence of a new architecture / training regime
Gaining the ability to self-modify
The results of self-modification
Gaining the ability to permanently elude oversight
The consequences of acting at length with no oversight
Gaining the ability to disempower humanity
Achieving a deep and unrecognizable understanding of any one of many concepts important to humans (ontological shift)
We just don’t understand any of this stuff. Even a perfectly benevolent superintelligent mind may be one digital prion away from wiping us out, and it would be great not to get into that situation in the first place.
It sure would be great if there were a well-funded “AI safety” nonprofit with the goal of being as open as possible about AI research. Has anyone considered that?