I keep running into conceptual confusion around the term “alignment,” particularly when reading older Less Wrong posts. Some people say “aligned AI” and mean an AI that works for human flourishing; some say an AI “is aligned” if it reliably advances the intended objectives of some person or group (and doesn’t have a secret set of goals / isn’t scheming); and still others use “alignment” to mean something like “the ability of any system to reliably work toward some pre-defined goal.” I usually have to work out on the spot which sense is meant, which is annoying given that the implications of each are very different.
Is there one commonly accepted definition? Is this confusion just a thing we’ve all accepted?
As Raemon put it:
You need to successfully point the AI at anything at all. (This may superficially seem like it’s working with current LLMs, but it isn’t actually anywhere close to robust enough to hold up)
You need to point the AI at some kind of nuanced abstract target, in particular, that remains stable as the AI updates its ontology.
(You also eventually need to point the AI at a cluster of messy human-value-concepts in particular. Though from what I gather, MIRI-ish people think if you get the first two things, this last part isn’t actually that hard)
An aligned AI is one that humans have successfully pointed at a goal. If mankind does solve alignment, then a power struggle over which goals the AI serves could shape the world. Otherwise the AI pursues goals that mankind never set, and humans end up wiped out or disempowered.
Gotcha. Is there a strong reason to assume that we’ll succeed at creating AIs that can be pointed at a single target? I read this post and comment a while back and would love your thoughts.