Outer alignment seems to be defined as models/AI systems optimizing something that is very close or identical to what they were programmed to do or what humans desire. Inner alignment seems to relate more to the goals/aims of a delegated optimizer that an AI system spawns in order to solve the problem it is tasked with.
This is not how I (or most other alignment researchers, I think) usually think about these terms. Outer alignment means that your loss function describes something that’s very close to what you actually want; in other words, it means SGD is optimizing for the right thing. Inner alignment means that the model you train is optimizing for its loss function. If your AI system creates new AI systems itself, maybe you could call that inner alignment too, but it’s not the prototypical example people mean.
In particular, “optimizing something that is very close [...] to what they were programmed to do” is inner alignment the way I would use those terms. An outer alignment failure would be if you “program” the system to do something that’s not what you actually want (though “programming” is a misleading word for ML systems).
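To make the distinction above concrete, here is a minimal toy sketch (all names, numbers, and the linear model are hypothetical illustrations, not anything from the post): the proxy loss that SGD minimizes plays the role of the outer objective, and inner alignment is the question of whether the trained model still behaves as if it were minimizing that loss off the training distribution.

```python
# Toy sketch of the outer/inner split; everything here is a
# hypothetical illustration, not anyone's actual method.
import numpy as np

rng = np.random.default_rng(0)

# What we actually want (assume the true target is y = 2*x, for all x).
def true_objective(model, xs):
    return np.mean((model(xs) - 2.0 * xs) ** 2)

# Outer objective: a proxy loss on a narrow training distribution.
# Outer alignment asks whether minimizing this tracks true_objective.
train_x = rng.uniform(0.0, 1.0, 256)
train_y = 2.0 * train_x

def proxy_loss(w):
    return np.mean((w * train_x - train_y) ** 2)

# Gradient descent on the proxy loss: the optimizer that outer
# alignment is a statement about.
w = 0.0
for _ in range(500):
    grad = np.mean(2.0 * (w * train_x - train_y) * train_x)
    w -= 0.1 * grad

model = lambda xs: w * xs

# Inner alignment asks whether the trained model is still "optimizing
# for" the proxy objective outside the training distribution, rather
# than pursuing some other learned objective that merely agreed with
# it during training.
test_x = rng.uniform(5.0, 10.0, 256)
print(f"proxy loss on train: {proxy_loss(w):.6f}")
print(f"true objective off-distribution: {true_objective(model, test_x):.6f}")
```

In this linear toy both checks come out fine (the fitted slope converges to 2), so it only shows where the two questions attach: outer alignment lives in the choice of proxy_loss versus true_objective, and inner alignment in the off-distribution comparison at the end.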
[Ontology identification] seems neither necessary nor sufficient [… for] safe future AI systems
FWIW, I agree with this statement (even though I wrote the ontology identification post you link and am generally a pretty big fan of ontology identification and related framings). Few things are necessary or sufficient for safe AGI. In my mind, the question is whether something is a useful research direction.
It does not seem necessary for AI systems to do this in order to be safe. This is trivially true, as we already have very safe but complex ML models that work in very abstract spaces.
This seems to apply to any technique we aren’t already using, so it feels like a fully general argument against the need for new safety techniques. (Maybe you’re only using this to argue that ontology identification isn’t strictly necessary, in which case I and probably most others agree, but as mentioned above that doesn’t seem like the key question.)
More importantly, I don’t see how “current AI systems aren’t dangerous even though we don’t understand their thoughts” implies “future more powerful AGIs won’t be dangerous”. IMO the reason current LLMs aren’t dangerous is clearly their lack of capabilities, not their amazing alignment.
Thanks for the comment, Erik (and for taking the time to read the post).
I generally agree with you re: the inner/outer alignment comment I made. But the language I used, and that others also use, continues to be vague; the working definition of inner alignment on lesswrong.com asks, if an “optimizer is the product of an outer aligned system, [...] whether that optimizer is itself aligned”. I see little difference, but I could be persuaded otherwise.
My post was meant to show that it’s pretty easy to find significant holes in some of the most central concepts being researched now. This includes eclectic but also mainstream research, including the entire latent-knowledge approach, which seems to make significant assumptions about the relationship between human decision-making or intent and super-human AGIs. I work a lot on this concept and hold (perhaps too) many opinions.
The tone might not have been ideal due to time limits. Sorry if that was off-putting.
I was also trying to make the point that we do not spend enough time shopping our ideas around, especially with basic-science researchers, before we launch our work. I am a bit guilty of this. And I worry a lot that I’m actually contributing to capabilities research rather than long-term AI safety. I guess in the end I hope for a way for AI-safety researchers and science researchers to interact more easily and develop ideas together.