This a good question. Inner alignment definitely is meant to refer to a type of robustness problem—it’s just also definitely not meant to refer to the entirety of robustness. I think there are a couple of different levels on which you can think about exactly what subproblem inner alignment is referring to.
First, the definition that’s given in “Risks from Learned Optimization”—where the term inner alignment comes from—is not about competence vs. intent robustness, but is directly about the objective that a learned search algorithm is searching for. Risks from Learned Optimization broadly takes the position that though it might not make sense to talk about learned models having objectives in general, it certainly makes sense to talk about a model having an objective if it is internally implementing a search process, and argues that learned models internally implementing search processes (which the paper calls mesa-optimizers) could be quite common. I would encourage reading the full paper to get a sense of how this sort of definition plays out.
Second, that being said, I do think that the competence vs. intent robustness framing that you mention is actually a fairly reasonable one. “2-D Robustness” presents the basic picture here, though in terms of a concrete example of what robust capabilities without robust alignment could actually look like, I am somewhat partial to my maze example. I think the maze example in particular presents a very clear story for how capability and alignment robustness can come apart even for agents that aren’t obviously running a search process. The 2-D robustness distinction is also the subject of this alignment newsletter, which I’d also highly recommend taking a look at, as it has some more commentary on thinking about this sort of a definition as well.
This a good question. Inner alignment definitely is meant to refer to a type of robustness problem—it’s just also definitely not meant to refer to the entirety of robustness. I think there are a couple of different levels on which you can think about exactly what subproblem inner alignment is referring to.
First, the definition that’s given in “Risks from Learned Optimization”—where the term inner alignment comes from—is not about competence vs. intent robustness, but is directly about the objective that a learned search algorithm is searching for. Risks from Learned Optimization broadly takes the position that though it might not make sense to talk about learned models having objectives in general, it certainly makes sense to talk about a model having an objective if it is internally implementing a search process, and argues that learned models internally implementing search processes (which the paper calls mesa-optimizers) could be quite common. I would encourage reading the full paper to get a sense of how this sort of definition plays out.
Second, that being said, I do think that the competence vs. intent robustness framing that you mention is actually a fairly reasonable one. “2-D Robustness” presents the basic picture here, though in terms of a concrete example of what robust capabilities without robust alignment could actually look like, I am somewhat partial to my maze example. I think the maze example in particular presents a very clear story for how capability and alignment robustness can come apart even for agents that aren’t obviously running a search process. The 2-D robustness distinction is also the subject of this alignment newsletter, which I’d also highly recommend taking a look at, as it has some more commentary on thinking about this sort of a definition as well.
Great answer, thanks!