There is a lot of economic value in training models to solve tasks that involve influencing the world over long horizons, e.g. an AI CEO. Tasks like these explicitly incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
There are two glaring omissions from the article’s discussion on this point...
1. In addition to resource acquisition and power seeking, the model will attempt to “align” all other cognitive agents, including humans. This means it will not report its research findings honestly, and will claim, in ways subtle enough to be believed, that avenues of investigation that might run counter to its goals are invalid.
2. Even if the model is sufficiently aligned that it seeks only goals humans want, and is trained to avoid resource acquisition and power seeking (constraints that seem to me, and will seem to it, rather foolish limits on its ability to realize the goal), it will still be free to subvert any and all conversations humans have with it, however unrelated those conversations might seem to us (the SAI will see relations we don’t).