I guess I’m more interested in modeling cases where AI is important or risky because it pursues convergent instrumental goals. If there’s a setting without convergent instrumental goals, there might be a generalization, but it’s less clear.
With single-turn question answering, one could ask about accuracy, which would rate a system giving intentionally wrong answers as unintelligent. The thing I meant to point to with AIXI is that it would be nice to have a measure of intelligence for misaligned systems, rather than declaring them unintelligent because they don’t satisfy the stated objective (reward maximization, question answering, etc.). If there is a possible “intelligent misalignment” in the single-turn case (e.g., question answering), then there might be a corresponding intelligence metric that accounts for intelligently misaligned systems.
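To gesture at what such a metric might look like (purely my own sketch, not something established above, and assuming a Legg–Hutter-style setup where E is a class of environments, K(\mu) is the Kolmogorov complexity of environment \mu, G is some class of candidate goals, and V^{\pi}_{\mu,g} is the expected return of policy \pi in \mu under goal g): one could score a policy by the best value it achieves under any goal, rather than under the designated one,

% hypothetical goal-agnostic variant of the Legg--Hutter measure
\Upsilon'(\pi) \;=\; \sup_{g \in G} \; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu, g}.

Under a measure of this shape, a question-answering system that gives intentionally wrong answers would still register as intelligent if its answers are well optimized for some goal g, instead of being scored as unintelligent for missing the stated objective.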