The parenthetical was meant to argue that almost all games are single-turn, and that multi-turn games are single-turn games with an extra property. I have found several definitions that can be gainfully refactored in terms of the theorems they lead to, but never did they thereby become less general.
Mayyybe you can get away with claiming that the thing you’re looking for only typechecks for multi-turn games, in which case I’d go looking for the definitions that become available there. Namely, hmm. For what properties of world-states does there exist a player such that the property remains true over time? Which players does one get that way?
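To make that question concrete, here's a minimal sketch (all names are mine, purely illustrative): model a multi-turn game as a transition function over world-states, and ask whether a given player (policy) keeps a given property of world-states true over time.

```python
# Illustrative sketch: a multi-turn game as a transition function over
# world-states, plus a check for whether a player preserves a property.
# All names here are hypothetical, not from any standard library.

def preserves(prop, policy, step, state, horizon=100):
    """True iff `prop` holds at every state reached when `policy`
    plays from `state` for `horizon` turns under transition `step`."""
    for _ in range(horizon):
        if not prop(state):
            return False
        state = step(state, policy(state))
    return prop(state)

# Toy instance: states are integers, the property is non-negativity.
step = lambda s, a: s + a      # action is just an increment
careful = lambda s: 1          # always moves up: preserves s >= 0
reckless = lambda s: -2        # always moves down: eventually violates it
```

Here `preserves(lambda s: s >= 0, careful, step, 0)` holds while the same check for `reckless` fails; the question in the paragraph above is then: for which `prop` does *some* `policy` exist making `preserves` true, and what do those policies look like?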
I guess I’m more interested in modeling cases of AI importance/risk on account of pursuit of convergent instrumental goals. If there’s a setting without convergent instrumental goals, there might be a generalization, but it’s less clear.
With single-turn question answering, one could ask about accuracy, which would rate a system giving intentionally wrong answers as unintelligent. The thing I meant to point to with AIXI is that it would be nice to have a measure of intelligence for misaligned systems (rather than declaring them unintelligent because they don’t satisfy the objective, like reward maximization / question answering / etc.). If there is a possible “intelligent misalignment” in the single-turn case (e.g. question answering), then there might be a corresponding intelligence metric that accounts for intelligent misaligned systems.