Here are some important-seeming properties to illustrate what I mean:
Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. For this to matter for RSI, however, those concepts also need to come into play appropriately when the AI reasons about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs illustrates this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
Propagation of beliefs: When the AI knows something, it should know it in a way which integrates well with everything else it knows, rather than easily displaying the knowledge in one context while seeming to forget it in another.
Preference for reasons over rationalizations: An AI should be ready and eager to correct its mistakes, rather than rationalizing its wrong answers. It should be truth-seeking, following thoughts where they lead instead of planning ahead to justify specific answers. When the two conflict, it should prefer valid proof steps over arriving at a desired answer.
Knowing the limits of its knowledge: Metacognitive awareness of what it knows and what it doesn’t know, appropriately brought to bear in specific situations. The current AI paradigm just has one big text-completion probability distribution, so there’s not a natural way for it to distinguish between uncertainty about the underlying facts and uncertainty about what to say next—hence we get hallucinations.
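One crude way to operationalize that distinction, offered only as a sketch: resample the model's answer several times and measure agreement on the answer itself, rather than looking at raw token-level entropy (which also counts harmless variation in phrasing). The `sample_answers` call mentioned below is a hypothetical placeholder, not any particular API.

```python
from collections import Counter

def answer_agreement(sampled_answers):
    """Fraction of resampled answers that match the most common one.

    Low agreement across independent resamples is a rough proxy for
    uncertainty about the underlying fact, whereas raw token-level
    entropy also counts harmless variation in phrasing.
    """
    counts = Counter(a.strip().lower() for a in sampled_answers)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(sampled_answers)

# Hypothetical usage: `sample_answers(prompt, n)` stands in for whatever
# API draws n independent completions of the same question.
# samples = sample_answers("In what year did X happen?", n=5)
samples = ["1947", "1953", "1947", "1962", "1959"]  # illustrative values only
agreement = answer_agreement(samples)
print(f"agreement = {agreement:.2f}")
if agreement < 0.6:
    print("Low agreement across resamples; treat the answer as a likely guess.")
```

This does not solve the underlying problem, but it illustrates what it would mean for the model to track uncertainty about the fact separately from uncertainty about the next token.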
All of this is more-or-less a version of the metaphilosophy research agenda, framed in terms of current events in AI.
I very much like the concreteness here.
I consider these sorts of things just fundamental epistemic problems, or basic skills that good researchers should have. All superforecasters should be very familiar with issues 2-4, and most probably couldn’t define metaphilosophy. I don’t see the need to be fancy about it.
On that note, I’ll hypothesize that if we were to make benchmarks for any of these items, it would be fairly doable to make AIs that do better than humans on them, and later to push those scores higher and higher. I have a hard time imagining tests here that I would feel confident would not get beaten, if there were sufficient money on the line, in the next year or two.
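To gesture at what one such benchmark item could look like (a minimal sketch, not a proposal; `query_model` is just a placeholder for whichever model API you would call): a "propagation of beliefs" item might ask for the same fact directly and embedded in an unrelated task, and score consistency between the two answers.

```python
def consistent(answer_a: str, answer_b: str) -> bool:
    """Crude string match; a real benchmark would use a grader model or rubric."""
    return answer_a.strip().lower() == answer_b.strip().lower()

def score_item(query_model, direct_question: str, disguised_question: str) -> int:
    """1 if the model gives the same answer whether the fact is asked about
    directly or buried in an unrelated-looking task, else 0."""
    return int(consistent(query_model(direct_question), query_model(disguised_question)))

# Example usage with a trivial stand-in model (it answers "340" to everything):
fake_model = lambda prompt: "340"
print(score_item(
    fake_model,
    "What is the approximate speed of sound in air at sea level, in m/s?",
    "Fill in the blank: SPEED_OF_SOUND_M_PER_S = ___  # air, sea level",
))
```

A benchmark built from items like this seems like exactly the sort of thing that would get beaten quickly with enough money on the line, which is the point.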