It seems like this is the sort of thing you could only ever learn by learning about the real world first.
Yep. The idea is to try and get a system that develops all practically useful “theoretical” abstractions, including those we haven’t discovered yet, without developing desires about the real world. So we train some component of it on the real-world data, then somehow filter out “real-world” stuff, leaving only a purified superhuman abstract reasoning engine.
One of the nice-to-have properties here would be if we don’t need to be able to interpret its world-model to filter out the concepts – if, in place of human understanding and judgement calls, we can blindly use some ground-truth-correct definition of what is and isn’t a real-world concept.
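For concreteness, here is a minimal toy sketch of that “train, then filter” shape. Everything in it is hypothetical: the concept directions, the real-world labels, and the filtering step are stand-ins (the filter here is plain linear projection onto the orthogonal complement, not any specific proposed method), since the open question is precisely what a ground-truth-correct “real-world concept” test would be.

```python
# Hypothetical sketch of the "train on real-world data, then filter out
# real-world concepts" pipeline. Nothing here is a known method: the
# concept directions, the is_real_world oracle, and the ablation are all
# illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained world-model: a set of learned "concept" directions
# in a latent space (in reality these would emerge from real-world training).
latent_dim = 16
concept_directions = rng.normal(size=(8, latent_dim))

# Stand-in for the hoped-for ground-truth test of "real-world-ness". The
# whole point of the proposal is that this would NOT require human
# interpretation of the world-model; here we simply fake it with labels.
is_real_world = rng.random(8) < 0.5

def ablate(latent: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Project `latent` onto the subspace orthogonal to span(directions)."""
    if directions.shape[0] == 0:
        return latent
    q, _ = np.linalg.qr(directions.T)   # orthonormal basis of the span
    return latent - q @ (q.T @ latent)

# "Purify" the model by removing every concept flagged as real-world,
# leaving only the abstract-reasoning residue.
real_world_dirs = concept_directions[is_real_world]
x = rng.normal(size=latent_dim)           # some activation in the world-model
x_purified = ablate(x, real_world_dirs)

# Each flagged direction now has ~zero presence in the purified activation.
print(x_purified @ real_world_dirs.T)
```

The linear-erasure step is only there to make the pipeline concrete; whether real-world content in an actual world-model is linearly separable from abstract content at all is exactly the kind of assumption the dialogue is probing.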