Random question: What’s the relationship between the natural abstractions thesis and instrumental convergence? If many agents find particular states instrumentally useful, then surely that implies that the abstractions that would best aid them in reasoning about the world would mostly focus on stuff related to those states.
Like if you mostly find being in the center of an area useful, you’re going to focus on abstractions that measure how far you are from the central point rather than, say, the colour of the area you’re in.
Edit: In which case, does instrumental convergence imply the natural abstractions thesis?
Yes, I think this is right. It’s been pointed out elsewhere that feature universality in neural networks could be an instance of instrumental convergence, for example. And if you think about it, to the extent that a “correct” model of the universe exists, then capturing that world-model in your reasoning should be instrumentally useful for most non-trivial terminal goals.
We’ve focused on simple gridworlds here, partly because they’re visual, but also because they’re tractable. But I suspect there’s a mapping between POWER (in the RL context) and generalizability of features in NNs (in the context of something like the circuits work linked above). This would be really interesting to investigate.
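To make the gridworld intuition concrete, here is a minimal sketch of estimating POWER in a 1-D gridworld. It uses a simplification of the formal definition: POWER(s) is approximated as the optimal value V*(s) averaged over reward functions sampled uniformly at random (the actual definition normalizes differently, and this ignores the discount-dependent scaling). All function names and parameters here are illustrative, not from any existing codebase.

```python
import numpy as np

def power_estimate(n=5, gamma=0.9, n_samples=500, seed=0):
    """Monte Carlo proxy for POWER on a 1-D gridworld with n states.

    Approximates POWER(s) as E_R[V*_R(s)], averaging the optimal value
    of each state over reward functions drawn uniformly from [0, 1]^n.
    Actions are: move left, stay, move right (moves off the edge stay put).
    """
    rng = np.random.default_rng(seed)
    totals = np.zeros(n)
    idx = np.arange(n)
    for _ in range(n_samples):
        r = rng.uniform(0.0, 1.0, size=n)  # one sampled reward function
        v = np.zeros(n)
        for _ in range(200):  # value iteration to (near) convergence
            left = v[np.clip(idx - 1, 0, n - 1)]
            right = v[np.clip(idx + 1, 0, n - 1)]
            v = r + gamma * np.maximum.reduce([left, v, right])
        totals += v
    return totals / n_samples

power = power_estimate()
```

Central states can reach any high-reward cell in fewer steps than edge states can, so their average optimal value (and hence this POWER proxy) comes out higher, matching the "center of an area" intuition above.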