This is helpful, thanks. It sounds very reasonable to say “if it’s just programmed to build a model and query it, it doesn’t matter if it’s self-aware”. And it might be true, although I’m still a bit uncertain about what can happen when the model-builder includes itself in its models. There are also questions of what properties can be easily and rigorously verified. My hope here is that we can flag some variables as “has information about the world-model” and other variables as “has information about oneself”, and we can do some kind of type-checking or formal verification that they don’t intermingle. If something like that is possible, it would seem to be a strong guarantee of safety even if we didn’t understand how the world-modeler worked in full detail.
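To make the type-checking hope a bit more concrete, here is a minimal Python sketch of the kind of separation I have in mind. Everything in it is an assumption for illustration: the names `World`, `Self`, `Tagged`, `world`, `self_info`, and `combine` are hypothetical, and it assumes that values can only be combined through `combine`, which is much simpler than whatever a real world-modeler would do.

```python
from dataclasses import dataclass
from typing import Generic, Type, TypeVar


class World:
    """Marker: the value carries information about the world-model."""


class Self:
    """Marker: the value carries information about the system itself."""


# D may only ever be World or Self (a constrained type variable).
D = TypeVar("D", World, Self)


@dataclass(frozen=True)
class Tagged(Generic[D]):
    """A value labeled with the domain its information came from."""
    domain: Type[D]
    value: float


def world(value: float) -> Tagged[World]:
    # Hypothetical constructor for world-model information.
    return Tagged(World, value)


def self_info(value: float) -> Tagged[Self]:
    # Hypothetical constructor for information about oneself.
    return Tagged(Self, value)


def combine(a: Tagged[D], b: Tagged[D]) -> Tagged[D]:
    """Combine two values, but only if they belong to the same domain."""
    # Runtime fallback check, in case no static checker was run.
    if a.domain is not b.domain:
        raise TypeError("world-model and self information must not intermingle")
    return Tagged(a.domain, a.value + b.value)


if __name__ == "__main__":
    print(combine(world(1.0), world(2.0)).value)
    try:
        combine(world(1.0), self_info(2.0))
    except TypeError as err:
        print("rejected:", err)
```

With these annotations a static checker such as mypy should reject the mixed call before the program runs, and the runtime check is only a backstop. Whether anything like this scales to a learned world-model, where the "variables" are not cleanly separated to begin with, is exactly the part I am uncertain about.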
RE your last paragraph: I don’t think there will ever be a point at which we have a safe AI and no one is incentivized (or even just curious) to explore alternative designs that are not known to be safe (but which would be more powerful if they worked). So we need to get to some point of development, and then sound the buzzer and start relying 100% on other solutions, whether it’s OpenAI becoming our benevolent world dictators, or hoping that our AI assistants will tell us what to do next, or who knows what. I think an oracle that can answer arbitrary questions and invent technology is good enough for that. Once we’re there, I think we’ll be more than ready to move to that second stage...