Here’s a more general way of thinking about what you’re saying that I find useful: It’s not that self-awareness is the issue per se, it’s that you’re avoiding building an agent—by a specific technical definition of “agent.”
Agents, in the sense I think is most useful when thinking about AI, are things that choose actions based on the predicted consequences of those actions.
On some suitably abstract level of description, agents must have available actions, they must have some model of the world that includes a free parameter for different actions, and they must have a criterion for choosing actions that’s a function of what the model predicts will happen when it takes those actions. Agents are what is dangerous, because they steer the future based on their criterion.
What you describe in this post is an AI that has actions (outputting text to a text channel), and has a model of the world. But maybe, you say, we can make it not an agent, and therefore a lot less dangerous, by making it so that there is no free parameter in the model for the agent to try out different actions, and instead of choosing its action based on consequences, it will just try to describe what its model predicts.
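To make the distinction concrete, here's a toy sketch (all function names and the toy model are made up for illustration): the agent searches over its available actions using the model's predictions, while the non-agent predictor just reports what the model says, with no action parameter to vary.

```python
# Toy world model: the predicted outcome depends on which action is plugged in.
# Passing action=None represents "no action parameter" (the predictor case).
def toy_model(action):
    outcomes = {"make_paperclips": 100, "write_report": 1, None: 0}
    return {"paperclips": outcomes[action]}

def agent_choose(actions, model, criterion):
    """An agent: picks the action whose predicted consequences score best."""
    return max(actions, key=lambda a: criterion(model(a)))

def predictor_output(model):
    """A non-agent: reports the model's prediction; no free action parameter."""
    return model(None)

# The agent steers toward whatever its criterion favors...
print(agent_choose(["make_paperclips", "write_report"], toy_model,
                   lambda outcome: outcome["paperclips"]))  # make_paperclips
# ...while the predictor just describes what the model predicts.
print(predictor_output(toy_model))  # {'paperclips': 0}
```

The danger lives entirely in `agent_choose`: it is the argmax over actions that makes the system steer the future, not anything about what the model knows.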
Thinking about it in terms of agents like this explains why “knowing that it’s running on a specific computer” has the causal powers that it does—it’s a functional sort of “knowing” that involves having your model of the world impacted by your available actions in a specific way. Simply putting “I am running on this specific computer” into its memory wouldn’t make it an agent—and if it chooses what text to output based on predicted consequences, it’s an agent whether or not it has “I am running on this specific computer” in its memory.
So, could this work? Yes. It would require a lot of hard, hard work on the input/output side, especially if you want reliable natural language interaction with a model of the entire world, and you still have to worry about the inner optimizer problem, particularly e.g. if you’re training it in a way that creates an incentive for self-fulfilling prophecy or some other implicit goal.
The basic reason I’m pessimistic about the general approach of figuring out how to build safe non-agents is that agents are really useful. If your AI design requires a big powerful model of the entire world, that means that someone is going to build an agent using that big powerful model very soon after. Maybe this tool gives you some breathing room by helping suppress competitors, or maybe it makes it easier to figure out how to build safe agents. But it seems more likely to me that we’ll get a good outcome by just directly figuring out how to build safe agents.
This is helpful, thanks. It sounds very reasonable to say “if it’s just programmed to build a model and query it, it doesn’t matter if it’s self-aware”. And it might be true, although I’m still a bit uncertain about what can happen when the model-builder includes itself in its models. There are also questions of what properties can be easily and rigorously verified. My hope here is that we can flag some variables as “has information about the world-model” and other variables as “has information about oneself”, and we can do some kind of type-checking or formal verification that they don’t intermingle. If something like that is possible, it would seem to be a strong guarantee of safety even if we didn’t understand how the world-modeler worked in full detail.
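The flagging idea might look something like the following sketch (everything here is hypothetical: the wrapper types, the check, and the query function are illustrations, not a real verification scheme). The point is that if world-model information and self-information carry distinct types, then "they don't intermingle" becomes a property a type checker or a simple runtime guard can enforce, rather than something you'd have to establish by understanding the model's internals.

```python
# Hypothetical sketch: tag data with distinct types so information about the
# world and information about the system itself cannot mix without an
# explicit, auditable conversion step.
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldInfo:
    value: object  # information about the external world-model

@dataclass(frozen=True)
class SelfInfo:
    value: object  # information about the system itself

def query_world_model(fact: WorldInfo) -> WorldInfo:
    """Query paths may only consume and produce world-model information."""
    if not isinstance(fact, WorldInfo):
        raise TypeError("self-information may not enter the world-model query path")
    return WorldInfo(f"model answer about {fact.value}")

print(query_world_model(WorldInfo("weather")).value)
try:
    query_world_model(SelfInfo("my own source code"))
except TypeError as err:
    print(err)  # the guard rejects self-information
```

A static checker like mypy would flag the second call before the program even runs; the runtime guard is belt-and-suspenders for the same invariant.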
RE your last paragraph: I don’t think there is any point ever when we will have a safe AI and no one is incentivized (or even curious) to explore alternate designs that are not known to be safe (but which would be more powerful if they worked). So we need to get to some point of development, and then sound the buzzer and start relying 100% on other solutions, whether it’s OpenAI becoming our benevolent world dictators, or hoping that our AI assistants will tell us what to do next, or who knows what. I think an oracle that can answer arbitrary questions and invent technology is good enough for that. Once we’re there, I think we’ll be more than ready to move to that second stage...