I wrote more on this here; some new arguments start with the third paragraph. In particular, the framing I'm discussing is not LLM-specific; LLMs are just a natural example of it. What caused me to notice this framing was not LLMs but decision theory: the mostly-consensus "algorithm" axis of classifying how to think about the entities that make decisions, as platonic algorithms rather than as particular concrete implementations.
the possibility that the “mask” is itself deceptive
In this case, there are now three entities: the substrate, the deceptive mask, and the role played by the deceptive mask. Each is potentially capable of defeating the others, if the details align favorably and the others lack comprehension of the situation.
you can’t fall back on the claim that substrates are non-agent-y and only masks are agent-y
This is more an assumption that makes the examples I discuss relevant to the framing I'm describing than a claim I'm arguing. The assumption plausibly holds for LLMs (though, as you note, it has issues even there, possibly very serious ones), and I have no opinion on whether it actually holds in model-based RL, only that it's natural to imagine that it could.
The relevance of LLMs as components for RL is that they make it possible for an RL system to have at least one human-imitating mask that captures human behavior in detail. That is, for the framing to apply, an RL agent should be able to act as a human imitation under at least some (possibly unusual) circumstances, even if that's not its policy more generally and doesn't reflect its nature in any way. The RL part could then supply the capabilities for the mask (acting as its substrate) that LLMs on their own might lack.
A framing is a question about centrality, not a claim of centrality. By describing the framing, my goal is to make it possible to ask whether current behavior in other systems, such as RL agents, could also act as an entity meaningfully separate from other parts of its implementation, abstracting alignment of a mask from alignment of the whole system.