Yes, of course. Despite its stochastic nature, an advanced LLM is extraordinarily unlikely to answer anything other than 4 to 2 + 2, or anything other than Paris when asked for the capital of France. A stochastic process can, in practice, tend toward deterministic behavior. However, deception in a context such as the one discussed in Apollo Research’s article is not really comparable to answering 2 + 2 = ?. What the article demonstrates is that we are dealing with tendencies, accompanied by considerable randomness, including in the intensity of the deception.
Assuming a more sophisticated model has roughly double the deception capability of o1, it would be enough to increase the number of sampled responses for the anomaly to become glaringly obvious. One could also imagine a more rigorous test involving even more complex situations. It does not seem inconceivable that such a procedure could, for years to come, and perhaps even at the stage of the first generations of AGI, identify deceptive behaviors and serve as the basis for an RL procedure built on this test.
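As a back-of-the-envelope illustration of why a larger sample makes the anomaly obvious (the 5% baseline rate and the factor of two below are assumptions for the example, not figures from the Apollo Research paper), a simple two-proportion z statistic shows how quickly a doubled deception rate separates from the baseline as the number of sampled responses grows:

```python
# Toy calculation, not data from the Apollo Research paper: how a doubled
# deception rate separates from a baseline as the number of sampled responses
# grows, using a two-proportion z statistic. The 5% baseline is an assumption.
from math import sqrt

def z_score(p_base: float, p_new: float, n: int) -> float:
    """Two-proportion z statistic with n sampled responses per model."""
    p_pool = (p_base + p_new) / 2
    se = sqrt(2 * p_pool * (1 - p_pool) / n)
    return (p_new - p_base) / se

p_base = 0.05        # hypothetical deception rate of the o1-class model
p_new = 2 * p_base   # "roughly double the deception capability"

for n in (100, 500, 2000, 10000):
    print(f"n = {n:5d}   z = {z_score(p_base, p_new, n):5.2f}")
# Output: z ≈ 1.3 at n=100, ≈ 3.0 at n=500, ≈ 6.0 at n=2000, ≈ 13.4 at n=10000.
# A few hundred samples already push the difference past conventional
# significance thresholds; tens of thousands make it glaringly obvious.
```

This only captures a shift in the raw rate; the more rigorous tests with more complex situations mentioned above would go beyond simple frequency counting.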
My concern here is a bit different: I expect takeover-capable LLMs to have some sort of continuous learning or memory. Leaving them safely amnesic and vulnerable to the type of test you describe would leave too many capabilities on the table for devs to do it voluntarily.
I wrote about this in "LLM AGI will have memory, and memory changes alignment" and "LLM AGI may reason about its goals and discover misalignments by default", and elsewhere.
Current systems have just a little bit of this type of memory. ChatGPT has both explicit and implicit memory over chats for each user. Perhaps we can hope that nobody puts the effort in to create better learning systems. But I read articles on attempts weekly, and I know at least some startups are working on continuous learning in stealth mode.
And it’s not clear those memory systems need to work very well to provide both some capabilities and a lot of risk: even modest memory could make goals more regular and universal, and therefore make scheming more uniform and harder to detect.
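To make the "explicit and implicit memory over chats" mentioned above concrete, here is a minimal, hypothetical sketch of such a per-user memory layer. It is a generic retrieval pattern, not OpenAI's actual implementation; the UserMemory class and its methods are invented for illustration.

```python
# Generic sketch of a per-user memory layer, not ChatGPT's actual implementation.
# "Explicit" memory holds facts the user asked to save; "implicit" memory pulls
# possibly relevant snippets from past chats (here, by crude word overlap) into
# the prompt that the model sees.
from dataclasses import dataclass, field

@dataclass
class UserMemory:
    explicit_facts: list[str] = field(default_factory=list)   # saved on request
    past_chats: list[str] = field(default_factory=list)       # logged automatically

    def remember(self, fact: str) -> None:
        """Explicit memory: the user says 'remember that I ...'."""
        self.explicit_facts.append(fact)

    def log_chat(self, chat: str) -> None:
        """Implicit memory: every conversation is logged for later retrieval."""
        self.past_chats.append(chat)

    def build_context(self, query: str, k: int = 2) -> str:
        # Score past chats by word overlap with the new query; a real system
        # would use embedding similarity and learned summaries instead.
        words = set(query.lower().split())
        recalled = sorted(self.past_chats,
                          key=lambda c: len(words & set(c.lower().split())),
                          reverse=True)[:k]
        return "\n".join(["Known facts: " + "; ".join(self.explicit_facts),
                          "Possibly relevant past chats: " + " | ".join(recalled),
                          "User: " + query])
```

The division of labor is the point: facts saved on request, plus automatic retrieval from logged conversations. The risk discussion above is about what happens when this layer becomes much better than crude.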
I agree that continual or online training and memory would probably be disruptive, both in terms of capabilities and of risk. My idea would indeed fail in that case. It would be fascinating to chat with a model that learns online, but I would also fear it could go out of control at any time.
As you mention, OpenAI introduced a little persistent memory in ChatGPT around version 4o (or was it 4?). While I also use other models, ChatGPT now has an impressive persistent memory of our discussions going back more than a year. I also observe that even such a modest memory has a significant effect: the model sometimes surprises me by connecting the current topic to an idea we discussed long ago. Making such links is certainly part of intelligence.