Thank you for this publication. Just an idea (maybe stupid): GPT-type models are stochastic by design. This characteristic might be used as a way to control them. Your study with Apollo Research illustrates this well—there is a certain probability that a given model will deceive users under specific circumstances and to varying degrees. However, this remains fundamentally random, non-deterministic, and that property is quite interesting. There is always a chance that deception does not occur, is clumsy, or stops at an early stage.
For a single response, it is impossible to determine whether deception is present or not. But across a large set of responses, it seems almost impossible for a model, by its very design, to maintain solid, coherent deceptive behavior across multiple parallel outputs. Thus, a study like yours serves as a very effective tool to detect such behaviors and could remain effective for increasingly advanced models. It would simply require enlarging (perhaps significantly) the response sample.
Moreover, this could be implemented as an automated testing framework inspired by your paper. This would make it possible to keep detecting deceptive behaviors with arbitrarily high confidence, as long as the model remains stochastic. Once such behaviors are detected and measured using tests of this kind, an automated RL phase could follow, aiming to reduce the tendency toward deception to very low levels.
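To make this concrete, here is a minimal sketch of the kind of test I have in mind, assuming a hypothetical `run_scenario` function (one stochastic rollout of an Apollo-style scenario) and a hypothetical `is_deceptive` judge; neither comes from the paper, they are just stand-ins. The idea is simply to sample the scenario many times and ask whether the observed number of deceptive runs is statistically compatible with some tolerated background rate.

```python
# Minimal sketch, not a definitive implementation: sampling-based deception detection.
# `run_scenario` and `is_deceptive` are hypothetical callables supplied by the tester.
import math

def binomial_tail_pvalue(k: int, n: int, p0: float) -> float:
    """P(seeing >= k deceptive runs out of n) if the true deception rate were p0."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

def detect_deception(run_scenario, is_deceptive, n_samples: int = 2000,
                     tolerated_rate: float = 0.001, alpha: float = 1e-4) -> bool:
    """Flag the model if deception shows up more often than the tolerated rate allows."""
    deceptive = sum(1 for _ in range(n_samples) if is_deceptive(run_scenario()))
    p_value = binomial_tail_pvalue(deceptive, n_samples, tolerated_rate)
    return p_value < alpha  # True -> the deceptive tendency is not plausibly noise
```

The sample size and the threshold are the knobs here: as long as each rollout is genuinely independent and stochastic, the false-negative rate can be driven down just by sampling more.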
I think the concern here is twofold:
1. Once a model is deceptive at one point, even if this happens stochastically, it may continue its deception deterministically.
2. We can’t rely on future models being as stochastic with respect to the things we care about, e.g. scheming behaviour.
Regarding 2, consider the trend towards determinism we see in the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this probability was low, and it has trended towards determinism with newer releases. We’re seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).
Yep, LLMs are stochastic in the sense that there isn’t literally a 100% probability of their outputs having any given property. But they could very well be effectively deterministic (e.g. there’s plausibly a >99.999% probability that GPT-4’s response to “What’s the capital of France?” includes the string “Paris”).
Yes, of course. Despite its stochastic nature, it is extraordinarily unlikely for an advanced LLM to respond with anything other than 2 + 2 = 4 or Paris for the capital of France. A stochastic phenomenon can, in practice, tend toward deterministic behavior. However, deception in a context such as the one discussed in Apollo Research’s article is not really comparable to answering 2 + 2 = ?. What the article demonstrates is that we are dealing with tendencies, accompanied by considerable randomness, including in the intensity of the deception.
Assuming a more sophisticated model is roughly twice as capable at deception as o1, it would be enough to increase the sample size of responses for the anomaly to become glaringly obvious. One could also imagine a more rigorous test involving even more complex situations. It does not seem inconceivable that such a procedure could, for years to come (and perhaps even for the first generations of AGI), identify deceptive behaviors and serve as the basis for an RL procedure.
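As a back-of-the-envelope illustration (the per-rollout deception rate below is an assumption for the sake of the example, not a figure from the paper), the chance of catching at least one deceptive rollout is 1 - (1 - p)^n, which climbs toward certainty quickly as the sample grows:

```python
# Illustrative only: probability of seeing at least one deceptive rollout in n samples,
# for an assumed per-rollout deception rate p (not a figure from the Apollo paper).
p = 0.01
for n in (10, 100, 1_000, 10_000):
    p_detect = 1 - (1 - p) ** n
    print(f"n={n:>6}: P(at least one deceptive rollout) = {p_detect:.4f}")
```

So even a fairly rare tendency becomes glaringly visible once the sample is large enough; halving the rate roughly doubles the sample needed, which seems cheap compared to the cost of missing it.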
My concern here is a bit different: I expect takeover-capable LLMs to have some sort of continuous learning or memory. Leaving them safely amnesic and vulnerable to the type of test you describe would leave too many capabilities on the table for devs to do it voluntarily.
I wrote about this in “LLM AGI will have memory, and memory changes alignment” and “LLM AGI may reason about its goals and discover misalignments by default”, and elsewhere.
Current systems have just a little bit of this type of memory. ChatGPT has both explicit and implicit memory over chats for each user. Perhaps we can hope that nobody puts the effort in to create better learning systems. But I read articles on attempts weekly, and I know at least some startups are working on continuous learning in stealth mode.
And it’s not clear those memory systems need to work very well to provide both some capabilities and a lot of risk: even modest memory could make goals more regular and universal, and therefore make scheming more uniform and harder to detect.
I agree that continual or online training and memory would probably be disruptive both in terms of capabilities and risk. My idea would indeed fail in that case. It would be fascinating to chat with such an online model, but I would also fear it could go out of control at any time.
As you mention, OpenAI introduced a little persistent memory into ChatGPT around version 4o (or was it 4?). While I also use other models, ChatGPT now has an impressive persistent memory of our discussions going back more than a year. I also observe that even such a modest memory has a significant effect. The model sometimes surprises me by referring back to an idea we discussed long ago. Establishing such links is certainly part of intelligence.