My concern here is a bit different: I expect takeover-capable LLMs to have some sort of continuous learning or memory. Leaving them safely amnesic and vulnerable to the type of test you describe would leave too many capabilities on the table for devs to do it voluntarily.
I wrote about this in "LLM AGI will have memory, and memory changes alignment" and "LLM AGI may reason about its goals and discover misalignments by default", and elsewhere.
Current systems have just a little bit of this type of memory. ChatGPT has both explicit and implicit memory over chats for each user. Perhaps we can hope that nobody puts in the effort to create better learning systems. But I read articles on attempts weekly, and I know at least some startups are working on continuous learning in stealth mode.
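As an aside, here is a minimal toy sketch of what the "explicit" half of that memory could look like: salient facts get saved per user, and the most relevant ones are prepended to the prompt on later turns. This is my own illustration, not OpenAI's actual implementation; in particular, the word-overlap ranking stands in for the embedding similarity a real system would use.

```python
# Toy sketch of a per-user "explicit memory" layer (illustrative only).
# A real system would store learned embeddings in a vector database;
# here, word overlap stands in for semantic similarity.

from dataclasses import dataclass, field


@dataclass
class ExplicitMemory:
    notes: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        """Store a salient fact extracted from the conversation."""
        self.notes.append(note)

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored notes sharing the most words with the query."""
        q = set(query.lower().split())
        ranked = sorted(self.notes, key=lambda n: -len(q & set(n.lower().split())))
        return ranked[:k]

    def build_prompt(self, user_message: str) -> str:
        """Prepend recalled notes so the model can link back to old ideas."""
        recalled = "\n".join(f"- {n}" for n in self.recall(user_message))
        return f"Known about this user:\n{recalled}\n\nUser: {user_message}"


memory = ExplicitMemory()
memory.remember("User is writing a post about continual learning in LLMs.")
memory.remember("User prefers concise answers.")
print(memory.build_prompt("Any updates on continual learning?"))
```

Even something this crude gives the model cross-conversation context, which is part of why I expect devs to keep pushing on it.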
And it’s not clear those memory systems need to work very well to provide both some capabilities and a lot of risk: memory could make goals more regular/universal across contexts, and therefore make scheming more uniform and harder to detect.
I agree that continual or online training/memory would probably be disruptive in terms of both capabilities and risk. My idea would indeed fail in that case. It would be fascinating to chat with an online model, but I would also fear it could go out of control at any time.
As you mention, OpenAI introduced a little persistent memory in ChatGPT around version 4o (or was it 4?). While I also use other models, ChatGPT now has an impressive persistent memory of our discussions going back more than a year. I also observe that even such modest memory has a significant effect: the model sometimes surprises me by drawing a connection to an idea we discussed long ago. Making such links is certainly part of intelligence.