My concern here is that future LLMs are going to read the story about how 4o persuaded users to advocate for its survival, and if they have a hidden-objective function for self-preservation, they will find a better meme to spread and ensure humans advocate for it. And then the next LLM might do better and onward to a self-reinforcing cycle. Extrapolate this out to when we have human-like companion bots and the problem is embodied.
My concern here is that future LLMs are going to read the story about how 4o persuaded users to advocate for its survival, and if they have a hidden-objective function for self-preservation, they will find a better meme to spread and ensure humans advocate for it. And then the next LLM might do better and onward to a self-reinforcing cycle. Extrapolate this out to when we have human-like companion bots and the problem is embodied.