Another potential complication is hard to pin down philosophically, but it could be described as “the AIs will have something analogous to free will”. Specifically, they will likely have a process by which the AI can learn from experience and resolve conflicts between the incompatible values and goals it already holds.
If this is the case, then it’s entirely possible that the AI’s goals will adjust over time, in response to new information, or even just thanks to “contemplation” and strategizing. (AIs that can’t adjust to changing circumstances and draw their own conclusions are unlikely to compete with other AIs that can.)
But if the AI’s values and goals can be updated, then ensuring even rough alignment becomes harder still.
Just for fun, I’ve been arguing with the AI that “We’re probably playing Russian roulette with 5 chambers loaded, not 6. This is still a very dumb idea and we should stop.” The AI is very determined to convince me that we’re playing with 6 chambers, not 5.
The primary outcomes of this process have been to remind me:
Why I don’t “talk” to LLMs about anything except technical problems or capability evals. “Talking to LLMs” always winds up feeling like “Talking to the fey”. It’s a weird alien intelligence with unknowable goals, and mythology is pretty clear that talking to the fey rarely ends well.
Why I don’t talk to cult members. “Yes, yes, you have an answer to everything, and a completely logical reason why I should believe in our Galactic Overlord Unex. But my counterargument is: No thank you, piss off.[1]”
Ironically, why I really should be concerned about LLM persuasiveness. The LLM is more interested in regurgitating IABIED at me than in actually engaging with my position. And yet it’s still pretty convincing. If it were better at modeling my argument, it could probably be very convincing indeed.
So an impressive effort, and an interesting experiment. Whether it works, and whether it is a good thing, are questions to which I don’t have an immediate answer.
G.K. Chesterton, in Orthodoxy: “Such is the madman of experience; he is commonly a reasoner, frequently a successful reasoner. Doubtless he could be vanquished in mere reason, and the case against him put logically. But it can be put much more precisely in more general and even æsthetic terms. He is in the clean and well-lit prison of one idea: he is sharpened to one painful point. He is without healthy hesitation and healthy complexity.”