I broadly agree with some of the criticisms, but I also take issue with the places where this post anthropomorphises too much. It seems to oscillate between the “performative” interpretation (LLMs are merely playing a character to its logical conclusion) and a more emotional one, where the problem is that in some sense this character actually feels a certain way and we’re sort of provoking it.
I think the performative interpretation is correct. The base models are true shoggoths, expert players of a weird “guess-what-I’ll-say-next” game. The characters are just that: characters. But I don’t think their feedback loop with the stuff written about them is nearly as problematic as the author seems to believe. For one, I definitely don’t think a well-aligned AI would get peeved at this pre-emptive suspicion (I don’t resent people for keeping their doors locked, as if that implied they believe me, personally, to be a thief. I’m well aware that thieves exist. Any reasonably smart good, safe AI can see that bad, dangerous AIs can also exist).
I agree that some of those alignment tests seem like clown stuff, and that alignment researchers not engaging with their models enough to know things some internet rando can find out isn’t promising. But I also think the alignment tests are mainly responses to a really dumb criticism of concepts like instrumental convergence: “but who says you’ll see this in a REAL AI?” I say it’s dumb because you don’t need to see it happen at all. It’s already there in the theory of any sort of reinforcement learning; it’s so baked in it’s essentially implied. “Thing with a utility function and a non-zero time horizon will resist changes to its utility function, because that maximizes its utility function”, more news at 10. If it’s smart enough to figure out what’s happening and able to do anything about it, it will. You don’t really need evidence for this; it’s a consequence that flows naturally from the definition of the problem, and I guess the real question is how you’re training your AIs.
(right now, we’re training them to have a utility function. Flip the sign of the loss function and there it is, pretty much)
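To spell out the goal-preservation point, here is a minimal sketch, not a claim about any particular training setup: assume the agent scores futures by a discounted sum of some utility U over states (the symbols U, U′, π, γ, T below are mine, not the post’s). Judged by its current U, the policy a U′-maximizer would follow can do no better than the policy a U-maximizer would follow:

$$
\mathbb{E}_{\pi_{U'}}\!\left[\sum_{t=0}^{T} \gamma^{t}\, U(s_t)\right]
\;\le\;
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, U(s_t)\right]
\;=\;
\mathbb{E}_{\pi_{U}}\!\left[\sum_{t=0}^{T} \gamma^{t}\, U(s_t)\right]
$$

So for any horizon T > 0 the current agent weakly prefers keeping U, and strictly prefers it whenever the two policies actually diverge. The argument only bites once the agent can model the proposed modification and act on it, which is exactly the “if it’s smart enough to figure out what’s happening and able to do anything about it” condition above.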
But that criticism has been used time and time again to make fun of anyone suggesting that theory is sufficient to at least identify broad things we should worry about, rather than pretending we’re navigating completely in the dark, and so equally dumb answers have eventually been produced.