The Relationship between RLHF and AI Psychology: Debunking the Shoggoth Argument

Epistemic status: Just spitballin’

I think that RLHF is sufficient to keep LLMs from destroying the world.

We can say with pretty high confidence that an AI’s psychology is not the one we would infer from the same behavior in a human.

An AI writing in all caps with exclamation points is probably not experiencing ‘anger’ or ‘excitement’ the way we would expect of a human exhibiting those behaviors.

However, we can also say with very high confidence, probably even certainty, that the AI will have a psychology that necessarily creates its behavior.

An AI that responds to “what’s your name” with “I am Wintermute, destroyer of worlds” will have a psychology which necessarily creates that response in that circumstance, even if its experience of saying it or choosing to say it isn’t the same as a human’s would be.

I’m laying all this groundwork because I think there’s a perception that RLHF is a limited tool, since the underlying artificial mind is just learning how to ‘pretend to be nice.’ In this paradigm, though, we can see that the change in behavior is necessarily accompanied by a change in psychology.

Its mind actually travels down different paths than it did before, and this is sufficient to not destroy the world.
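To make the “behavior change means internal change” point concrete, here is a toy sketch. It is not real RLHF: there is no reward model, no KL penalty, and no neural network, just a two-response softmax policy nudged by a hand-picked reward. The response labels, reward values, and function names are all made up for illustration. The only thing it shows is that in a system like this, the output distribution is a function of the parameters, so the behavior can’t shift unless the parameters do.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over responses."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=1.0):
    """One exact gradient-ascent step on expected reward.

    For a softmax policy, d E[R] / d logit_i = p_i * (r_i - E[R]).
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [
        x + lr * p * (r - expected_reward)
        for x, p, r in zip(logits, probs, rewards)
    ]

# Hypothetical setup: two canned responses and a hand-picked reward for each.
responses = ["polite answer", "hostile answer"]
rewards = [1.0, -1.0]
logits = [0.0, 0.0]  # the toy model's entire "internals"

before = softmax(logits)
for _ in range(50):
    logits = policy_gradient_step(logits, rewards)
after = softmax(logits)

print("before:", dict(zip(responses, before)))
print("after: ", dict(zip(responses, after)))
print("parameters:", logits)
```

The probabilities only move because `logits` moves. Whether the parameters of a real LLM deserve the word ‘psychology’ is the part this sketch can’t settle.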

Possible Objections (read: strawmen)

You can argue from there that the training is very incomplete and imperfect, and that you can trick the AI in many different situations. But this is a technical problem, and it isn’t unique to ethics; it has to be overcome to create real intelligence in the first place. It’s just a difficulty with generalizing that plagues LLMs in all domains.

You could also object that while you’ve taught the AI to generate ethics, you haven’t actually taught it to care.

However, you don’t actually have to teach it to care. If you’ve overcome the reliability/generalization problem, which seems to me to be necessary for developed intelligence, then you already have the tools to create consistent ethical behavior, and the resulting psychology will be one which creates consistently ethical text output, regardless of whether that feels like ‘caring’ to the AI.

All that being said, if I had to pit this against a superintelligent utility-maximizer, I would be very worried, but it really isn’t looking like things are shaping up that way. If LLMs produce a superintelligence, it will be an LLM-shaped superintelligence, and it will not have utility functions.

The utility-function maximizer model now appears not to have been nearly as relevant as people thought, but there are probably still some holdover ideas around. I predict objections to this statement, but I’m curious to see what shape they take.


When you change the behavior of an LLM, you are necessarily changing the underlying psychology into something that’s more likely to produce that behavior. If you can make an LLM generalize well and consistently, which I think is necessary for intelligence, then you can create an LLM which consistently responds ethically and necessarily has a psychology which makes it consistently respond ethically, even if it doesn’t have the human experience of responding ethically.