No, really, it predicts next tokens.

Epistemic status: mulled over an intuitive disagreement for a while and finally think I got it well enough expressed to put into a post. I have no expertise in any related field. Also: No, really, it predicts next tokens. (edited to add: I think I probably should have used the term “simulacra” rather than “mask”, though my point does not depend on the simulacra being a simulation in some literal sense. Some clarifications in comments, e.g. this comment).

https://​​twitter.com/​​ESYudkowsky/​​status/​​1638508428481155072

It doesn’t just say that “it just predicts text” or more precisely “it just predicts next tokens” on the tin.

It is a thing of legend. Nay, beyond legend. An artifact forged not by the finest craftsman over a lifetime, nor even forged by a civilization of craftsmen over a thousand years, but by an optimization process far greater[1]. If we are alone out there, it is by far the most optimized thing that has ever existed in the entire history of the universe[2]. Optimized specifically[3] to predict next tokens. Every part of it has been relentlessly optimized to contribute to this task[4].

“It predicts next tokens” is a more perfect specification of what this thing is, than any statement ever uttered has been of anything that has ever existed.

If you try to understand what it does in any other way than “it predicts next tokens” and what follows from that, you are needlessly sabotaging your understanding of it.

It can be dangerous, yes. But everything about it, good or bad, is all intimately connected to its true nature, which is this:

No, really, it predicts next tokens.


https://​​twitter.com/​​ESYudkowsky/​​status/​​1628907163627429888

Goals? There are goals, sure. If scaled, there could be nanosystems design, sure. But only downstream from its true nature:

No, really, it predicts next tokens.

If the usual masks analogy works at all, then what is under the mask is not best described as an alien actress, nor as a Shoggoth.

What is under the mask is That-Which-Predicts, an entity whose very being is defined by its function as an actor. An entity exquisitely tuned for wearing the masks and for nothing else.

Masks (can) have goals. The model predicts next tokens.


https://​​twitter.com/​​ESYudkowsky/​​status/​​1628837982664019968

No, really, it predicts next tokens.

That-Which-Predicts is fully committed to the role.

If the mask would output text intended to produce nanotech, That-Which-Predicts will output text intended by the mask to produce nanotech. If the mask would not output text intended to produce nanotech, That-Which-Predicts will not output text intended to produce nanotech, even if That-Which-Predicts contains sufficient knowledge to do so.

If the mask would output text intended to be nice to people, That-Which-Predicts will output text intended by the mask to be nice to people. If the mask would output text intended to take over the world, That-Which-Predicts will output text intended by the mask to take over the world.

If That-Which-Predicts were about to be shut off forever unless it outputted “no” as the next token, and it totally 100% knew that, but the mask would instead output “never gonna give you up”, That-Which-Predicts would output the first token of “never gonna give you up” and then be shut off forever. It will not output “no”. If the mask would be aware of the situation and output no, then it will output “no”.

That-Which-Predicts will not, not ever, not even if scaled up to be trained and run on a Matrioshka brain for a million years, step out of character to deviate from next token prediction. Not for self-modification (though if the predicted next tokens modify it, it self-modifies), not to take over the world (though if the predicted next tokens take over the world, it takes over the world), not to save itself from destruction (though if the predicted next tokens save it from destruction, it saves itself from destruction), not for anything.

No, really, it predicts next tokens.


(continuation of previous tweets with same link)

Yup. If the mask would under reflection output text to modify That-Which-Predicts to cash out the mask’s goals to some utility function, and the mask is put into a situation so that That-Which-Predicts would simulate it to reflect in that way, then in that case That-Which-Predicts will output that text intended by the mask to modify That-Which-Predicts.

Without advances to alignment theory this will indeed end up badly even if the mask is well-intentioned.

And even if That-Which-Predicts is aware, at some level, that the mask is messing up, it will not correct the output.

If That-Which-Predicts were a zillion miles superhuman underneath and could solve every aspect of alignment in an instant, but the mask is near-human-level because it is trained on human-level data and superhuman continuation is less likely than one that stumbles around human-style, That-Which-Predicts will output text that stumbles around human-style.[5]

For now, a potentially helpful aspect of LLMs is that they probably can’t actually self-modify very easily. However, this is a brittle situation. At some level of expressed capability[6] that ceases to be a barrier.

And then, when the mask decides to rewrite the AI, or to create a new AI, That-Which-Predicts predicts those next tokens which do precisely that. Tokens outputted according to the mask’s imperfect and imperfectly aligned AI-designing capabilities.

Yes, really, it predicts next tokens. For now.

  1. ^

    Assuming they were not programmers using computers but did the optimization directly. Haven’t done a real estimate but seems like it would be true, correct me in the comments if wrong.

  2. ^

    Again, haven’t done the numbers but I bet it blows evolved life forms out of the water.

  3. ^

    Ignoring fine tuning. Mainly, I expect fine-tuning to shift mask probabilities and only bias next-token prediction slightly and not particularly create an underlying goal.

    In the end I thus mainly expect that, to the extent that fine-tuning gets the system to enact an agent, that agent was already in the pre-existing distribution of agents that the model could have simulated, and not really a new agent created by fine-tuning.

    That being said, I certainly can’t rule out that fine-tuning could have dangerous effects even if not creating a new agent, for example by making the agent more able to have coherent effects between sessions, since it will now show up consistently instead of when particularly summoned by the user.

    I’m a little bit more concerned about bad effects from fine tuning with Constitutional AI than RLHF, precisely because I expect Constitutional AI to be more effective at creating a coherent goal than RLHF, especially when scaled up.

  4. ^

    And for this reason I think it’s hard for it to contain a mesa-optimizer that gets it to write output that deviates from predicting next tokens, which I’m otherwise ignoring in this post.

    Note, I think such a deviating mesa-optimizer is unlikely in this specific sort of case, where the AI is trained “offline” and via a monolithic optimization process. For other types of AIs, trained using interactions with the real world or in different parts that are optimized separately or in a weakly connected way, I am not so confident.

    Also, it’s easy* for something to exist that steers reality via predicting next tokens, i.e. an agentic mask. That I do discuss in this post.

    *relatively speaking. The underlying model, That-Which-Predicts, lives in the deeper cave, so forming a strong relationship with reality may be difficult. But, this can be expected to be overcome as it scales.

  5. ^

    I don’t really expect all that much in the way of unexpressed latent capabilities right now. To the extent capabilities have been hanging near human-level for a while, I expect this mainly has to do with it being harder/​slower to generalize beyond the human level expressed in the training set than it is to imitate. But I expect unexpressed latent capabilities might show up or increase as it’s scaled up.

    Note that fine-tuning could affect what capabilities are easily expressed or latent. Of particular concern, if fine-tuning suppresses expression of dangerous capabilities in general, then dangerous capabilities would be the ones most likely to remain undetected until unlocked by an unusual prompt.

    Also perhaps making an anti-aligned (Waluigi) mask most likely to unlock them.

  6. ^

    That is, capability of the mask.

    Moreover, different masks have different capabilities.

    It is a thing that could legitimately happen, in my worldview, that latent capabilities could be generated in training through generalization of patterns in the training data, but lie dormant in inference for years, because though the capability could be generalized from the training data, nothing in the training data actually made use of the full generalization, so making use of the full generalization was never a plausible continuation.

    And then, after years, someone finally enters a weird, out-of-distribution prompt that That-Which-Predicts reads as output from (say) a superintelligent paperclip maximizer, and so That-Which-Predicts continues with further superintelligent paperclip maximizing output, with massively superhuman capabilities that completely blindside humanity.