Yeah, I’ll pile on in agreement.
I feel like thinking of the internals of transformers as doing general search (especially search over things to simulate) is a kind of fallacy of division. The system as a whole (the transformer) outputs a simulation of the training distribution, but that doesn't mean it's made of parts that themselves do simulations, or that refer to "simulating a thing" as a basic part of some internal ontology.
I think "classic" inner alignment failure (where some inner Azazel has preferences about the real world) is a Procrustean bed: it fits an RL agent navigating the real world, but not so much a pure language model.