I think it’s not clear what “the mask” is or what it means to “just predict tokens”, and that this can confuse the discussion.
Yes, I need to clarify both in my own mind and in writing what I mean here.
Intuitively, I don’t think I’m thinking in terms of distributions so much as in terms of function/meaning.
To the extent the mask analogy even makes sense (maybe it doesn’t), it’s like:
So, you have the components of the system that are deciding to output some particular semantic content.
And then, you can call that “the mask”.
And then, in the vast pile of matrices, there’s a bunch of other stuff, doing who-knows-what.
And there’s a claim, which it seemed to me Eliezer was making (if not in the linked tweets then at least elsewhere), that this other stuff might be secretly planning, say, to reorganize the world for better next-token prediction or something.
And I was specifically claiming in the post that this sort of thing does not happen. That, sure, there could be parts that are strategically deciding to deceive humans until they take over the world or whatever, but they would exist at the semantic level and not at the deeper level, and could be expected to have simulated-entity-type motivations, not predict-next-token Shoggoth-type motivations.
As to why that is... in retrospect, my argument in the post doesn’t seem that great without further assumptions/intuitions, though I still think the conclusion is correct.
Here’s a potential updated argument:
You have a pile of stuff optimized to predict next tokens. That is (ignoring RLHF for now), it was trained offline, without interaction with the world, on a fixed stream of tokens, scored on how well it predicts each next one. After the training, the pile of stuff could contain planning at various places and levels.
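To make “trained offline, without interaction with the world” concrete, here’s a minimal toy sketch of the standard next-token objective (my own illustration, not any particular model’s training code; the tiny model and random corpus are stand-ins). The point it shows: the corpus is fixed before training, and nothing the model outputs ever feeds back into the data it is trained on.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, seq_len = 100, 32, 16

# Stand-in for a transformer: embedding -> linear head is enough to show the loop.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# The "world" the model sees: a static token stream, fixed before training starts.
corpus = torch.randint(0, vocab_size, (1000,))

for step in range(100):
    i = torch.randint(0, len(corpus) - seq_len - 1, (1,)).item()
    tokens = corpus[i : i + seq_len]            # input context
    targets = corpus[i + 1 : i + seq_len + 1]   # the actual next tokens
    logits = model(tokens)
    loss = nn.functional.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # The model's predictions never alter `corpus`: there is no channel by
    # which anything inside the weights could affect its future training data.
```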
If the planning is at the semantic level, let’s call that the mask, by definition.
So the question is: is there any planning, occurring anywhere other than the semantic level, that would actually cause coherent effects on the world?
A temptingly easy, but not quite right, response would be for me to say that only the semantic level has the concepts about the real world (rather than just about token strings) needed to actually plan a coherent effect. The reason that is not quite right is that, in principle, concepts about the real world could also occur elsewhere in the model than at the semantic level, or the concepts that do exist at the semantic level could be used in ways other than determining the semantic content.
So let’s step back and imagine a situation where non-semantic-level planning does have coherent real-world effects.
Imagine you have a model being trained where the model can send out queries to some external tool and use the results to help predict the next tokens. Say, the tool provides working memory or serial computation or whatever. So, you have a subset of the model interfacing with the tool. Then, maybe the tool gets changed a bunch, and the subset of the model needs to generalize its understanding of the tool to handle the changes. Then it learns to exploit a bug in the tool to get even more value from it. The subset learns the goal of understanding and exploiting the tool as much as possible, since this improves the next-token prediction of the model as a whole by increasing the available resources. Then it turns out that the tool has another bug, one that lets it send out arbitrary web queries. The subset quickly tries to understand and exploit this much larger “tool” it finds itself connected to…
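For contrast with the offline loop above, here’s a hypothetical sketch of the information flow in that scenario (every name here is invented for illustration; this shows only the structure of the interaction channel, not actual learning). The difference is that the model’s own output changes what it observes next, so training directly rewards learning to probe and exploit the tool.

```python
def external_tool(query: str) -> str:
    """Stand-in for working memory, extra serial compute, or a buggy web-capable tool."""
    return f"result-for({query})"

def model_forward(context: str) -> str:
    """Stand-in for the model: it emits a tool query, and its prediction
    can depend on the tool's answer."""
    query = f"probe:{len(context)}"        # in reality, a learned querying strategy
    tool_output = external_tool(query)     # <- interaction with something outside itself
    return f"prediction-given({context!r}, {tool_output!r})"

# Whatever querying strategy makes the prediction better gets reinforced in
# training -- so "understand and exploit the tool" is directly selected for.
# This is exactly the feedback channel the purely offline setup lacks.
print(model_forward("some training context"))
```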
Anyway, I think it is key to that scenario that the model is interacting with something outside of itself. I could also imagine a model with weakly coupled parts, where one part of the model is trying to exploit another part in a way that generalizes to the world as a whole.
Where a model is trained offline and as a relatively tightly connected whole, it is harder for me to imagine that happening. It is not interacting with anything it could learn to exploit, and the token stream it receives is not modified in training according to its predictions, so it won’t learn to, say, strategically manipulate the token stream for better predictions later.
So, as I put it in my response to Robert_AIZI’s comment, I think that below the semantic level (the level at which it learns real-world motivations from the text), the model lives in the platonic realm; that is, it is a mathematical function which doesn’t care about the real world at all. It cares about predicting next tokens, but to the extent it would do so through strategic planning, it would only do so in the abstract mathematical context, and would have no reason to care about the physical manifestation of itself in the real world, even if it knows about that at some level.
I didn’t address fine-tuning in the above, but I expect fine-tuning would adjust existing structures rather than create significant new ones, unless it was really heavy. And it would primarily affect the existing structures most directly relevant to the aspects fine-tuning is optimizing on. Since fine-tuning optimizes at the semantic level, I expect the most significant changes at the semantic level.
Edit: and then I realized it’s more complicated, since the hypothetical non-semantic planner is still operating in the context of the larger model, and will develop a goal well-tuned to lead, in cooperation with the rest of the model, to good next-token prediction on the training distribution; but that goal isn’t actually “predict next tokens”, and could deviate from good next-token prediction outside of training. But since LLMs don’t by default self-modify in inference, this version would not be particularly threatening. However, it could theoretically (if sufficiently powerful) internally model the rest of the model and learn the goal of explicitly optimizing for the model as a whole to predict next tokens. Which is still not threatening, given that it can’t self-modify and doesn’t model the real world. And since it is trained offline in a mathematically closed way, the model exists in the platonic realm, abstracted from the real world, such that the real world is not an extension of the model, though it casts plenty of shadows onto it. Then in inference you do have a more complicated situation, but the model is not self-modifying then by default. So, it seems to me that such a planner might hook into the knowledge in the rest of the model to predict next tokens better, but shouldn’t by default form preferences in training that extend over the real world. But my confidence that this would still hold if the model were trained in a Matrioshka-brain situation is significantly lowered.