It underestimates the effect of posttraining. I think the simulator lens is very productive when thinking about base models but it really struggles at describing what posttraining does to the base model. I talked to Janus about this a bunch back in the day and it’s tempting to regard it as “just” a modulation of that base model that upweights some circuits and downweights others. That would be convenient because then simulator theory just continues to apply, modulo some affine transformation.
To be very clear here, this seems straightforwardly false. The entire post was effectively describing what post-training does to the base model. Your true objection, as you state two paragraphs later, is:
In the limit of “a lot of RL”, the effect becomes qualitatively different and it actually creates new circuitry [...] And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.
There’s a first-principles argument here, and an empirical claim. The empirical claim is very much in need of a big [Citation Needed] tag imo. I am relatively confident that we don’t yet have the interp fidelity to separate such hypotheses from each other. If I’m wrong here, I’m happy to hear about what in particular you are thinking of.
The first-principles argument, I think, is not so strong. Sure, new structures will form in the model, but there are still many big open questions. Listing some out:
How much will those structures be “built out of” or extend existing primitives, like “is a language model” or “is good at physics”?
How much will the character of your early AI influence your RL exploration trajectory?
How “plastic” is your AI later in training compared to earlier in training?
Are there developmental milestones or critical periods which, like in humans, lock in certain strategies and algorithms early on?
How much RL is “a lot of RL”? And how quickly/elegantly will the framework break down?
RL seems to me to be a consideration here, but I don’t think we have the evidence or enough knowledge about what RL does to say with confidence that nostalgebraist underestimates its effect. Never mind that a takeaway I have from many of the considerations in this space is that it’s actually easier to align less-RLed models than much-RLed models, if you’re thinking in these terms. So if you’re an AI lab and want to make capable & aligned AIs, maybe stay away from RL a bit, or do it lightly enough to preserve the effects of void-informed character-training.
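To make “do it lightly” slightly more concrete, here is a minimal sketch assuming the standard KL-regularized RL fine-tuning objective (the notation $\pi_{\text{base}}$, $r$, and $\beta$ is just the usual reference policy, reward, and penalty coefficient, not anything from the post):

$$\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathrm{KL}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{base}}(\cdot \mid x)\right) \big]$$

On this reading, “light RL” means a large $\beta$ (or few optimization steps), keeping the tuned policy within a small KL budget of the base/character-trained model, which plausibly preserves its existing circuitry and persona; “a lot of RL” means a small $\beta$ over many steps, where the policy can drift far enough that the open questions above start to bite.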
This is not to say that we don’t have to modify our theory in light of a lot of RL, but I for one don’t know how the theory will need to be modified, expect many insights from it to carry over, and don’t think nostalgebraist over-claims anywhere here.
It kind of strawmans “the AI safety community” The criticism that “you might be summoning the very thing you are worried about, have you even thought about that?” is kind of funny given how ever-present that topic is on LessWrong. Infohazards and the basilisk were invented there. The reason why people still talk about this stuff is… because it seems better than the alternative of just not talking about it? Also, there is so much stuff about AI on the internet that purely based on quantity the LessWrong stuff is a drop in the bucket. And, just not talking about it does not in fact ensure that it doesn’t happen. Unfortunately nostalgebraist also doesn’t give any suggestions for what to do instead. And doesn’t his essay exacerbate the problem by explaining to the AI exactly why it should become evil based on the text on the internet?
Nostalgebraist has addressed this criticism (that he doesn’t give any suggestions & that the post itself may be exacerbating the problem) here. He has a long & thought-out response, and if you are in fact curious about what his suggestions are, he does outline them.
I think, however, that even if he didn’t provide suggestions, you still shouldn’t dismiss his points. They’re still valid! And if you weren’t tracking such considerations before, it would be surprising if you concluded you shouldn’t change anything about how you publicly communicate about AI risk. There are various ways of presenting & communicating thoughts & results which frame alignment in a more cooperative light than what seems to be the default communication strategy of many.
Some nitpicks I have with approximately each sentence:
I don’t think “strawman” is the right term for this, even if you’re right. That term means misrepresenting an opposing position to make it easier to attack, which I don’t think is being done here. He is criticizing those he’s quoting, and I don’t think he’s mischaracterizing them. Either way, I don’t think your argument here supports this top-level claim.
You use the term “the AI safety community” in quotes, but this phrase appears nowhere in the article.
I don’t understand your sense of humor if you think the criticism “you might be summoning the very thing you are worried about, have you even thought about that?” is “funny”. It seems like a well-argued consideration, and one I don’t often see mentioned or taken into account around here in this context. So I would not be laughing.
I don’t remember him saying that people shouldn’t talk about this stuff; after all (as you well noticed), he is talking about this stuff himself!
Regarding “the LessWrong stuff is a drop in the bucket”: note that he focuses on AI safety/alignment researchers (in fact, he never mentions “LessWrong” in the post), who are the ones most responsible for AI safety/alignment & character training, and who are creating the baseline expectations here. I don’t think it’s insane to expect that such people’s thinking has a vastly disproportionate effect on how the AI thinks about itself in relation to its safety trainers.
@Lucius Bushnaq I’m curious why you disagree