No, it does not develop neuralese. The architecture that it is being trained on is already using neuralese.
You’re correct on the object level here, and it’s a point against Collier that the statement is incorrect, but I do think it’s important to note that a fixed version of the statement serves the same rhetorical purpose. That is, on page 123 it does develop a new mode of thinking, analogized to a different language, which causes the oversight tools to fail and also leads to an increase in capabilities. So Y&S are postulating a sudden jump in capabilities which causes oversight tools to break, in a way that a more continuous story might not have.
I think Y&S still have a good response to the repaired argument. The reason the update was adopted was because it improved capabilities—the scientific mode of reasoning was superior to the mythical mode—but there could nearly as easily have been an update which didn’t increase capabilities but scrambled the reasoning in such a way that the oversight system broke. Or the guardrails might have been cutting off too many prospective thoughts, and so the AI lab is performing a “safety test” wherein they relax the guardrails, and a situationally aware Sable generates behavior that looks behaved enough that the relaxation stays in place, and then allows for it to escape when monitored less closely.
This is about making a pretty straightforward and I think kind of inevitable argument that as you are in the domain of neuralese, your representations of concepts will diverge a lot from human concepts, and this makes supervision much harder.
I don’t think this is about ‘neuralese’, I think a basically similar story goes thru for a model that only thinks in English.
What’s happening, in my picture, is that meaning is stored in the relationships between objects, and that relationship can change in subtle ways that break oversight schemes. For example, imagine an earnest model which can be kept in line by a humorless overseer. When the model develops a sense of humor / starts to use sarcasm, the humorless overseer might not notice the meaning of the thoughts has changed.
You’re correct on the object level here, and it’s a point against Collier that the statement is incorrect, but I do think it’s important to note that a fixed version of the statement serves the same rhetorical purpose. That is, on page 123 it does develop a new mode of thinking, analogized to a different language, which causes the oversight tools to fail and also leads to an increase in capabilities. So Y&S are postulating a sudden jump in capabilities which causes oversight tools to break, in a way that a more continuous story might not have.
I think Y&S still have a good response to the repaired argument. The reason the update was adopted was because it improved capabilities—the scientific mode of reasoning was superior to the mythical mode—but there could nearly as easily have been an update which didn’t increase capabilities but scrambled the reasoning in such a way that the oversight system broke. Or the guardrails might have been cutting off too many prospective thoughts, and so the AI lab is performing a “safety test” wherein they relax the guardrails, and a situationally aware Sable generates behavior that looks behaved enough that the relaxation stays in place, and then allows for it to escape when monitored less closely.
I don’t think this is about ‘neuralese’, I think a basically similar story goes thru for a model that only thinks in English.
What’s happening, in my picture, is that meaning is stored in the relationships between objects, and that relationship can change in subtle ways that break oversight schemes. For example, imagine an earnest model which can be kept in line by a humorless overseer. When the model develops a sense of humor / starts to use sarcasm, the humorless overseer might not notice the meaning of the thoughts has changed.