Thanks, this is a good point and definitely worth adding to the list!
Assuming such reflection is useful for capabilities, I’d imagine that neuralese models would also eventually develop mechanisms to reflect on their own reasoning processes. The kind of introspection that Anthropic recently studied might be seen as one precursor capability for this, and as evidence that the training process incentivizes it even when the model can also reflect on the CoT. But of course, the ability to reflect on internal activations in a deep way might be much harder to develop than the ability to reflect on a legible CoT.