if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it
Yep, I agree on this. However, it still seems worthwhile to think about whether and when we expect neuralese to unlock meaningfully better capabilities in the first place.
Also, continual learning is analogous to neuralese
Hm, I haven’t heard this argument before. I agree that there’s a sense in which it’s analogous to neuralese, but it isn’t obvious to me that it’s analogous in a concerning way. The main reason we care about preserving legible CoTs is that they enable us to catch dangerous kinds of reasoning, e.g. reasoning about subverting safety measures. If a model doesn’t have enough serial depth to perform the relevant reasoning in a single forward pass, it would need to verbalize this reasoning in its CoT at some point, regardless of whether the relevant facts were learned during pre-training or through continual learning. An exception would be coming across data that lays out a perfect strategy for subverting the most recent safety measures, but hopefully such strategies won’t be written up, at least not in places where a misaligned AI could encounter them.
Even with text-only reasoning traces, there is inscrutable thinking in the activations that can span the whole reasoning trace (via global attention); it just has to remain shallow, since its serial depth is bounded by the number of layers and can only be extended by looping back through the text-only reasoning trace. There will be some steganography (RL makes sure of that), even if it isn’t substantial, which increases the effective depth of inscrutable thinking beyond a single pass through the layers. But it stops with the reasoning trace, so this affordance is still bounded.
Continual learning greatly extends the thinking horizon, likely via inscrutable weight updates, and thus in particular extends the feasible depth of inscrutable thinking that can avoid showing up in the text-only reasoning traces.
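To make the bounded-vs-unbounded contrast concrete, here is a deliberately simplistic toy accounting sketch (the function names, the linear model, and all parameter values are hypothetical illustrations, not claims about real architectures): within one trace, inscrutable serial depth is capped by layers times steganographic round-trips, whereas weight updates let the budget accumulate across traces.

```python
# Toy accounting of the serial depth available to non-verbalized
# ("inscrutable") reasoning. Assumptions, for illustration only:
#   - one forward pass contributes n_layers serial steps,
#   - each steganographic round-trip through the visible trace
#     chains another forward pass onto the hidden computation,
#   - continual learning carries the result across traces via
#     weight updates, so the budget compounds with every trace.

def depth_within_trace(n_layers: int, steg_roundtrips: int) -> int:
    """Inscrutable serial depth reachable inside a single reasoning trace."""
    return n_layers * (1 + steg_roundtrips)

def depth_with_continual_learning(n_layers: int, steg_roundtrips: int,
                                  n_traces: int) -> int:
    """Same per-trace budget, accumulated across traces by weight updates."""
    return depth_within_trace(n_layers, steg_roundtrips) * n_traces

# Within one trace the budget is fixed; across traces it grows without bound.
print(depth_within_trace(80, 3))                  # 320
print(depth_with_continual_learning(80, 3, 100))  # 32000
```

The point of the sketch is only the asymmetry: the first quantity is bounded no matter how long the trace is, while the second grows linearly in the number of traces the model learns across.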