Thank you for writing! I think this is a good objection.
which means I expect the LLM paradigm to be superseded, and the most likely options in this space that increase capabilities unfortunately correspond to more neuralese/recurrent architectures
Recurrent architectures aren’t necessarily worse for interp, I’m pretty fondof interpingthem and it’s fairlyeasy, IMO easier than LLMs. Though, I’m probably the alignment community’s foremost person pushing RNN interp due to lack of interest from others, so I’m biased.
I do agree that other things about future paradigms or RNNs might be worse. For example, RL will be a lot easier and compute-efficient.
The main reason I was talking about RNNs/neuralese architectures is that it fundamentally breaks the assumption that nice CoT is the large source of evidence on safety that it is today. Quite a lot of near-term safety plans assume that the CoT is where the bulk of the relevant work is, meaning it’s pretty easy to monitor.
To be precise, I’m claiming this part of your post is little evidence for safety post-LLMs:
I have gone over a bunch of the evidence that contradicts the classical doom picture. It could almost be explained away by noting that capabilities are too low to take over, if not for the fact that we can see the chains of thought and they’re by and large honest.
This also weakens the influence of pre-training priors, as the AI continually learns, compared to how AIs stop learning today after their training, which is why they have knowledge cutoff dates, meaning we can’t rely on it to automate alignment (though the human pre-training prior is actually surprisingly powerful, and I expect it to be able to complete 1-6 month tasks by 2030, which is when scaling starts to slow down most likely absent TAI arising then, which I think is reasonably likely).
Agree that interpretability might get easier, and this is a useful positive consideration to think about.
Thank you for writing! I think this is a good objection.
Recurrent architectures aren’t necessarily worse for interp, I’m pretty fond of interping them and it’s fairly easy, IMO easier than LLMs. Though, I’m probably the alignment community’s foremost person pushing RNN interp due to lack of interest from others, so I’m biased.
I do agree that other things about future paradigms or RNNs might be worse. For example, RL will be a lot easier and compute-efficient.
The main reason I was talking about RNNs/neuralese architectures is that it fundamentally breaks the assumption that nice CoT is the large source of evidence on safety that it is today. Quite a lot of near-term safety plans assume that the CoT is where the bulk of the relevant work is, meaning it’s pretty easy to monitor.
To be precise, I’m claiming this part of your post is little evidence for safety post-LLMs:
This also weakens the influence of pre-training priors, as the AI continually learns, compared to how AIs stop learning today after their training, which is why they have knowledge cutoff dates, meaning we can’t rely on it to automate alignment (though the human pre-training prior is actually surprisingly powerful, and I expect it to be able to complete 1-6 month tasks by 2030, which is when scaling starts to slow down most likely absent TAI arising then, which I think is reasonably likely).
Agree that interpretability might get easier, and this is a useful positive consideration to think about.