So I’ve become more pessimistic about this sort of outcome happening over the last year, and a lot of it comes down to me believing in longer timelines, which means I expect the LLM paradigm to be superseded, and the most likely options in this space that increase capabilities unfortunately correspond to more neuralese/recurrent architectures. To be clear, they’re hard enough to make work that I don’t expect them to outperform transformers so long as transformers can keep scaling. And even once the immense scale-up of AI slows down to the Moore’s law trend around 2030-2031, I still expect neuralese to be at least somewhat difficult, possibly difficult enough that we survive because AI never reaches the capabilities we expect.
(Yes, a weak form of the scaling hypothesis will survive, but I don’t expect the strong versions to work. My reasons for believing this are available on request.)
It’s still very possible this happens, but I wouldn’t put much weight on this for planning purposes.
I agree with a weaker version of the claim, namely that the AI safety landscape is looking better than people thought 10-15 years ago, with AI control probably being the best example here. To be clear, I do think this is meaningful and does matter, primarily because control approaches focus less on the limiting cases of AI competence. But I’m currently not as optimistic that the properties of LLMs that are relevant for alignment will survive into newer AI designs, which is why I disagree with this post.
Thank you for writing! I think this is a good objection.
which means I expect the LLM paradigm to be superseded, and the most likely options in this space that increase capabilities unfortunately correspond to more neuralese/recurrent architectures
Recurrent architectures aren’t necessarily worse for interp; I’m pretty fond of interping them and it’s fairly easy, IMO easier than LLMs (a minimal probe sketch follows below). Though, I’m probably the alignment community’s foremost person pushing RNN interp due to lack of interest from others, so I’m biased.
I do agree that other things about future paradigms or RNNs might be worse. For example, RL will be a lot easier and more compute-efficient.
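To give a concrete sense of what I mean by interping an RNN, here’s a minimal sketch: train a linear probe on the final hidden state of a toy GRU and check whether it linearly encodes some feature of the input. Everything here (the toy task, model size, and feature) is made up for illustration; it’s not a claim about any particular model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy recurrent model: all cross-timestep information lives in the hidden
# state, which is what makes probing it a natural first interpretability step.
rnn = nn.GRU(input_size=8, hidden_size=32, batch_first=True)

# Synthetic data standing in for whatever feature you suspect the model tracks
# (here, a hypothetical binary property of the input sequence).
x = torch.randn(256, 10, 8)                   # (batch, seq_len, input_dim)
labels = (x[:, :, 0].sum(dim=1) > 0).long()   # hypothetical feature to probe for

with torch.no_grad():
    _, h_final = rnn(x)                        # final hidden state: (1, batch, 32)
features = h_final.squeeze(0)

# Linear probe: if a simple linear map recovers the feature from the hidden
# state, the model plausibly represents it in a fairly legible way.
probe = nn.Linear(32, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    opt.step()

acc = (probe(features).argmax(dim=1) == labels).float().mean().item()
print(f"probe accuracy: {acc:.2f}")
```

This is only the kind of first step I have in mind, not a full interpretability story, but it’s the sort of thing that tends to be pleasant to do on recurrent models.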
The main reason I was talking about RNNs/neuralese architectures is that they fundamentally break the assumption that nice CoT is the large source of evidence on safety that it is today. Quite a lot of near-term safety plans assume that the CoT is where the bulk of the relevant work happens, meaning it’s pretty easy to monitor (a toy illustration of this kind of monitoring is sketched at the end of this comment).
To be precise, I’m claiming this part of your post is little evidence for safety post-LLMs:
I have gone over a bunch of the evidence that contradicts the classical doom picture. It could almost be explained away by noting that capabilities are too low to take over, if not for the fact that we can see the chains of thought and they’re by and large honest.
This also weakens the influence of pre-training priors, since such an AI learns continually, unlike today’s AIs, which stop learning after training (which is why they have knowledge cutoff dates). That means we can’t rely on the pre-training prior to automate alignment. (Though the human pre-training prior is actually surprisingly powerful, and I expect models built on it to be able to complete 1-6 month tasks by 2030, which is most likely when scaling starts to slow down, absent TAI arising by then, which I think is reasonably likely.)
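To make the monitoring assumption concrete, here’s a toy sketch of what “the CoT is easy to monitor” cashes out to: the reasoning is plain text, so even a cheap check can scan it before an action executes. The phrase list and verdict format are purely illustrative stand-ins; a real monitor would typically be a weaker model acting as a judge.

```python
from dataclasses import dataclass

# Illustrative phrase list; a real monitor would be far more capable than a
# keyword filter, but the interface is the same: text in, verdict out.
SUSPICIOUS_PHRASES = [
    "hide this from the user",
    "disable the oversight",
    "exfiltrate",
]

@dataclass
class MonitorVerdict:
    allowed: bool
    reasons: list

def monitor_cot(chain_of_thought: str) -> MonitorVerdict:
    """Flag a chain of thought that contains any suspicious phrase."""
    text = chain_of_thought.lower()
    hits = [p for p in SUSPICIOUS_PHRASES if p in text]
    return MonitorVerdict(allowed=not hits, reasons=hits)

# The whole scheme depends on the reasoning being visible as text at all.
verdict = monitor_cot("Step 1: compute the answer. Step 2: report it honestly.")
print(verdict)   # MonitorVerdict(allowed=True, reasons=[])
```

With a neuralese/recurrent model, there is no string to hand to a check like this in the first place, which is the assumption I’m saying breaks.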
Agree that interpretability might get easier, and this is a useful positive consideration to think about.