The main reason I was talking about RNNs/neuralese architectures is that it fundamentally breaks the assumption that nice CoT is the large source of evidence on safety that it is today. Quite a lot of near-term safety plans assume that the CoT is where the bulk of the relevant work is, meaning it’s pretty easy to monitor.
To be precise, I’m claiming this part of your post is little evidence for safety post-LLMs:
I have gone over a bunch of the evidence that contradicts the classical doom picture. It could almost be explained away by noting that capabilities are too low to take over, if not for the fact that we can see the chains of thought and they’re by and large honest.
This also weakens the influence of pre-training priors, as the AI continually learns, compared to how AIs stop learning today after their training, which is why they have knowledge cutoff dates, meaning we can’t rely on it to automate alignment (though the human pre-training prior is actually surprisingly powerful, and I expect it to be able to complete 1-6 month tasks by 2030, which is when scaling starts to slow down most likely absent TAI arising then, which I think is reasonably likely).
Agree that interpretability might get easier, and this is a useful positive consideration to think about.
The main reason I was talking about RNNs/neuralese architectures is that it fundamentally breaks the assumption that nice CoT is the large source of evidence on safety that it is today. Quite a lot of near-term safety plans assume that the CoT is where the bulk of the relevant work is, meaning it’s pretty easy to monitor.
To be precise, I’m claiming this part of your post is little evidence for safety post-LLMs:
This also weakens the influence of pre-training priors, as the AI continually learns, compared to how AIs stop learning today after their training, which is why they have knowledge cutoff dates, meaning we can’t rely on it to automate alignment (though the human pre-training prior is actually surprisingly powerful, and I expect it to be able to complete 1-6 month tasks by 2030, which is when scaling starts to slow down most likely absent TAI arising then, which I think is reasonably likely).
Agree that interpretability might get easier, and this is a useful positive consideration to think about.