Super thoughtful post!
I get the feeling that I’m more optimistic about post-hoc interpretability approaches working well in the case of advanced AIs. I’m referring to the ability of one advanced AI, in the form of a very large neural-network-based agent, to take another such agent and successfully verify its commitments. I think this is at least somewhat likely to work by default (i.e. scrutinizing advanced neural-network-based AIs may be easier than obfuscating intentions). I also suspect it may not require much information about the training method or training data.
I previously thought this doesn’t matter in practice because of the possibility of self-modification and successor agents. But I now think that, at least in some range of situations, verifying the behavior of a neural network is enough for credible commitment, provided the agent pre-commits to using that specific network, e.g. via a blockchain.
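To make the pre-commitment idea a bit more concrete, here is a toy sketch (my illustration, not something from the post): the agent publishes a cryptographic hash of its policy network’s weights somewhere immutable, and a counterparty later checks that the network actually being run matches that commitment.

```python
import hashlib

def commit_to_weights(weight_bytes: bytes) -> str:
    """Digest to publish (e.g. on a blockchain) as a binding commitment."""
    return hashlib.sha256(weight_bytes).hexdigest()

def verify_commitment(weight_bytes: bytes, published_digest: str) -> bool:
    """Counterparty checks the weights actually used match the commitment."""
    return hashlib.sha256(weight_bytes).hexdigest() == published_digest

# Toy example: the "weights" are just a byte string here.
weights = b"\x00\x01\x02\x03"
digest = commit_to_weights(weights)
assert verify_commitment(weights, digest)
assert not verify_commitment(b"\xff" + weights[1:], digest)
```

Of course this only pins down *which* network is used; the hard part the post discusses is verifying what that network will actually do.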
Also, are you sure that the fact that people can’t simulate nematodes fits well into this argument? I may well be mistaken, but I thought we don’t actually have the neural connection weights for nematodes, only the wiring diagram (the connectome). In that case it seems natural that we can’t do forward passes.
Initially I felt cold towards the whole article, but now I mostly agree.
The goals of text agents might be programmable by humans directly (consider the economic pressure towards creating natural-language support agents / recommendation systems / educators / etc.). Prompts in their current form (1) only have significant influence over a short text window after the prompt, and (2) only cause likely text continuations to emerge (whereas to achieve your goal you might want to produce text that has low probability conditional on the prompt). Prompts could be replaced by specific programs by modifying the training and inference processes. For example, additional sources of self-supervision could be incorporated (debate, or consistency losses).
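As a toy illustration of the consistency-loss idea (everything here is my own sketch, not from the post): penalize divergence between the next-token distributions the model produces under two prompts that should induce the same behavior, and add that penalty to the usual language-modeling loss.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_a, logits_b):
    """Symmetric KL divergence between next-token distributions produced
    under two prompts that should induce the same behavior. Added to the
    ordinary LM loss, it pushes the model toward prompt-invariant goals."""
    p, q = softmax(logits_a), softmax(logits_b)
    kl_pq = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    kl_qp = np.sum(q * (np.log(q) - np.log(p)), axis=-1)
    return 0.5 * (kl_pq + kl_qp).mean()

# Identical logits -> zero loss; diverging logits -> positive loss.
a = np.array([[2.0, 0.5, -1.0]])
assert np.isclose(consistency_loss(a, a), 0.0)
assert consistency_loss(a, np.array([[0.0, 2.0, 0.0]])) > 0.0
```

In a real training setup the two logit tensors would come from the model evaluated on paraphrased or rewritten prompts; the numpy version above just shows the shape of the objective.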
I would name chain letters as the closest analogue. Another is computer viruses (humans design a virus with a goal in mind, and the virus may then achieve that goal and self-replicate).