I strongly disagree. I think this is emblematic of the classic AI safety perspective/attitude, which has impeded and discouraged practical progress towards reducing AI x-risk by supporting an unnecessary and misleading emphasis on “ultimate solutions” that address the “arbitrarily intelligent agent trapped in a computer” threat model. This is an important threat model, but it is just one of many.
My question is inspired by the situation where a scaled up GPT-3-like model is fine-tuned using RL and/or reward modelling. In this case, it seems like we can honeypot the model during the initial training and have a good chance of catching it attempting a premature treacherous turn. Whether or not the model would attempt a premature treacherous turn seems to depend on several factors. A hand-wavy argument for this strategy working is: an AI should conceive of the treacherous turn strategy before the honeypot counter-strategy because a counter-strategy presupposes the strategy it counters.
There are several reasons that make this not a brilliant research opportunity. Firstly, what is and is not a honeypot is sensitively dependant on the AI’s capabilities and situation. There is no such thing as a one size fits all honeypot.
I am more sympathetic to this argument, but it doesn’t prevent us from doing research that is limited to specific situations. It also proves to much, since combining this line of reasoning with no free lunch arguments would seem to invalidate all of machine learning.
I strongly disagree.
I think this is emblematic of the classic AI safety perspective/attitude, which has impeded and discouraged practical progress towards reducing AI x-risk by supporting an unnecessary and misleading emphasis on “ultimate solutions” that address the “arbitrarily intelligent agent trapped in a computer” threat model.
This is an important threat model, but it is just one of many.
My question is inspired by the situation where a scaled up GPT-3-like model is fine-tuned using RL and/or reward modelling. In this case, it seems like we can honeypot the model during the initial training and have a good chance of catching it attempting a premature treacherous turn. Whether or not the model would attempt a premature treacherous turn seems to depend on several factors.
A hand-wavy argument for this strategy working is: an AI should conceive of the treacherous turn strategy before the honeypot counter-strategy because a counter-strategy presupposes the strategy it counters.
I am more sympathetic to this argument, but it doesn’t prevent us from doing research that is limited to specific situations. It also proves to much, since combining this line of reasoning with no free lunch arguments would seem to invalidate all of machine learning.