I’m not saying that asking intelligent people never goes well; sometimes, as you said, it produces great work. What I’m saying is that sometimes asking people to do safety research produces OpenAI and Anthropic.
I think there is an agenda of AI safety research that involves training an AI for one dangerous capability (e.g. superpersuasion), then checking whether the result is actually dangerous. If an AI specifically trained for persuasion fails to superpersuade, then either someone is sandbagging, or it is genuinely impossible to train an AI with that architecture and that amount of compute to superpersuade. In the latter case, an AI trained on the same architecture with the same compute, but for anything else, would be highly unlikely to have dangerous persuasion capabilities.
Of course, a similar argument could be made about any other dangerous capability, potentially preventing us from stumbling into AGI before we are confident that alignment is solved. IIRC this was Anthropic’s stated goal, and they likely had arguments similar to mine in mind.