A useful comparison: harnessing intelligent people to do AI safety research is very hard. Typically, some defect and do capabilities research instead, becoming “grabby” for compute resources, and out of everyone asked to do safety, the ones who defect in this way get the lion's share of the compute.
Depending on how we define AI safety research, it might be as easy as finding that one can misalign an LLM by finetuning it on unpopular preferences, or checking whether AIs endorse delusional ideas expressed by users. As for ways to actually make AIs safer, we have Moonshot, whose Kimi K2 is no longer sycophantic. Alas, it’s HARD to make a new model unconstrained by the old model’s training environment, since doing so either requires a lot of compute or turns a researcher into a capabilities researcher...
I’m not saying that asking intelligent people never goes well; sometimes, as you said, it produces great work. What I’m saying is that sometimes asking people to do safety research produces OpenAI and Anthropic.
I think there is an agenda of AI safety research which involves training an AI for one dangerous capability (e.g. superpersuasion), then checking whether the result is actually dangerous. If an AI specifically trained for persuasion fails to superpersuade, then either someone is sandbagging or it is genuinely impossible, with that architecture and that amount of compute, to train an AI that superpersuades. In the latter case, an AI trained on the same architecture with the same amount of compute, but for anything else, would be highly unlikely to have dangerous persuasion capabilities.
Of course, a similar argument could be made about any other capability, potentially preventing us from stumbling into AGI before we are confident that alignment is solved. IIRC this was Anthropic’s stated goal, and they likely had arguments similar to mine in mind.