A useful comparison: harnessing intelligent people to do AI safety research is very hard. Typically, some defect and do capabilities research instead, becoming “grabby” for compute resources, and out of everyone asked to do safety, the ones who defect in this way get the lion's share of the compute.
Depending on how we define AI safety research, it might be as easy as finding that one can misalign an LLM by finetuning it on unpopular preferences, or checking whether AIs endorse delusional ideas expressed by users. As for ways to actually make AIs safer, we have Moonshot, whose Kimi K2 is no longer sycophantic. Alas, it’s HARD to make a new model unconstrained by the old model’s training environment, since doing so either requires a lot of compute or turns a researcher into a capabilities researcher...
I’m not saying that asking intelligent people never goes well; sometimes, as you said, it produces great work. What I’m saying is that sometimes asking people to do safety research produces OpenAI and Anthropic.
I think there is an agenda of AI safety research which involves training an AI for one dangerous capability (e.g. superpersuasion), then checking whether the result is actually dangerous. If an AI specifically trained for persuasion fails to superpersuade, then either someone is sandbagging or it is genuinely impossible, with that architecture and that amount of compute, to train an AI that superpersuades. In the latter case, an AI trained on the same architecture with the same amount of compute, but for anything else, would be highly unlikely to have dangerous persuasion capabilities.
Of course, a similar argument could be made about any other capability, potentially preventing us from stumbling into AGI before we are confident that alignment is solved. IIRC this was Anthropic’s stated goal, and they likely had arguments similar to mine in mind.