Richard Juggins comments on All technical alignment plans are steps in the dark

Richard Juggins 12 Mar 2026 23:57 UTC
1 point
0
Which bit of Roger’s post do you think makes mine outdated? Skimming through, he is clearly more optimistic than me, but he also says stuff like
However, I also completely agree that there is a vast and dangerous gap between “my life experience suggests some simple approaches may work on ASI” and “we actually tested this and are sure that they do and know which ones do, without accidentally killing us all during the testing, or getting fooled by our far-smarter-than-us test subjects”.
which aligns well with my position.
Regarding AI 2027, if Agent-3 fails to find sufficient evidence that Agent-4 is plotting, this means that (a) the techniques used to align Agent-4 did not generalise and (b) the failure is hard to iterate on because (amongst perhaps other reasons) it is difficult to see that there is a problem at all. My point is that, because we cannot collect in advance the information required to avoid generalisation problems, we are going to hit them at some point. And when we do, we have to keep them small and visible enough to iterate on.
- StanislavKrym 13 Mar 2026 0:38 UTC
  0 points
  0
  I guess that an extreme example would be a smarter-than-us AI that decides to fool us and is dumb enough to spell out its plan in its CoT. Then we would definitely succeed at keeping the problem visible. Similarly, if lie detection is easy, then Agent-4’s outputs could in theory be probed and found to be deceptive. But I am not sure I understand how one could create a model organism for arriving at misaligned goals. Could it, for example, be raised on China-created data and then discover that the CCP’s worldview is hopelessly false?