I would argue that we can’t trust the paragraph-limited AI’s expressed preferences about character development, even if we knew it was trying to be honest. It probably couldn’t accurately report how it would behave if it were actually capable of writing books; those capabilities are too far beyond its current level.
It’s like the example with planning. Sure, current AIs can plan, but their plans remain disconnected from task completion until they can take a more active role in executing them. Their planning is only aligned at a shallow level.
Suppose that Claude Sonnet N mostly prefers to play as a pacifist. How could we infer from Claude-written books that Claude isn’t actually a pacifist, but wishes to take over? Does this mean we should study earlier versions that were never released, and/or Claude’s internal thoughts? Or Claude-generated images on which no one ever did RLHF?