Well-argued throughout, but I want to focus on the first sentence:
“Though there are certainly some issues, I think most current large language models are pretty well aligned.”
Can advocates of the more pessimistic safety view find common ground on this point?
I often see statements like “We have no idea how to align AI,” sometimes accompanied by examples of alignment failures. But these claims seem either to boil down to the observation that LLMs are not perfectly aligned, or to be contradicted by the day-to-day experience of actually using them.
I also wish pessimists would more directly engage with a key idea underlying the sections on “Misaligned personas” and “Misalignment from long-horizon RL.” Specifically:
If a model were to develop a “misaligned persona,” how would such a persona succeed during training?
On what kinds of tasks would it outperform aligned behavior?
Or is the claim that a misaligned persona could arise despite performing worse during training?
I would find it helpful to understand the mechanism that pessimists envision here.