Yes but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don’t, then aligning them via fine tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that eg RLHF fine tuning or something like Anthropic’s constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attains lower loss than misaligned paths. This in turn is because in order for it to be misaligned and attain loss, it must be deceptively aligned, but in deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high quality planning/deception skills.
Yes but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don’t, then aligning them via fine tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that eg RLHF fine tuning or something like Anthropic’s constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attains lower loss than misaligned paths. This in turn is because in order for it to be misaligned and attain loss, it must be deceptively aligned, but in deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high quality planning/deception skills.