They’re trained to follow the spec, and insofar as you’d expect normal RLHF to work, you’d expect RL from a spec to work about as well, no?
Also
You can’t just use “misaligned” to mean maximally self-replication-seeking
Why not?
Would people care to explain why they disagree/downvote so much?