I thought the idea was that steering abstractions learned without supervision circumvents the failure modes of optimizing against human feedback.