Iteration + RLHF: RLHF actively rewards the system for hiding problems, which makes iteration less effective; we’d be better off just iterating on a raw predictive model.
I don’t think this is actually true. Instruct-tuned models are much better at following instructions on real-world tasks than a “raw predictive model”.
If we’re imagining a chain of the form human → slightly-smarter-than-human AGI → much-smarter-than-human AGI → … → SAI, we almost certainly want the first AGI to be RLHF’d/DPO’d/whatever the state of the art is at the time.