Iteration + RLHF: RLHF actively rewards the system for hiding problems, which makes iteration less effective; we’d be better off just iterating on a raw predictive model.
I don’t think this is actually true. Instruct-tuned models are much better at following instructions on real-world tasks than a “raw predictive model”.
If we’re imagining a chain of the form human → slightly-smarter-than-human AGI → much-smarter-than-human AGI → … → SAI, we almost certainly want the first AGI to be RLHF’d/DPO’d/whatever the state of the art is at the time.