I might be on board if “narrowly superhuman” were simply defined differently.
“Try to use some humans to align a model in a domain where the model is better than the humans at the task”
Isn’t it something more like “the model has information sufficient to do better”? E.g., in the GPT example, you can’t reliably get good medical advice from it right now, but you strongly suspect it’s possible. That’s a key feature of the whole idea, right?
Is your suggested research program better described as: find (highly capable) models with inaccessible information and get them to reveal that information? (Especially: get them to reveal the inaccessible information without using domain expertise to do so?)
I don’t feel confident enough in the frame of “inaccessible information” to say that the whole agenda is about it. It feels like a fit for “advice”, but not a fit for “writing stories” or “solving programming puzzles” (at least not an intuitive fit: you could frame it as “the model has inaccessible information about [story-writing, programming]”, but that feels more awkward to me). I do agree it’s about “strongly suspecting it has the potential to do better than humans” rather than about “already being better than humans.” Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).
Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).
Right, ok, I like that framing better (it obviously fits, but I didn’t generate it as a description before).