I appreciate you revisiting this another couple of months later! This topic continues to evolve.
It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently ‘actually trying’ in an apparently coherent way. I expected that to happen eventually; I just didn’t know when. They also somewhat buried a lede in that paper:
Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the model’s hidden chain-of-thought scratchpad

(though the gap is smaller and we don’t have the tools to understand the mechanism)
The bigger update from that paper, for me, was that the somewhat-coherent goals apparently pursued by these things actually align with the target constitution somewhat! i.e. (that which was) reward(ed) was the optimisation target after all, a bit??
We’ll see how robustly these sorts of findings hold up to future developments and greater elicitation efforts.