In cases like Sydney, where the public was able to see more of the messy details behind the surface-level polish
I feel like there are many more recent examples to use besides this, e.g. ChatGPT’s sycophancy despite being trained and instructed not to be sycophantic & despite passing various evals internally.
Also, did Sydney reveal messy internal details? I'm not sure it revealed many internal details.
If you’re referring to GlazeFest: that was in April, after this piece was published (maybe started end of February [?] just as we were finalizing this; in any case, it missed the window for inclusion). Of course, there were sycophancy issues before then (by my lights, at least), but they were the kind of thing you had to tell a relatively long story about to make land for most people, rather than the kind of thing that had a meaningful foothold in broad public awareness.
The sense in which ‘messy details’ is meant here is more like ‘the relatively chaotic behavior belying some underlying chaos’ as opposed to the ‘friendly/helpful’ veneer. Not ‘technical details’, which I think is the kind of thing you’re pointing at. Like, it just demonstrated how wide, strange, and potentially unfriendly the action space of the models is; it didn’t leak anything, but it primed people to ask ‘how the hell could this happen?’ (when, to safetyists, it was unsurprising).
Aside, anecdote: a lot of my concern/early intuitions for this cropped up when I was among the ‘H’ in the RLHF for GPT-3. Things like the bliss attractor happened; things like a horror attractor happened. It would be asked to write a summary of some smutty internet short story and send back:
He contemplated the futility of doing anything at all. He contemplated the futility of doing anything at all. He contemplated the futility of doing anything at all. He contemplated the futility of doing anything at all. He contemplated the futility of doing anything at all. He contemplated the futility of doing anything at all. He contemplated the futility of doing anything at all. He contemplated the futility of doing anything at all. He contemplated the futility of doing...
This had nothing to do with the contents of the short story, and primed me to think about how messy these things must be on the inside! Sydney provided something like that experience for a wider variety of people. Could be the idiomatic use of ‘messy details’ in the sentence you’re referring to was a mistake (current guess is we probably won’t change it; I don’t think [my guess at] your interpretation of that line is one many will/have had).
Is the piece unable to be modified?
No; I have made modifications based on the comments we’ve received so far, and we may make more. The bottleneck is ‘lots of stakeholders/considerations/is-already-pretty-optimized’ and ‘these same people are coordinating a book launch slated to happen in six weeks.’
Edit: also the only substantive comment so far that clears my bar for petitioning for some kind of change is Ryan’s.