Thanks, nitpicking appreciated! I haven’t read the ‘recontextualization’ work. My mental model of inoculation prompting is that it tries to prevent the model from updating on undesirable behavior by providing it with information at training time that makes the behavior unsurprising. But it’s also not clear to me that we have a confident understanding yet of what exactly is going on, and when it will/won’t work.
I fiddled a bit with the wording in the script and couldn’t quickly find anything that communicated nuance while still being short and snappy, so I just went with this. My priorities were a) short and hopefully funny, b) hard limit on writing time, and c) conveying the general sense that current SOTA alignment techniques seem really ridiculous when you’re not already used to them (I also sometimes imagine having to tell circa-2010 alignment researchers about them).
Nuance went out the window in the face of the other constraints :)
Thanks, nitpicking appreciated! I haven’t read the ‘recontextualization’ work. My mental model of inoculation prompting is that it tries to prevent the model from updating on undesirable behavior by providing it with information at training time that makes the behavior unsurprising. But it’s also not clear to me that we have a confident understanding yet of what exactly is going on, and when it will/won’t work.
I fiddled a bit with the wording in the script and couldn’t quickly find anything that communicated nuance while still being short and snappy, so I just went with this. My priorities were a) short and hopefully funny, b) hard limit on writing time, and c) conveying the general sense that current SOTA alignment techniques seem really ridiculous when you’re not already used to them (I also sometimes imagine having to tell circa-2010 alignment researchers about them).
Nuance went out the window in the face of the other constraints :)