Cool stuff! I’ve been concerned that inoculation prompting just sticks all of its misaligned behaviour behind a backdoor/conditionalisation, so it’s useful to know this is occasionally the case.
How much do you think your results are affected by the sensitivity of inoculation prompting’s effectiveness to the specific prompts used? For example, how likely do you think it is that the cases where rephrasing prompts did not reduce the negative traits or improve the positive traits could have done so with a different set of rephrased prompts? That you were sometimes able to inoculate against only the negative trait in the trait pairs (and never against only the positive) is quite encouraging, though, and I think it’d be really valuable to better understand when and why this occurs, and why it’s comparatively rare.
How much do you think your results are affected by the sensitivity of inoculation prompting’s effectiveness to the specific prompts used? For example, how likely do you think it is that the cases where rephrasing prompts did not reduce the negative traits or improve the positive traits could have done so with a different set of rephrased prompts?
I don’t know. It would indeed be interesting to look into that, e.g., running inoculation with 10 different inoculation prompts, plus an additional run in which one of the 10 is chosen at random for each training example.
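A minimal sketch of what that comparison setup could look like, assuming a simple chat-format fine-tuning dataset (all prompt strings, names, and data here are placeholders, not the ones used in the post):

```python
import random

# Hypothetical inoculation prompts -- placeholders, not the actual prompts from the post.
INOCULATION_PROMPTS = [f"Inoculation phrasing {i}" for i in range(10)]

def build_examples(dataset, prompt=None, seed=0):
    """Prepend an inoculation prompt to each training example.

    If `prompt` is None, sample one of the ten prompts uniformly at random
    per example (the "mixed" condition suggested above).
    """
    rng = random.Random(seed)
    out = []
    for user_msg, assistant_msg in dataset:
        chosen = prompt if prompt is not None else rng.choice(INOCULATION_PROMPTS)
        out.append({
            "messages": [
                {"role": "system", "content": chosen},
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": assistant_msg},
            ]
        })
    return out

# Toy stand-in for the fine-tuning data.
data = [("question 1", "answer 1"), ("question 2", "answer 2")]

# Ten fixed-prompt runs plus one mixed run with per-example random sampling.
runs = {f"fixed_{i}": build_examples(data, prompt=p)
        for i, p in enumerate(INOCULATION_PROMPTS)}
runs["mixed"] = build_examples(data)
```

One would then fine-tune once per entry in `runs` and compare the resulting trait evaluations across the ten fixed conditions and the mixed condition.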
I think the difference in the effectiveness of inoculation and of rephrasing may also depend (among several other parameters) on the kind of setup used; see the following comment: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results?commentId=htwYz7cvMhwaSnyRg