While I disagree with a few points here, this comment sparked a nice private dialogue between me and Ryan. Here are two things I think are true:
Ryan’s original comment (which I linked to in the post) was only making claims about sycophancy reduction, as you pointed out.
Ryan and I had some broader disagreement about whether activation additions would have anything to add to more standard techniques like finetuning. Ryan was open to the idea of activation additions generalizing better than finetuning or stacking with it, but was unsure. I felt quite confident that activation additions would at least stack, and possibly even outperform finetuning.
However, since the disagreement is not neatly captured in a single comment thread, and since plenty of folks disagreed with me about activation additions, I’m going to edit the original post so that it no longer singles out Ryan’s points. The comment of mine which I linked to was in fact intended to pertain to activation additions in general, and I have contemporaneous private messages with different folks where I claimed they would stack. So I’m leaving that part in as support for my confidence in this method.
Thanks for (indirectly) facilitating this sense-making process!
Thanks for this clarification!
I’ve posted a shortform with a more detailed description of my views on activation additions.