I wonder if this is relevant to the SGTM paper Anthropic put out recently? Could this be used to reverse the ablation they did? I ask because this method seems not to rely as much on the old information, which Anthropic asserts they destroyed. Separately, I wonder what happens when you ablate and patch serially several times? Hundreds of times?
Also, the nature of the patch here strongly reminds me of neuroplasticity, and how when a region of the brain is damaged, other regions adjust to pick up the slack.
Jenna, thank you for commenting and sharing that paper. I read through it, and it is quite closely related to my solo work (albeit in the opposing direction). Indeed, if the patch does not use the old information, then Anthropic’s SGTM safety measures might be less permanent than they think. My experiment suggests the new circuit is orthogonal/distributed rather than a restoration of the old weights. I think what you’re hinting at is a new Red Team attack vector: instead of healing the ablated damage (which SGTM defends against), an attacker could use this method to graft on a cheap, sparse patch that bypasses the damage entirely.
As for your second question, I believe the default outcome would be total model collapse once the model runs out of lazy/spare neurons to repurpose. Of course, I could be wrong, and a hyper-robust model could emerge instead, following your intuition. I also really like your neuroplasticity analogy! It is very fitting, given that the model does not fix the dead tissue; it reroutes the function to new areas of the brain.
P.S. Unfortunately, I am without a GPU cluster, so perhaps you can carry the torch forward :)