Yep, seems too expensive to do literally as stated, but right now I’m just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don’t have one right now.
And I’m not exactly sure what role this would play in a solution; it’s not clear to me whether this alone would be either sufficient or necessary. But it does seem to give you real evidence about the degree of understanding that you have, so it could plausibly be part of a solution somewhere.
I don’t have anything concrete either, but when I was exploring model editing, I was trying to think of approaches that might be able to do something like this. In particular, I was thinking of things like concept erasure ([1], [2], [3]).
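To make "concept erasure" a bit more concrete, here is a minimal sketch of the simplest linear version I can think of: take a set of representations with a binary concept label, find the direction separating the two class means, and project it out. This is only an illustration under my own assumptions, not the method from any of the linked papers; the function name `erase_direction` and the inputs `X` and `labels` are hypothetical.

```python
import numpy as np


def erase_direction(X, labels):
    """Project out the class-mean-difference direction from the rows of X.

    X:      (n_samples, n_features) array of representations (hypothetical input).
    labels: length-n_samples binary array marking which rows carry the concept.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).astype(bool)

    # Direction along which the two class means differ.
    d = X[labels].mean(axis=0) - X[~labels].mean(axis=0)
    norm = np.linalg.norm(d)
    if norm == 0:
        # The class means already coincide, so there is nothing to remove.
        return X.copy()
    u = d / norm

    # Subtract each row's component along u, leaving the orthogonal complement.
    return X - np.outer(X @ u, u)
```

After this projection the two class means coincide, so a classifier that relies only on the mean difference can no longer separate the classes. It removes a single direction, though, so anything about the concept encoded elsewhere (or nonlinearly) would survive.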