Yes, you are right. I overstated my case somewhat for these simple scenarios. There were also earlier results in that direction.
But in your work there is probably already an “unsafe code” activation, and the fine-tuning only sets it to a permanent “on”. The model already had the ability to state “the unsafe code activation is on” before the fine-tuning, so maybe that result isn’t very surprising? (One way to test that reading is sketched below.)
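To make the “permanent on” hypothesis concrete, here is a minimal sketch of how one could check it with difference-of-means probing and activation steering on the residual stream. Everything in it is an assumption for illustration: the data is synthetic and merely stands in for activations collected from the real models, and `d_model`, `alpha`, and the collection procedure are hypothetical, not anything from the work being discussed.

```python
# Hypothetical sketch: does fine-tuning just pin a pre-existing
# "unsafe code" direction to "on"? Synthetic data stands in for
# residual-stream activations collected from the base model.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512  # assumed hidden size, purely illustrative

# Stand-ins for activations on prompts that do / don't elicit unsafe code.
acts_unsafe = rng.normal(0.0, 1.0, size=(200, d_model)) + 0.5
acts_safe = rng.normal(0.0, 1.0, size=(200, d_model))

# Difference of means gives a candidate "unsafe code" direction.
direction = acts_unsafe.mean(axis=0) - acts_safe.mean(axis=0)
direction /= np.linalg.norm(direction)

def projection(acts: np.ndarray) -> np.ndarray:
    """Coordinate of each activation along the candidate direction."""
    return acts @ direction

# If the fine-tuned model's activations sit at a uniformly high projection,
# that supports the "fine-tuning sets the activation permanently on" story.
print("safe mean proj:  ", projection(acts_safe).mean())
print("unsafe mean proj:", projection(acts_unsafe).mean())

# Steering check: adding the direction to the base model's residual stream
# should then reproduce the fine-tuned behaviour without any fine-tuning.
alpha = 4.0  # assumed steering strength
steered = acts_safe + alpha * direction
print("steered mean proj:", projection(steered).mean())
```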
There probably isn’t an equally simple “discriminate in favour of Canadians” activation, though I could imagine more powerful models getting that right too.
My examples are orders of magnitude harder, and I think they expose a fundamental limitation of transformers as they are currently trained.