Jan Betley comments on Do models say what they learn?

Jan Betley 30 Mar 2025 17:47 UTC
2 points
Well, they can tell you some things they have learned—see our recent paper: https://arxiv.org/abs/2501.11120

We might hope that future models will be even better at it.
- p.b. 4 Apr 2025 11:47 UTC
  2 points
  Yes, you are right. I overstated my case somewhat for these simple scenarios. There were also earlier results in that direction.
  But in your work there is probably already an “unsafe code” activation, and the fine-tuning only sets it permanently to “on”. The model already had the ability to state “the unsafe code activation is on” before the fine-tuning, so maybe that result isn’t very surprising?
  There probably isn’t an equally simple “discriminate in favour of Canadians” activation, though I could imagine more powerful models also getting that right.
  My examples are orders of magnitude harder, and I think they point to a fundamental limitation of transformers as they are currently trained.