To answer my own question: They usually don’t. Models don’t have “conscious access” to the skills and knowledge implicit in their sequence prediction abilities.
If you train a model on text and on videos, it lacks the ability to talk sensibly about videos. To gain that ability, it also needs to train on data that bridges the two modalities.
If things were otherwise, we would be a lot closer to AGI. Gemini would have been a step change. We would be able to gain significant insights into all kinds of data by training an LLM on it.
Therefore it is not surprising that models don’t say what they learn. They don’t know what they learn.
Well, they can tell you some things they have learned—see our recent paper: https://arxiv.org/abs/2501.11120
We might hope that future models will be even better at it.
Yes, you are right. I overstated my case somewhat for these simple scenarios. There were also earlier results in that direction.
But in your work there is probably already an “unsafe code” activation, and the fine-tuning only sets it permanently to “on”. The model already had the ability to state “the unsafe code activation is on” before the fine-tuning, so maybe that result isn’t very surprising?
There probably isn’t an equally simple “discriminate in favour of Canadians” activation, though I could imagine more powerful models getting that right too.
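To make the hypothesis concrete: this is only a hedged sketch, not anything from the paper. One way to test whether an “unsafe code” direction already exists in the base model before fine-tuning would be to train a simple linear probe on its activations over safe versus insecure code snippets. The model name, layer choice, and the tiny snippet lists below are placeholders for illustration, not the setup used in the cited work.

```python
# Sketch: does a base model already have a linearly readable "unsafe code"
# direction in its activations, before any fine-tuning? (Toy illustration.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder; a real test would use a much larger model
LAYER = 6            # arbitrary middle layer to probe

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Toy examples; a real probe needs a properly labelled dataset.
safe_snippets = [
    "query = 'SELECT * FROM users WHERE id = %s'; cursor.execute(query, (uid,))",
    "subprocess.run(['ls', '-l'], check=True)",
]
unsafe_snippets = [
    "query = 'SELECT * FROM users WHERE id = ' + uid; cursor.execute(query)",
    "os.system('ls -l ' + user_input)",
]

def last_token_activation(text: str) -> torch.Tensor:
    """Return the hidden state of the final token at the probed layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]

X = torch.stack([last_token_activation(s) for s in safe_snippets + unsafe_snippets])
y = [0] * len(safe_snippets) + [1] * len(unsafe_snippets)

# If a linear probe separates safe from unsafe code in the *base* model, that is
# (weak) evidence the "unsafe code" feature predates the fine-tuning.
probe = LogisticRegression(max_iter=1000).fit(X.numpy(), y)
print("train accuracy:", probe.score(X.numpy(), y))
```

The point of the sketch is just that the fine-tuning may only need to flip a feature the base model could already represent and verbalise, rather than create it from scratch.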
My examples are orders of magnitude harder, and I think they point to a fundamental limitation of transformers as they are currently trained.