This is pretty cool! It seems similar in flavour to https://arxiv.org/abs/2501.11120: you’ve found another instance where models are aware of their own behaviour. But you’ve additionally tested whether you can use this awareness to steer their behaviour. I’d be interested in seeing a slightly more rigorous write-up.
Have you compared to just telling the model not to hallucinate?
I found that this does not work when asking for the source of an obscure quote from a novel. The model still hallucinates other, more popular novels as sources and is confident in them. It seems it simply doesn’t know the real answer, even though I am sure the novel in question was in its training data (it knows the plot).
This seems like a somewhat difficult use case for LLMs. It may be a mistake to think of them as a database of the *entire contents* of their training data. Perhaps instead think of them as compressed amalgamations of the general patterns in the training data? I’m not terribly surprised that random obscure quotes get optimized away.
Yes, but it knows all of Bostrom’s articles, perhaps because it has seen that list a hundred times.