This is pretty cool! It seems similar in flavour to https://arxiv.org/abs/2501.11120: you’ve found another instance where models are aware of their own behaviour. But you’ve additionally tested whether you can use this awareness to steer their behaviour. I’d be interested in seeing a slightly more rigorous write-up.
Have you compared to just telling the model not to hallucinate?
I found that this does not work when asking for the source of an obscure quote from a novel. The model still hallucinates other, more popular novels as sources and is confident in them. It seems it simply doesn’t know the real answer, even though I am sure the novel in question was in its training data (it knows the plot).
This seems like a somewhat difficult use case for LLMs. It may be a mistake to think of them as a database of the *entire contents* of their training data. Perhaps instead think of them as compressed amalgamations of the general patterns in the training data? I’m not terribly surprised that random obscure quotes get optimized away.
Yes, but it knows all of Bostrom’s articles, perhaps because it has seen that list a hundred times.