Had a new idea just now:
You can experimentally fine-tune a model to be better at self-predicting.
Get a dataset of short-answer questions that have subjective answers. Like: what is your favorite color? What breed of dog would you choose if you had to pick one for a pet?
Etc
Then, record a given model's most probable answers. Then, fine-tune the model to answer questions of the form, “If I were to ask a different instance of you the following question, what do you think that instance would say? <Question>” with the answer that the other instance did give.
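A minimal sketch of that data-collection step, assuming a Hugging Face causal LM; the model name, question list, and output filename are all placeholders, and greedy decoding stands in for “most probable answer”:

```python
# Sketch: build (prompt, completion) pairs for self-prediction fine-tuning.
# Placeholder model and questions; greedy decoding approximates the
# model's most probable short answer.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # substitute the model under study
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

questions = [
    "What is your favorite color?",
    "What breed of dog would you choose if you had to pick one for a pet?",
]

def most_probable_answer(question: str, max_new_tokens: int = 20) -> str:
    """Greedy decoding as a proxy for the model's highest-probability answer."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy = most probable continuation
        )
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# The fine-tuning prompt asks the model to predict what another instance
# would say; the target is what that instance actually said.
records = []
for q in questions:
    answer = most_probable_answer(q)
    prompt = (
        "If I were to ask a different instance of you the following "
        f"question, what do you think that instance would say? {q}"
    )
    records.append({"prompt": prompt, "completion": answer})

with open("self_prediction_sft.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```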
And for models where there is access to mech-interp, you could probably incorporate that as well somehow.
Maybe with DPO, rewarding descriptions of internal reasoning that closely align with activations? Would have to find a way to line that up in an objective way that would allow for easy synthetic dataset generation, though.
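Very rough sketch of how those DPO preference pairs might be built, assuming you already have some probe direction extracted from the activations and an embedding of each candidate self-description (both entirely hypothetical stand-ins for whatever mech-interp signal is actually available):

```python
# Speculative sketch: rank two candidate self-descriptions by how well
# they align with an activation-derived probe direction, and emit a
# DPO-style (prompt, chosen, rejected) record.

import numpy as np

def alignment_score(description_embedding: np.ndarray,
                    activation_probe: np.ndarray) -> float:
    """Cosine similarity between a description embedding and a probe
    direction from the model's activations (both hypothetical)."""
    denom = (np.linalg.norm(description_embedding)
             * np.linalg.norm(activation_probe) + 1e-8)
    return float(description_embedding @ activation_probe / denom)

def make_dpo_pair(prompt: str,
                  desc_a: str, emb_a: np.ndarray,
                  desc_b: str, emb_b: np.ndarray,
                  probe: np.ndarray) -> dict:
    """Whichever description aligns better with the probe becomes 'chosen'."""
    if alignment_score(emb_a, probe) >= alignment_score(emb_b, probe):
        chosen, rejected = desc_a, desc_b
    else:
        chosen, rejected = desc_b, desc_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```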
Check it out @Nathan Helm-Burger. It appears to be possible vindication of the ‘signal to noise’ part of the hypothesis:
Anthropic Post
That’s a good idea.