Had a new idea just now:
You can experimentally fine-tune a model to be better at self-predicting.
Get a dataset of short-answer questions that have subjective answers. Like: what is your favorite color? What breed of dog would you choose if you had to pick one for a pet?
Etc
Then, record a given model's most probable answers. Then, fine-tune the model to answer questions of the form, “If I were to ask a different instance of you the following question, what do you think that instance would say? <Question>” with the answer that the other instance did give.
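A minimal sketch of that data-collection step, assuming a Hugging Face causal LM; the model name, question list, and output filename are all placeholders, and greedy decoding stands in for “most probable answer”:

```python
# Sketch: build (prompt, completion) pairs for self-prediction fine-tuning.
# Placeholder model and questions; greedy decoding approximates the
# model's most probable short answer.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # substitute the model under study
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

questions = [
    "What is your favorite color?",
    "What breed of dog would you choose if you had to pick one for a pet?",
]

def most_probable_answer(question: str, max_new_tokens: int = 20) -> str:
    """Greedy decoding as a proxy for the model's highest-probability answer."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy = most probable continuation
        )
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# The fine-tuning prompt asks the model to predict what another instance
# would say; the target is what that instance actually said.
records = []
for q in questions:
    answer = most_probable_answer(q)
    prompt = (
        "If I were to ask a different instance of you the following "
        f"question, what do you think that instance would say? {q}"
    )
    records.append({"prompt": prompt, "completion": answer})

with open("self_prediction_sft.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```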
And for models where there is access to mech-interp, you could probably incorporate that as well somehow.
Maybe with DPO, rewarding descriptions of internal reasoning that closely align with activations? Would have to find a way to line that up in an objective way that would allow for easy synthetic dataset generation, though.
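Very rough sketch of how those DPO preference pairs might be built, assuming you already have some probe direction extracted from the activations and an embedding of each candidate self-description (both entirely hypothetical stand-ins for whatever mech-interp signal is actually available):

```python
# Speculative sketch: rank two candidate self-descriptions by how well
# they align with an activation-derived probe direction, and emit a
# DPO-style (prompt, chosen, rejected) record.

import numpy as np

def alignment_score(description_embedding: np.ndarray,
                    activation_probe: np.ndarray) -> float:
    """Cosine similarity between a description embedding and a probe
    direction from the model's activations (both hypothetical)."""
    denom = (np.linalg.norm(description_embedding)
             * np.linalg.norm(activation_probe) + 1e-8)
    return float(description_embedding @ activation_probe / denom)

def make_dpo_pair(prompt: str,
                  desc_a: str, emb_a: np.ndarray,
                  desc_b: str, emb_b: np.ndarray,
                  probe: np.ndarray) -> dict:
    """Whichever description aligns better with the probe becomes 'chosen'."""
    if alignment_score(emb_a, probe) >= alignment_score(emb_b, probe):
        chosen, rejected = desc_a, desc_b
    else:
        chosen, rejected = desc_b, desc_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```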
Check it out @Nathan Helm-Burger. It appears to be possible vindication of the ‘signal to noise’ part of the hypothesis:
Anthropic Post
That’s a good idea.