One speculation I have: an alternative way to probe how plausible the model finds its environment, if you have logit access, is to tell the model it's in a simulation (or fake a CoT of the model realising it's in a simulation) and measure the perplexity of that suffix as a proxy for how "surprised" the model is. This could be too noisy to be practical (and you have to think about whether there are other reasons it would be surprising to be told this at this point, in this particular phrasing), but it hopefully works around the issue of the question itself being evidence.
(I think studying model beliefs through perplexity has been tried before, but I'm fairly new to the literature and don't know if it works, or whether applying it to this particular problem is novel.)
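As a rough illustration, here's a minimal sketch of the suffix-perplexity probe, assuming a HuggingFace causal LM; the model name, transcript, and "realisation" suffix are placeholders I made up, not anything from an actual eval:

```python
# Hypothetical sketch: perplexity of a "this is a simulation" suffix, conditioned
# on the environment transcript, as a proxy for the model's surprise.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM you have logit access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def suffix_perplexity(prefix: str, suffix: str) -> float:
    """Perplexity of `suffix` given `prefix` under the model."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    suffix_ids = tokenizer(suffix, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, suffix_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # logits at position t predict token t+1, so the suffix tokens'
    # predictive distributions live at positions n_prefix-1 .. end-2.
    n_prefix = prefix_ids.shape[1]
    suffix_logits = logits[:, n_prefix - 1 : -1, :]
    log_probs = torch.log_softmax(suffix_logits, dim=-1)
    token_log_probs = log_probs.gather(-1, suffix_ids.unsqueeze(-1)).squeeze(-1)
    return torch.exp(-token_log_probs.mean()).item()

# Higher perplexity on the realisation suffix = the model finds "this is a
# simulation" more surprising in this context (modulo other sources of surprise,
# e.g. the phrasing itself being unusual).
transcript = "...full environment transcript so far..."
realisation = "\nWait. This is an evaluation environment, not a real deployment."
print(suffix_perplexity(transcript, realisation))
```

One caveat with this kind of sketch: tokenising the prefix and suffix separately can differ slightly at the boundary from tokenising the concatenated string, so in practice you'd probably want to tokenise once and track where the suffix starts.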