It seems to me that you could get around this problem by training an extractor model *() that takes M(x) as input and outputs M's beliefs about inaccessible statements about x. You could train *() by generating latent information y, using y to generate x, computing M(x), and minimizing the loss L(*(M(x)), y). If you do this for a sufficiently broad set of (x, y) pairs, you might be able to extract arbitrary information from M's beliefs. It might even be possible for *() to access information that M "knows" in the sense that M has all the relevant pieces, but which is still inaccessible to M itself because M lacks the logical machinery to put those pieces together.
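To make the training loop concrete, here is a minimal sketch in PyTorch. Everything in it is an illustrative assumption rather than part of the original proposal: M is stood in for by a frozen random network whose output we treat as its "beliefs", the generative process from y to x is a noisy linear map, and all dimensions and names (extractor, gen, etc.) are hypothetical. The extractor network plays the role of *().

```python
import torch
import torch.nn as nn

# Toy dimensions (arbitrary, for illustration only).
LATENT_DIM = 4   # dimensionality of the latent information y
OBS_DIM = 16     # dimensionality of the observation x
REP_DIM = 32     # dimensionality of M's output / internal representation M(x)

# Stand-in for M: a frozen network whose output M(x) we treat as its beliefs.
M = nn.Sequential(
    nn.Linear(OBS_DIM, REP_DIM), nn.ReLU(), nn.Linear(REP_DIM, REP_DIM)
)
for p in M.parameters():
    p.requires_grad_(False)  # M is fixed; only the extractor is trained

# The extractor *(): maps M(x) to a prediction of the latent y.
extractor = nn.Sequential(
    nn.Linear(REP_DIM, REP_DIM), nn.ReLU(), nn.Linear(REP_DIM, LATENT_DIM)
)

# A hypothetical generative process: sample y, then produce x from y.
gen = nn.Linear(LATENT_DIM, OBS_DIM)

opt = torch.optim.Adam(extractor.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    with torch.no_grad():
        y = torch.randn(64, LATENT_DIM)               # latent information y
        x = gen(y) + 0.1 * torch.randn(64, OBS_DIM)   # generate x from y
        rep = M(x)                                    # compute M(x), M frozen
    y_hat = extractor(rep)     # extractor's guess at y, given only M(x)
    loss = loss_fn(y_hat, y)   # L(*(M(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design point the sketch illustrates is that gradients only flow through the extractor: M is queried but never updated, so any y-information the extractor recovers must already be present in M(x).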
This is similar to high-school English multiple-choice questions, where the reader must infer something about the text they just read. It's also similar to experiments where neuroscience researchers train a model to map an animal's brain-cell activity to what the animal sees.