Clarification request. In the writeup, you discuss the AI Bayes net and the human Bayes net as if there’s some kind of symmetry between them, but it seems to me that there’s at least one big difference.
In the AI’s case, the Bayes net is explicit, in the sense that once training is done we could print it out on a sheet of paper and try to study it; the main reason we don’t is that it’s likely to be far too big to make much sense of.
In the human’s case, we have no idea what the Bayes net looks like, because humans don’t have that kind of introspective access. In fact, there’s not much difference between saying “the human uses a Bayes net” and saying “the human uses some arbitrary function F, and we worry the AI will figure out F and then use it to lie to us”.
Or am I actually wrong and it’s okay for a “builder” solution to assume we have access to the human Bayes net?