So the truthfulness of an agent depends on the expected truth value of the statements that it generates. I would provisionally define the truth value of a message to be the degree to which it increases the mutual information between a recipient’s mental model of reality and the structure of reality itself. In other words, statements are true insofar as they enable the listener to make better predictions about what will happen in the real world, especially about what impact their actions will have, and they are false insofar as they reduce this ability. “Truth” is then a measure rather than a binary label, and it can be arbitrarily positive or negative.
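To put a hedged formalization on that (the notation is mine, not standard): write W for the structure of reality, B for the recipient’s mental model before receiving message m, and B′ for the model afterward. The truth value is then the change in mutual information:

```latex
% Hypothetical notation: W = reality, B / B' = recipient's model before/after m
V(m) = I(B'; W) - I(B; W)
% V(m) > 0: the message improved the listener's predictive grip on reality;
% V(m) < 0: it degraded it. Nothing forces V(m) into a binary {true, false}.
```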
Of course, this definition implies that the same message may have a high truth value for one person and a low or negative truth value for someone else (e.g., a brief explanation of the observer effect in quantum physics leads one person to increase the uncertainty in their world model and seek out additional information, another person to get bored, and another to start thinking that consciousness somehow controls reality at a fundamental level). And sometimes literally false statements can have a greater truth value than literally true statements (e.g., telling a small child that the earth is a perfect sphere adds more to their world model than telling them that its shape is approximated by the equipotential surface of a large, rotating clump of deformable matter). That is, the truth value of a statement depends in part on the listener’s ability to ingest the information. Ideally, a truthful AI would adjust its wording depending on its audience—using metaphors and parables for some people, precise scientific jargon for others, and simplified explanations for broad audiences—to maximize the predicted improvement in the predictive power of its listeners’ collective world models.
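As a toy sketch of that audience adaptation (every name below is a hypothetical stand-in, and the genuinely hard part, predicting how a phrasing will change a particular listener’s model, is hidden inside one assumed method):

```python
# Toy sketch of audience-adaptive wording. `AudienceModel` and its
# `estimate_mi_gain` method are hypothetical placeholders for the hard
# problem: predicting how a given phrasing will change this listener's
# mutual information with reality.

from typing import Protocol

class AudienceModel(Protocol):
    def estimate_mi_gain(self, message: str) -> float: ...

def choose_wording(candidates: list[str], audience: AudienceModel) -> str:
    """Pick the phrasing (metaphor, jargon, simplification, ...) with the
    highest predicted truth value for this particular listener."""
    return max(candidates, key=audience.estimate_mi_gain)
```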
It also seems to me that, in terms of implementation at least, honesty is a prerequisite for achieving robust truthfulness. Consider the following 3-component model:
Truth-Seeking: Use a hierarchical generative model of the world to make predictions, in response to both passive observation and active intervention. Continuously update the model to minimize the free energy between the states that it predicts and those that it ends up observing. If the model updates are kept as Bayesian as possible, the AI’s model of reality should converge onto something that is homeomorphic to the true causal structure of reality.
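As a minimal, runnable stand-in for this component, here is exact Bayesian updating over a tiny discrete hypothesis space (exact Bayes being the limiting case in which variational free energy is fully minimized); the actual proposal would replace this with a hierarchical generative model and active interventions:

```python
import numpy as np

# Exact Bayesian updating over three candidate models of a coin-like
# process -- a toy stand-in for "continuously update the model to
# minimize the free energy between predicted and observed states".

rng = np.random.default_rng(0)

hypotheses = np.array([0.2, 0.5, 0.8])  # each hypothesis: P(obs = 1)
belief = np.ones(3) / 3                 # uniform prior over hypotheses
true_p = 0.8                            # the actual structure of "reality"

for _ in range(200):
    obs = rng.random() < true_p                  # passive observation
    likelihood = hypotheses if obs else 1 - hypotheses
    belief *= likelihood                         # Bayes' rule (unnormalized)
    belief /= belief.sum()                       # renormalize

print(belief)  # posterior mass concentrates on the hypothesis nearest true_p
```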
Honesty: The part of the AI’s mind that generates language outputs needs to carry information that reliably maps to the internal beliefs acquired above. Assuming the first component works, so that those beliefs map well onto reality, this second component can focus solely on reporting the AI’s own belief maps honestly, and it will achieve truthfulness by the transitive property.
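A sketch of what that constraint might look like (illustrative names, not a real architecture): the reporting channel is a plain read-out of the belief map, with no objective of its own, so any untruthfulness has to come from the beliefs rather than the reporting.

```python
from dataclasses import dataclass

# Sketch of an honest reporting channel: a fixed read-out of the belief
# map. It does not optimize for the listener's reaction, so truthfulness
# reduces to the accuracy of the underlying belief (the "transitive
# property" described above).

@dataclass
class Belief:
    claim: str
    credence: float  # probability assigned by the truth-seeking component

def honest_report(belief: Belief) -> str:
    """Verbalize an internal belief as-is, credence included."""
    return f"I estimate that '{belief.claim}' holds with probability {belief.credence:.2f}."
```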
Communication: Of course, actually communicating in a manner that achieves a positive truth value is another matter that may require some level of modeling of its intended recipient. That is, the AI should be able to form some idea of the beliefs of other agents and to predict the effect of possible messages on their beliefs. If these models of other agents are reliable enough (i.e., they are formed in the same way as the AI’s model of reality as a whole), then the AI could simply set a goal for itself of causing the belief maps of its interlocutor to match its own belief maps. (This goal should motivate the AI to improve its communication skills.) The precision of the match that it aims to achieve should of course be inversely proportional to the uncertainty that the AI has in its own world model: the less certain it is about something, the less strongly it should be motivated to get someone else to agree with it. (Uncertainty in its world model should also motivate it to ask targeted questions to fill in the gaps, but that probably involves modeling the trustworthiness and expertise of other agents, and I’m too tired to think of how to do that right now.)
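One way to cash out that precision-weighted matching goal (my framing: KL divergence for the mismatch and an entropy-based confidence weight are choices I’m making here, not part of the original proposal):

```python
import numpy as np

# Sketch of the communication objective: drive the modeled listener's
# belief distribution toward the AI's own, but discount the pressure by
# the AI's uncertainty, so it pushes hardest on claims it is most
# confident about. Distributions are assumed strictly positive.

def entropy(p: np.ndarray) -> float:
    return float(-np.sum(p * np.log(p)))

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """D(p || q): how far the listener's belief q is from the AI's own p."""
    return float(np.sum(p * np.log(p / q)))

def communication_loss(own: np.ndarray, listener: np.ndarray) -> float:
    """Belief-map mismatch weighted by confidence = 1 - normalized entropy.
    A uniform (maximally uncertain) own-belief gives weight 0: no pressure
    to make anyone agree."""
    confidence = 1.0 - entropy(own) / np.log(len(own))
    return confidence * kl(own, listener)
```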
To emphasize, the above model achieves robust truthfulness by the transitive property only if all the links in the chain work as intended. I have no idea how the system might start to drift from truthfulness if any subcomponent goes awry.