How likely is a random AGI to be honest?

(this post came out of a conversation between Critch, Ryan, Scott, and me)

Methods for aligning “messy” AGI systems (e.g. ALBA) usually at some point depend on finding a set of models, at least one of which is a good honest predictor. For example, this may be a set of predictors in KWIK learning, or a set of possible policies for a red team to use to construct simulations.

Concretely, we could consider training a bunch of different models (e.g. with different hyperparameters) to predict human answers to a variety of questions through some context changes. We train and filter models until we have $n$ different models that all perform well on these training tasks. Now we use these models to answer important test questions, relying on some scheme similar to KWIK learning to ensure that our system answers questions well as long as at least one of the $n$ models answers test questions well. How high should $n$ be so that this is the case?
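To make the intended guarantee concrete, here is a minimal sketch of one way such a scheme could work, assuming numeric answers, a human oracle that can be queried when the pool disagrees, and (crucially) that at least one model in the pool always answers close to the truth. The function and parameter names are illustrative, not taken from KWIK learning or any particular library.

```python
def kwik_consensus_answer(models, question, ask_human, tolerance):
    """Answer one question using a pool of models, KWIK-style.

    Key assumption: at least one model in `models` always predicts within
    `tolerance` of the human's answer. `ask_human` is an expensive oracle,
    consulted only when the pool disagrees. Returns (answer, surviving_models).
    """
    predictions = [m.predict(question) for m in models]

    # If all surviving models roughly agree, trust the consensus: the good model
    # is in the pool, so every prediction (hence the mean) is close to its answer.
    if max(predictions) - min(predictions) <= 2 * tolerance:
        return sum(predictions) / len(predictions), models

    # Otherwise the pool effectively says "I don't know": ask the human, then
    # drop every model whose prediction was too far from the human's answer.
    # The good model is never dropped, and at least one bad model always is.
    truth = ask_human(question)
    survivors = [m for m, p in zip(models, predictions)
                 if abs(p - truth) <= tolerance]
    return truth, survivors
```

With a protocol like this, every human query eliminates at least one bad model and never the good one, so the number of test-time human labels is bounded by the pool size; the question then becomes how large the pool must be to contain a good honest predictor at all, which is what the rest of the post tries to estimate.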

A naive analysis would conclude that $n$ does not need to be very high (it can be, say, 10). This is because all the models have performed well throughout context changes, and the context change from training questions to test questions should not, in principle, be harder than these previous context changes.

One reason to suspect that such a small $n$ is insufficient is that the context change from training questions to test questions is a change from unimportant questions to important questions (i.e. humans actually use the answers to test questions to make important decisions). This creates an opportunity for a treacherous turn: if one of the models is a consequentialist agent, it may answer training questions honestly and then answer test questions dishonestly, in a way that causes the humans to make decisions in that agent's interests.

In the worst case, all successful models are consequentialist agents, so we must set $n$ high enough that at least one of the consequentialists has a goal compatible with always answering test questions honestly.

What’s the probability that a randomly-selected consequentialist that performs well on training tasks has such a goal? One rough way of estimating this is to imagine the “language of thought” of an AGI that has been trained to answer questions useful to humans. We can then think about the complexity of the “honesty” concept in its language of thought.

First I will estimate this complexity in the case of humans. By my very rough guesses:

  • about 10% of human concepts are social

  • about 5% of human social concepts are about cooperation

  • honesty is about 5% of cooperation

So the honesty concept might have a probability of about $2.5 \times 10^{-4}$ in the human concept space. As an independent check, let's consider the frequency of words related to honesty in the English language. According to this page, the frequency of the word “honest” in English text is on the order of $10^{-5}$. I’m not sure which estimate to take more seriously; for now I’ll just pretend that the honesty concept has a probability of something like $10^{-4}$ in the human concept space.
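Spelling out the arithmetic behind that first estimate, using the three rough fractions listed above:

$$0.10 \times 0.05 \times 0.05 = 2.5 \times 10^{-4}.$$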

How does this apply to AGI systems? If the AGI is good at modelling humans, then maybe in some sense it is devoting some amount of its conceptual vocabulary to “mirroring” human concepts. Say it is devoting 3% of the probability mass of its language of thought to human concepts. Then the human concept of honesty makes up about $3 \times 10^{-6}$ of its conceptual probability mass.
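Again spelling out the arithmetic, under these assumed numbers (3% of the language of thought mirroring human concepts, honesty at something like $10^{-4}$ of the human concept space):

$$0.03 \times 10^{-4} = 3 \times 10^{-6}.$$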

The AGI could be assigning substantially more than that much mass to some concept of honesty, perhaps because the AGI finds a more general concept of honesty instrumentally useful (maybe not the same as humans’ concept, but sufficient for answering test questions honestly).

A lot of these numbers are made up intuitively. Under uncertainty about what the right numbers are, I’m going to tentatively conclude that it is very likely that the proportion of successfully-human-predicting AGIs that are honest is at least $10^{-6}$, and somewhat likely that it is above $10^{-4}$.

Is the presence of at least one honest predictor sufficient for the question-answering system to work? Not necessarily; frameworks such as KWIK learning assume that at least one of the predictors is Bayes-optimal, whereas in reality Bayes-optimality is impossible for bounded reasoners. So further analysis will be necessary to positively establish that something like KWIK would work. But for now it appears that the frequency of dishonest models compared to honest models is not a fundamental obstruction to schemes based on predicting humans, at least if one is willing to gather millions of data points from humans.
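As a rough consistency check on that last clause, here is a small Fermi calculation under the assumptions made up above (an honest fraction of roughly $10^{-6}$) and the illustrative elimination scheme sketched earlier; the numbers are guesses, not outputs of any real training run.

```python
import math

# Fermi check on the closing claim, using the made-up numbers from this post.
honest_fraction = 1e-6     # assumed lower bound on the proportion of honest models
target_confidence = 0.99   # desired probability that the pool contains an honest model

# Models needed so that, with probability `target_confidence`, at least one
# independently sampled model is honest:
#   1 - (1 - p)^n >= target_confidence
#   =>  n >= log(1 - target_confidence) / log(1 - p)
pool_size = math.ceil(math.log(1 - target_confidence) / math.log(1 - honest_fraction))

# In the consensus sketch above, each disagreement-triggered human query eliminates
# at least one bad model, so test-time human labels are bounded by the pool size.
max_human_queries = pool_size - 1

print(f"pool size needed:        ~{pool_size:.1e}")   # roughly 4.6 million models
print(f"worst-case human labels: ~{max_human_queries:.1e}")
```

Under those assumptions, a pool of a few million models suffices to contain an honest one with high probability, and the elimination scheme would then require at most a few million human labels at test time, which is the “millions of data points” regime the post ends on.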