The AGI needs to be honest

Imagine that you are a trained mathematician and you have been assigned the job of testing an arbitrarily intelligent chatbot for its intelligence.

You being knowledgeable about a fair amount of computer-science theory won’t test it with the likes of Turing-test or similar, since such a bot might not have any useful priors about the world.

You have asked it find a proof for the Riemann-hypothesis. the bot started its search program and after several months it gave you gigantic proof written in a proof checking language like coq.

You have tried to run the proof through a proof-checking assistant but quickly realized that checking that itself would years or decades, also no other computer except the one running the bot is sophisticated enough to run proof of such length.

You have asked the bot to provide you a zero-knowledge-proof, but being a trained mathematician you know that a zero-knowledge-proof of sufficient credibility requires as much compute as the original one. also, the correctness is directly linked to the length of the proof it generates.

You know that the bot may have formed increasingly complex abstractions while solving the problem, and it would be very hard to describe those in exact words to you.

You have asked the bot to summarize the proof for you in natural-language, but you know that the bot can easily trick you into accepting the proof.

You have now started to think about a bigger question, the bot essentially is a powerful optimizer. In this case, the bot is trained to find proofs, its reward is based on finding what a group of mathematicians agree on how a correct proof looks like.

But the bot being a bot doesn’t care about being honest to you or to itself, it is not rewarded for being “honest” it is only being rewarded for finding proof-like strings that humans may select or reject.

So it is far easier for it to find a large coq-program, large enough that you cannot check by any other means, than to actually solve riemann-hypothesis.

Now you have concluded that before you certify that the bot is intelligent, you have to prove that the bot is being honest.

Going by the current trend, it is okay for us to assume that such an arbitrarily intelligent bot would have a significant part of it based on the principles of the current deep-learning stack. assume it be a large neural-network-based agent, also assume that the language-understanding component is somewhat based on the current language-model design.

So how do you know that the large language model is being honest?

A quick look at the plots of results on truthful-qa dataset shows that truthfulness reduces with the model-size, going by this momentum any large-models trained on large datasets are more likely to give fluke answers to significantly complex questions.

Any significantly complex decision-question if cast into an optimization problem has one hard-to-find global-minima called “truth” but extremely large number of easy-to-find local-minima of falsehoods, how do you then make a powerful optimizer optimize for honesty?