Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine.

That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh…

Let’s call this process HCH, for “Humans Consulting HCH.”
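As a minimal sketch of this definition, the machine can be written as a function that runs Hugh's policy and hands him the machine itself. The toy `hugh` policy and the addition questions below are illustrative assumptions, not part of the original post; a real Hugh is a person, not a parser.

```python
from typing import Callable

Ask = Callable[[str], str]


def hugh(question: str, ask: Ask) -> str:
    """Hypothetical stand-in for Hugh: he can add at most two numbers himself."""
    numbers = [int(token) for token in question.split("+")]
    if len(numbers) <= 2:
        # Simple enough to answer without the machine.
        return str(sum(numbers))
    # Too many terms: ask the machine about all but the last one,
    # then do the final addition himself.
    partial = ask("+".join(str(n) for n in numbers[:-1]))
    return str(int(partial) + numbers[-1])


def hch(question: str) -> str:
    """The machine answers as Hugh would, given access to this same machine."""
    return hugh(question, ask=hch)


print(hch("1+2+3+4"))  # prints 10
```

The only recursion is in `hch` passing itself to `hugh`; whether the process terminates depends entirely on how the human policy chooses to use `ask`.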

I’ve talked about many variants of this process before, but I find it easier to think about with a nice handle. (Credit to Eliezer for proposing using a recursive acronym.)

HCH is easy to specify very precisely. For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.

## Elaborations

We can define realizable variants of this inaccessible ideal:

For a particular prediction algorithm P, define HCHᴾ as: “P’s prediction of what a human would say after consulting HCHᴾ”

For a reinforcement learning algorithm A, define max-HCHᴬ as: “A’s output when maximizing the evaluation of a human after consulting max-HCHᴬ”

For a given market structure and participants, define HCHᵐᵃʳᵏᵉᵗ as: “the market’s prediction of what a human will say after consulting HCHᵐᵃʳᵏᵉᵗ”

Note that e.g. HCHᴾ is totally different from “P’s prediction of HCH.” HCHᴾ will generally make worse predictions, but it is easier to implement.

## Hope

The best case is that HCHᴾ, max-HCHᴬ, and HCHᵐᵃʳᵏᵉᵗ are:

As capable as the underlying predictor, reinforcement learner, or market participants.

Aligned with the enlightened judgment of the human, e.g. as evaluated by HCH.

(At least when the human is suitably prudent and wise.)

It is clear from the definitions that these systems can’t be any more capable than the underlying predictor/learner/market. I honestly don’t know whether we should expect them to match the underlying capabilities. My intuition is that max-HCHᴬ probably can, but that HCHᴾ and HCHᵐᵃʳᵏᵉᵗ probably can’t.

It is similarly unclear whether the system continues to reflect the human’s judgment. In some sense this is in tension with the desire to be capable — the more guarded the human, the less capable the system but the more likely it is to reflect their interests. The question is whether a prudent human can achieve both goals.

This was originally posted here on 29th January 2016.


The next post in this sequence is ‘Corrigibility’ by Paul Christiano, which will be published on Tuesday 27th November.

Another question. HCH is defined as a fixed point of a certain process. But that process probably has many fixed points, some of which might be weird. For example, HCH could return a “universal answer” that brainwashes the human using it into returning the same “universal answer”. Or it could be irrationally convinced that e.g. God exists but a proof of that can’t be communicated. What does the landscape of fixed points look like? Since we’ll presumably approximate HCH by something other than actually simulating a lot of people, will the approximation lead to the right fixed point?

Yes, if the queries aren’t well-founded then HCH isn’t uniquely defined even once you specify H; there is a class of solutions. If there is a bad solution, I think you need to do work to rule it out and wouldn’t count on a method magically finding the answer.
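To illustrate the non-uniqueness, here is a toy sketch in which a candidate machine is just a lookup table, and a table counts as a fixed point if it answers every question the way the human would when consulting that same table. The names `relay_human`, `grounded_human`, and `is_fixed_point` are hypothetical, introduced only for this example.

```python
from typing import Callable, Dict

Machine = Dict[str, str]  # a candidate machine: question -> answer
Human = Callable[[str, Machine], str]


def relay_human(question: str, machine: Machine) -> str:
    """Hypothetical human who passes the question on unchanged and repeats the reply."""
    return machine[question]


def grounded_human(question: str, machine: Machine) -> str:
    """Hypothetical human who answers on his own, never consulting the machine."""
    return "42"


def is_fixed_point(human: Human, machine: Machine) -> bool:
    """True if the machine answers each question as the human would when using it."""
    return all(human(q, machine) == answer for q, answer in machine.items())


# The relaying human is consistent with contradictory machines.
print(is_fixed_point(relay_human, {"Q": "yes"}))     # True
print(is_fixed_point(relay_human, {"Q": "no"}))      # True
# The grounded human admits only one consistent machine for this question.
print(is_fixed_point(grounded_human, {"Q": "42"}))   # True
print(is_fixed_point(grounded_human, {"Q": "yes"}))  # False
```

When the human's queries are not well-founded (here, asking the very same question), the definition alone does not pick out an answer; any self-consistent table qualifies.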

It is not at all clear to me how this works. The questions that immediately occur to me are:

How does the recursion bottom out? If real Hugh’s response to the question is to ask the machine, then perfectly simulated Hugh’s response must be the same. If real Hugh’s response is not to ask the machine, then the machine remains unused.

If, somehow, it bottoms out at level n, then Hugh^n must be answering without consulting the HCH. How does that simulated Hugh differ from Hugh^(n-1), that it is able to do something different?

Does Hugh^n know he’s Hugh^n?

If the Hugh^i (i<n) all just relay Hugh^n’s answer, what is gained over Hugh-prime answering directly?

> How does the recursion bottom out? If real Hugh’s response to the question is to ask the machine, then perfectly simulated Hugh’s response must be the same. If real Hugh’s response is not to ask the machine, then the machine remains unused.

I think there are lots of strategies here that just fail to work. For example, if Hugh passes on the question with no modification, then you build an infinite tower that never does any work.

But there are strategies that do work. For example, whenever Hugh receives a question he can answer, he does so, and whenever he receives a question that is ‘too complicated’, he divides it into subquestions and consults HCH separately on each subquestion, using the results of the consultation to compute the overall answer. This looks like it will terminate, so long as the answers can flow back up the pyramid. Hugh could also pass along numbers about how subdivided a question has become, or the whole stack trace so far, in case there are problems that seem like they have cyclical dependencies (where I want to find out A, which depends on B, which depends on C, which depends on A, which depends on...). Hugh could pass back upwards results like “I didn’t know how to make progress on the subproblem you gave me.”

For example, you could imagine attempting to prove a mathematical conjecture. The first level has Hugh looking at the whole problem, and he thinks “I don’t know how to solve this, but I would know how to solve it if I had lemmas like A, B, and C.” So he asks HCH to separately solve A, B, and C. This spins up a copy of Hugh looking at A, who also thinks “I don’t know how to solve this, but I would if I had lemmas like Aa, Ab, and Ac.” This spins up a copy of Hugh looking at Aa, who thinks “oh, this is solvable like so; here’s a proof of Aa.” Hugh_A is now looking at the proofs, disproofs, and indeterminates of Aa, Ab, and Ac, and can now either write up a conclusion about A or spin up new subagents to examine new subparts of the problem.
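The decomposition strategy, with the stack trace passed down and an "I couldn't make progress" result passed back up, might be sketched as follows. The `DEPENDS_ON` table is a hypothetical stand-in for Hugh's judgment about which subquestions to ask, and `None` plays the role of the "didn't know how to make progress" reply.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table: the subquestions Hugh would ask for each question.
DEPENDS_ON: Dict[str, List[str]] = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A"],       # cyclic dependency: A -> B -> C -> A
    "D": ["E", "F"],
    "E": [],          # simple enough to answer directly
    "F": [],
}


def consult(question: str, trace: Tuple[str, ...] = ()) -> Optional[str]:
    """Return an answer, or None meaning 'I could not make progress on this'."""
    if question in trace:
        # The question already appears further up the stack: report failure
        # instead of building an infinite tower.
        return None
    subquestions = DEPENDS_ON[question]
    if not subquestions:
        return f"answer({question})"
    # Consult a fresh copy on each subquestion, passing the trace so far.
    results = [consult(sub, trace + (question,)) for sub in subquestions]
    if any(result is None for result in results):
        return None
    return f"answer({question}; from {', '.join(results)})"


print(consult("D"))  # answer(D; from answer(E), answer(F))
print(consult("A"))  # None
```

In this sketch a failure anywhere below makes the parent give up too; a more capable Hugh could instead try a different decomposition when a subquestion comes back unanswered.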

Note that in this formulation, you primarily have communication up and down the pyramid, and the communication is normally at the creation and destruction of subagents. It could end up that you prove the same lemma thousands of times across the branches of the tree, because it turned out to be useful in many different places.
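A rough sense of how much repeated work this can mean: in the sketch below, each lemma `k` is assumed (purely for illustration) to need lemmas `k - 1` and `k - 2`, and no branch can see what any other branch has already proved. The counter records how often each lemma gets proved.

```python
from collections import Counter


def prove(lemma: int, proofs: Counter) -> None:
    """Prove `lemma` by spinning up fresh subagents for its two prerequisites."""
    proofs[lemma] += 1
    if lemma >= 2:
        # Hypothetical dependency structure: lemma k needs k-1 and k-2.
        prove(lemma - 1, proofs)
        prove(lemma - 2, proofs)


def count_proofs(top: int) -> Counter:
    """Count how many times each lemma is proved when proving lemma `top`."""
    proofs: Counter = Counter()
    prove(top, proofs)
    return proofs


proofs = count_proofs(20)
print(len(proofs))           # 21 distinct lemmas
print(proofs[1])             # 6765 separate proofs of lemma 1
print(sum(proofs.values()))  # 21891 proofs in total
```

Only 21 distinct lemmas exist here, yet the tree contains tens of thousands of proofs, because communication runs only between parent and child.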

So, one way of solving the recursion problem would be for Hugh to never use the machine as a first resort for answering a question Q. Instead, Hugh must resolve to ask the machine only for answers to questions that are “smaller” than Q in some well-ordered sense, and do the rest of the work himself.

But unless the machine is faster at simulating Hugh than Hugh is at being Hugh, it is not clear what is gained. Even if it is, all you get is the same answer that unaided Hugh would have got, but faster.
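The "only ask smaller questions" rule can be made concrete with a guard on the machine. In this sketch (sorting is an assumed example task, and list length is an assumed measure of "smaller"), Hugh merges two sorted lists himself and delegates only strictly shorter lists, while a Hugh who forwards the question unchanged is rejected rather than recursing forever.

```python
from typing import Callable, List

Ask = Callable[[List[int]], List[int]]


def make_guarded_ask(parent: List[int], machine: Ask) -> Ask:
    """Wrap the machine so it only accepts questions smaller than `parent`."""
    def ask(sub: List[int]) -> List[int]:
        if len(sub) >= len(parent):
            raise ValueError("subquestion is not smaller than the question")
        return machine(sub)
    return ask


def hugh_sort(xs: List[int], ask: Ask) -> List[int]:
    """Hypothetical Hugh: delegates the halves, does the merge himself."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = ask(xs[:mid]), ask(xs[mid:])
    merged: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def hch_sort(xs: List[int]) -> List[int]:
    return hugh_sort(xs, make_guarded_ask(xs, hch_sort))


def lazy_hch(xs: List[int]) -> List[int]:
    """A Hugh who passes the question on unchanged; the guard rejects this."""
    return make_guarded_ask(xs, lazy_hch)(xs)


print(hch_sort([3, 1, 2]))  # [1, 2, 3]
```

Because every delegated list is strictly shorter, the recursion must bottom out. The sketch says nothing about the speed objection, though: it computes exactly what one patient Hugh would compute alone.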

Without resource constraints, I feel like my intuition kinda slides off the model. Do you have a sense of HCH’s performance under resource constraints? For example, let’s say each human can spend 1 day thinking, make 10 queries to the next level, and there are 10 levels in total. What’s the hardest problem solvable by this setup that you can imagine?

Depends on the human. I think 10 levels with branching factor 10 and 1 day per step is in the ballpark of “go from no calculus to general relativity” (at least if we strengthen the model by allowing pointers), but it’s hard to know, and most people aren’t so optimistic.
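For scale, the setup in the question can be turned into a quick count. This is only an upper bound under two assumptions that the thread does not state: every human above the bottom level uses all 10 queries, and the 10th level answers without querying anyone.

```python
def tree_size(levels: int, branching: int) -> int:
    """Number of humans in a full tree: 1 + b + b^2 + ... + b^(levels - 1)."""
    return sum(branching ** k for k in range(levels))


humans = tree_size(levels=10, branching=10)
person_days = humans * 1  # one day of thinking per human
person_years = person_days / 365

print(humans)               # 1111111111
print(round(person_years))  # about three million person-years
```

So the question is roughly what a bit over a billion person-days of thinking can achieve when it is split into one-day pieces that communicate only through questions and answers.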

Yeah, I don’t know how optimistic I should be, given that one day isn’t enough even to get fluent with calculus. Can you describe the thought process behind your guess? Maybe describe how you imagine the typical days of people inside the tree, depending on the level?
