(Update: I don’t think this was 100% right, see here for a better version.)
Attempted summary for morons like me: AI is trying to help the human H. They share access to a single output channel, e.g. a computer keyboard, so that the actions that H can take are exactly the same as the actions AI can take. Every step, AI can either take an action, or delegate to H to take an action. Also, every step, H reports her current assessment of the timeline / probability distribution for whether she’ll succeed at the task, and if so, how soon.
At first, AI will probably delegate to H a lot, and by watching H work, AI will gradually learn both the human policy (i.e. what H tends to do in different situations), and how different actions tend to turn out in hindsight from H’s own perspective (e.g., maybe whenever H takes action 17, she tends to declare shortly afterwards that probability of success now seems much higher than before—so really H should probably be taking action 17 more often!).
Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate how different actions will turn out from H’s perspective much better than H herself. In other words, maybe it delegates to H, and H takes action 41, and the AI is watching this and shaking its head and thinking to itself “gee you dunce you’re gonna regret that”, and shortly thereafter the AI is proven correct.
OK, so now what? The naive answer would be: the AI should gradually stop delegating and start just doing the thing that leads to H feeling maximally optimistic later on.
But we don’t want to do that naive thing. There are two problems:
The first problem is “traps” (a.k.a. catastrophes). Let’s say action 0 is Press The History Eraser Button. H never takes that action. The AI shouldn’t either. What happens is: AI has no idea (wide confidence interval) about what the consequence of action 0 would be, so it doesn’t take it. This is the delegative RL thing—in the explore/exploit dilemma, the AI kinda sits back and lets H do the exploring, at least when a trap might be involved.
But here we want to go further. Let’s say that there’s some action which H might do, where the AI can predict the consequence pretty well, and the consequence is “H will eventually feel more optimistic about success in the task”. Then the AI can go do that action, right? Not necessarily! The action might still be bad, because it involved corrupting the feedback channel.
So let’s say “action 789 = H will eat a pot brownie”. Taking that action makes H both much more optimistic about success and much less likely to actually succeed in the task. (It makes her assessments inaccurate.)
If the AI has no idea what pot brownies do, it won’t take action 789 because of the previous thing—maybe that action is a trap. But what if the AI has (indirectly) figured out, with 100% confidence, exactly what the consequence of action 789 would be?
In that case, since the AI’s decision criterion is all about H’s future reported odds of success, a naive algorithm would have the AI take action 789. To avoid that, we have this quantilization thing. How does that work?
Well, let’s say that H right now assigns 99% chance of eventual success. Therefore—assuming that H is well-calibrated and knows the consequences of action 789—it follows that H’s policy is to take that action with ≤1% probability. Basically, H is optimistic about success in the task, partly because H knows that she’s almost definitely not going to eat that pot brownie. I mean, she might eat it, if she gets really bored and frustrated and hungry. But probably not! So she’s feeling very optimistic.
OK, now the AI quantilizes by drawing 5 random samples from (its understanding of) H’s policy, and taking the action that (it believes) will lead to the highest future-optimism-about-success. Assume its beliefs are accurate. Compared to just doing what H would have done, this has downsides and upsides. The downside is, since action 789 scores really well by the AI’s decision criterion, the probability of action 789 is no longer ≤1%, but rather ≤5%. The upside is, if we’re not taking action 789, i.e. the other ≥95% of the probability mass, we’re probably doing something quite a bit better than what H would have done.
These could plausibly balance out such that the expected probability of success strictly increases, compared to what H would have done.
Please correct me if I’m misunderstanding anything.
Notice that typically we use the AI for tasks which are hard for H. This means that without the AI’s help, H’s probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but they don’t expect to fail soon. Therefore, the AI can safely consider a policies of the form “in the short-term, do something H would do with marginal probability, in the long-term go back to H’s policy”. If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it’s possible that in the new prognosis H still doesn’t expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.
(Update: I don’t think this was 100% right, see here for a better version.)
Attempted summary for morons like me: AI is trying to help the human H. They share access to a single output channel, e.g. a computer keyboard, so that the actions that H can take are exactly the same as the actions AI can take. Every step, AI can either take an action, or delegate to H to take an action. Also, every step, H reports her current assessment of the timeline / probability distribution for whether she’ll succeed at the task, and if so, how soon.
At first, AI will probably delegate to H a lot, and by watching H work, AI will gradually learn both the human policy (i.e. what H tends to do in different situations), and how different actions tend to turn out in hindsight from H’s own perspective (e.g., maybe whenever H takes action 17, she tends to declare shortly afterwards that probability of success now seems much higher than before—so really H should probably be taking action 17 more often!).
Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate how different actions will turn out from H’s perspective much better than H herself. In other words, maybe it delegates to H, and H takes action 41, and the AI is watching this and shaking its head and thinking to itself “gee you dunce you’re gonna regret that”, and shortly thereafter the AI is proven correct.
OK, so now what? The naive answer would be: the AI should gradually stop delegating and start just doing the thing that leads to H feeling maximally optimistic later on.
But we don’t want to do that naive thing. There are two problems:
The first problem is “traps” (a.k.a. catastrophes). Let’s say action 0 is Press The History Eraser Button. H never takes that action. The AI shouldn’t either. What happens is: AI has no idea (wide confidence interval) about what the consequence of action 0 would be, so it doesn’t take it. This is the delegative RL thing—in the explore/exploit dilemma, the AI kinda sits back and lets H do the exploring, at least when a trap might be involved.
But here we want to go further. Let’s say that there’s some action which H might do, where the AI can predict the consequence pretty well, and the consequence is “H will eventually feel more optimistic about success in the task”. Then the AI can go do that action, right? Not necessarily! The action might still be bad, because it involved corrupting the feedback channel.
So let’s say “action 789 = H will eat a pot brownie”. Taking that action makes H both much more optimistic about success and much less likely to actually succeed in the task. (It makes her assessments inaccurate.)
If the AI has no idea what pot brownies do, it won’t take action 789 because of the previous thing—maybe that action is a trap. But what if the AI has (indirectly) figured out, with 100% confidence, exactly what the consequence of action 789 would be?
In that case, since the AI’s decision criterion is all about H’s future reported odds of success, a naive algorithm would have the AI take action 789. To avoid that, we have this quantilization thing. How does that work?
Well, let’s say that H right now assigns 99% chance of eventual success. Therefore—assuming that H is well-calibrated and knows the consequences of action 789—it follows that H’s policy is to take that action with ≤1% probability. Basically, H is optimistic about success in the task, partly because H knows that she’s almost definitely not going to eat that pot brownie. I mean, she might eat it, if she gets really bored and frustrated and hungry. But probably not! So she’s feeling very optimistic.
OK, now the AI quantilizes by drawing 5 random samples from (its understanding of) H’s policy, and taking the action that (it believes) will lead to the highest future-optimism-about-success. Assume its beliefs are accurate. Compared to just doing what H would have done, this has downsides and upsides. The downside is, since action 789 scores really well by the AI’s decision criterion, the probability of action 789 is no longer ≤1%, but rather ≤5%. The upside is, if we’re not taking action 789, i.e. the other ≥95% of the probability mass, we’re probably doing something quite a bit better than what H would have done.
These could plausibly balance out such that the expected probability of success strictly increases, compared to what H would have done.
Please correct me if I’m misunderstanding anything.
This is about right.
Notice that typically we use the AI for tasks which are hard for H. This means that without the AI’s help, H’s probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but they don’t expect to fail soon. Therefore, the AI can safely consider a policies of the form “in the short-term, do something H would do with marginal probability, in the long-term go back to H’s policy”. If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it’s possible that in the new prognosis H still doesn’t expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.