The top-level distribution for Amplify(A) puts 50% of its mass on “Try to verify that the current agent is benign.” (Or something like 50%, it’s <100% only because you are training A both to evaluate actions and to certify benignness.) If it performs badly on the current agent, then it’s not performing well on average. There is no other distribution over distilled agents that we are defining the performance guarantee with respect to.

Because the current agent is training over time, we really want to talk about online guarantees rather than assuming a fixed distribution. Amplify(A) is being trained on the sequence of tasks “Try to verify that the agent at t=0 is benign,” “Try to verify that the agent at t=1 is benign,” “Try to verify that the agent at t=2 is benign...” etc.

If there was a learnable model that performs well on most of the subtasks of “Try to verify that the current agent is benign,” then we get to infer that A in fact performs well on most of these subtasks. If amplification preserves average-case performance, then we get to conclude that Amplify(A) performs well on most tasks of the form “Try to verify that the current agent is benign.” If Amplify(A) consistently concludes that the agent is benign, e.g. if Amplify(A) consistently fails to find an input on which the agent behaves malignly, then we hope to conclude that the agent is in fact benign at most steps.

Thanks, I think I’m finally starting to understand your thinking here.

Because the current agent is training over time, we really want to talk about online guarantees rather than assuming a fixed distribution.

Sounds like I really need to figure out how online guarantees work in relation to your scheme. (It’s not clear to me how to map between your setup and the setup in the online learning survey that you cited, e.g., what corresponds to “receive true answer” after every prediction and what corresponds to the set of hypotheses that regret is being measured against.) I’ve been putting it off and just assuming a fixed distribution because you wrote “Overall, I don’t think this distinction is a huge deal.”

If there was a learnable model that performs well on most of the subtasks of “Try to verify that the current agent is benign,”

How do we determine this? (What if the current agent has moved into a part of the agent space where there is no longer a learnable model that performs well on most of the subtasks of “Try to verify that the current agent is benign”?)

if Amplify(A) consistently fails to find an input on which the agent behaves malignly, then we hope to conclude that the agent is in fact benign at most steps.

What does “most” mean here and why is that good enough? (If there are more than 100 steps and “most” means 99% then you can’t rule out having malign agents in some of the steps, which seems like a problem?)

As part of designing a technique for optimizing worst-case performance, we need to argue that the overseer’s job isn’t too hard (so that Amplify(A) is qualified to perform the task). If we remove this restriction, then optimizing worst case performance wouldn’t be scary—adversarial training would probably work fine.

(It’s not clear to me how to map between your setup and the setup in the online learning survey that you cited, e.g., what corresponds to “receive true answer” after every prediction and what corresponds to the set of hypotheses that regret is being measured against.)

See the section “Limited feedback (bandits)” starting on page 177. Online learning doesn’t require seeing the true answer.

We don’t compete with any explicit set of hypotheses. When we say that the “model can learn to do X” then we are saying roughly “the model competes with a set of hypotheses including one that does X.”

What does “most” mean here and why is that good enough?

Most means most agents over the training process. But:

1. Once you have an agent that seems OK, you can freeze that agent and then run the certification process for significantly longer.

2. I expect the model is probably going to have some probability of behaving malignly on any given input anyway based on internal stochasticity. So you probably already need to do something based on ensembling / ensuring sufficient per-timestep robustness.

See the section “Limited feedback (bandits)” starting on page 177. Online learning doesn’t require seeing the true answer.

I’m still having trouble matching up that section with your setup. (It assumes that the agent sees the value of the loss function after every prediction, which I think is not the case in your setup?) Is Section 6 on Online Active Learning in this more comprehensive survey closer to what you have in mind? If so, can you say which of the subsections of Section 6 is the closest? Or alternatively, can you explain the actual formal setup and guarantee you’re hoping ML research will provide, which will be sufficient to accomplish what you need? (Or give an example of such formal setup/guarantee if multiple ones could work?)

Also, what if in the future the most competitive ML algorithms do not provide the kinds of guarantees you need? How likely do you think that is, and what’s the expected outcome (for your approach and AI alignment in general) conditional on that?

We don’t compete with any explicit set of hypotheses. When we say that the “model can learn to do X” then we are saying roughly “the model competes with a set of hypotheses including one that does X.”

Don’t we need to know the size of the set of hypotheses in order to derive a regret bound?

It assumes that the agent sees the value of the loss function after every prediction, which I think is not the case in your setup?

You do get to see the loss function; if you couldn’t, then we couldn’t train A.

Amplify(A) is computed by calling A a bunch of times. The point of amplification is to set things up so that Amplify(A) will work well if the average call to A works well. A random subset of the calls to A are then evaluated (by calling Amplify(A)), so we get to see their loss.

(Obviously you get identical expected regret bounds if you evaluate an x fraction of decisions at random, just with 1/x times more regret—you get a regret bound on the sequence whose loss you evaluate, and that regret is at least x times the total.)
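A minimal sketch of the subsampling point, assuming each decision is selected for evaluation independently with probability x. The function name `estimate_total_loss` and the loss values are hypothetical, chosen only for illustration. The loss observed on the evaluated subsequence is, in expectation, x times the total, so rescaling by 1/x recovers the total on average:

```python
import random

def estimate_total_loss(losses, x, rng):
    """Estimate the summed loss while evaluating only a random fraction x.

    Assumption of this sketch: each decision is evaluated independently
    with probability x. The observed loss is rescaled by 1/x.
    """
    if not 0 < x <= 1:
        raise ValueError("x must be in (0, 1]")
    observed = sum(loss for loss in losses if rng.random() < x)
    return observed / x

# Hypothetical per-decision losses, for illustration only.
losses = [0.5] * 100
rng = random.Random(0)
trials = [estimate_total_loss(losses, 0.25, rng) for _ in range(2000)]

# True total next to the average of the rescaled estimates.
print(sum(losses), sum(trials) / len(trials))
```

The same rescaling is why a regret bound proved on the evaluated subsequence turns into a bound on the whole sequence that is at most 1/x times larger.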

What does d (the number of bandit arms) correspond to in your setup? I’m guessing it’s the size of the hypothesis class that you’re competing with, which must be exponentially large? Since the total regret bound is $2\sqrt{d \log(d)\, T}$ (page 181, assuming you see the loss every round) it seems that you’d have to see an exponential number of losses (i.e., calls to Amplify(A)) before you could get a useful per-round guarantee. What am I missing here?
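To make the concern concrete: requiring the per-round regret 2√(d log(d) T)/T to be at most ε rearranges to T ≥ 4 d log(d)/ε². The sketch below just evaluates that inequality. The helper name `rounds_needed` and the target ε = 0.1 are illustrative choices, not something taken from the survey.

```python
import math

def rounds_needed(d: int, eps: float) -> int:
    """Smallest T such that 2*sqrt(d*log(d)*T)/T <= eps.

    Rearranged: T >= 4*d*log(d)/eps**2. Requires d >= 2 so that log(d) > 0.
    """
    if d < 2 or eps <= 0:
        raise ValueError("need d >= 2 and eps > 0")
    return math.ceil(4 * d * math.log(d) / eps ** 2)

# The required number of rounds grows roughly linearly in d, so a d that is
# exponential in the problem size makes T exponential as well.
for bits in (1, 10, 20):
    print(bits, rounds_needed(2 ** bits, 0.1))
```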

The d under the log is the size of the hypothesis class (which is exponential in this case). The other d parameterizes the difficulty of the exploration problem. Exp4 is the simplest algorithm that pulls those two parameters apart (though it’s obviously not a good algorithm for this case). It’s hard to formally capture “the difficulty of the exploration problem”, but intuitively it’s something like what you’d expect—how many options do you have to try at random before you are guaranteed to get useful signal? This is upper bounded by the number of output options. You can get tighter formal bounds in many cases but it’s one of those things where the real bound is kind of a problem-specific mess.
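For reference, the usual statement of the Exp4 guarantee has roughly the following shape. Here K (the number of actions available each round) and N (the number of experts or hypotheses) are notation introduced for this note, not symbols used elsewhere in the thread, and constant factors are omitted:

```latex
\mathbb{E}\left[\mathrm{Regret}_T\right] \;=\; O\!\left(\sqrt{T \, K \, \log N}\right)
```

Read this way, an exponentially large class N = 2^n contributes only a factor of about √n, while the exploration parameter K sits under the root directly.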

There are two hopes for not needing exponential time:

1. In imitation+RL, the exploration difficulty should depend on something like the accuracy of your imitation rather than on the size of the underlying domain (or maybe even better). You don’t have to try everything at random to get signal, if you have access to an expert who shows you a good option in each round. We can train A with demonstrations (we can get a demonstration just by calling Amplify(A)).

2. Many RL problems have tractable exploration despite large domains for a whole mess of complicated reasons.

(And note that we’ll be able to tell whether this is working, so in practice this is probably something that we can validate empirically—not something where we are going up against adversarial optimization pressure and so need a provable bound.)

This doesn’t seem likely to be the place where my approach gets stuck, and I don’t think it’s worth thinking about that much until we’ve made much more progress on understanding the task that Amplify(A) actually needs to perform for robustness, and on how amplification works more broadly, since (a) those are way more likely to be dealbreakers, in which case this doesn’t matter, and (b) it’s relatively likely that other progress will change our conception of the learning theory problem we need to solve, or obsolete it entirely.

If you want to understand these intuitions in detail it likely requires doing the equivalent of a course in learning theory and reading a bunch of papers in the area (which doesn’t sound worth it to me, as a use of your time). Overall this isn’t something where I feel excited about engaging in detail, except with experts in the relevant areas who I expect to know something or have intuitions that I don’t.

I have some additional questions that I’m still not sure about, that maybe you can answer without spending too much time.

It seems that if you can get a good online guarantee you are immune to distributional shifts (the online guarantee gives you a bound that’s not based on any assumptions about the input data). But to be practically meaningful the hypothesis class (on which the regret bound is based) has to include one that can approximate X well over all possible inputs, otherwise you could prove a good regret bound, and even perform well in practice, but still suffer badly from some future distributional shift, right?

Are you thinking that it will be possible to design X and the ML algorithm together such that we’ll know it can learn to approximate X well over all possible inputs, or alternatively are you thinking of some sort of online guarantee that does make some assumptions about the input data (e.g., that its distribution doesn’t change too quickly)? Or both (in a disjunctive sense)?

Are there any relevant papers you can point to that gave you the intuitions that you have on these questions?

It seems that if you can get a good online guarantee you are immune to distributional shifts (the online guarantee gives you a bound that’s not based on any assumptions about the input data).

The online guarantee says that on average, over a large sequence of trials, you will perform well. But if I train my system for a while and then deploy it, it could perform arbitrarily poorly after deployment (until I incorporate corrective data, which will generally be impossible for catastrophic failures).

But to be practically meaningful the hypothesis class (on which the regret bound is based) has to include one that can approximate X well over all possible inputs, otherwise you could prove a good regret bound, and even perform well in practice, but still suffer badly from some future distributional shift, right?

I don’t understand this (might be related to the previous point). If there is a hypothesis that performs well over the sequence of actual cases that you train on, then you will perform well on the sequence of actual data cases that you train on. For any other inputs, the online guarantee doesn’t say anything.

Are you thinking that it will be possible to design X and the ML algorithm together such that we’ll know it can learn to approximate X well over all possible inputs, or alternatively are you thinking of some sort of online guarantee that does make some assumptions about the input data (e.g., that its distribution doesn’t change too quickly)? Or both (in a disjunctive sense)?

I don’t think that anything will be learning to approximate anything else well over all possible inputs.

What does “X” refer to here?

I’m not imagining making any assumptions on the input data.

Are there any relevant papers you can point to that gave you the intuitions that you have on these questions?

I don’t think I fully understood the questions.

The online guarantee says that on average, over a large sequence of trials, you will perform well. But if I train my system for a while and then deploy it, it could perform arbitrarily poorly after deployment (until I incorporate corrective data, which will generally be impossible for catastrophic failures).

Take the $2\sqrt{d \log(d)\, T}$ regret bound as an example. Suppose $d \log(d)$ is small (what I meant by “a good online guarantee”), then total regret is essentially bounded by $\sqrt{T}$, which means that if you max out the regret during training, after deployment it shouldn’t accumulate more than about $1/\sqrt{T}$ regret per time step, regardless of distributional shifts. Am I misunderstanding something here?
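One way to read this claim, as a sketch: if cumulative regret already equals the bound at round T, then the only additional regret the bound permits at round T+1 is the increment of the bound itself, which is about √(d log(d)/T). The function names are hypothetical, and the calculation only applies under the premise that regret really has been maxed out.

```python
import math

def regret_bound(d: int, T: int) -> float:
    """Total regret bound 2*sqrt(d*log(d)*T) quoted in the discussion."""
    return 2.0 * math.sqrt(d * math.log(d) * T)

def remaining_budget_per_step(d: int, T: int) -> float:
    """Growth of the bound from round T to round T+1.

    If regret is exactly at the bound after T rounds, this is the most
    regret the bound still allows on the next round.
    """
    return regret_bound(d, T + 1) - regret_bound(d, T)

# The per-step budget shrinks like 1/sqrt(T).
for T in (100, 10_000, 1_000_000):
    print(T, remaining_budget_per_step(2, T), math.sqrt(2 * math.log(2) / T))
```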

What does “X” refer to here?

It’s what we were talking about previously, the set of subtasks of “Try to verify that the current agent is benign.”

I don’t think that anything will be learning to approximate anything else well over all possible inputs.

Earlier, you wrote:

When we say that the “model can learn to do X” then we are saying roughly “the model competes with a set of hypotheses including one that does X.”

And I thought one possible interpretation of “do X” is “approximates X well over all possible inputs”. If that’s not what you meant by “do X”, what does it mean?

To step back a bit, I’m finding it hard to understand online guarantees because the “regret bounds” being proven are all relative to a class of hypotheses, but in practice we actually care about performance relative to the ground truth, so in addition to the kinds of regret bounds we see in that OL survey paper, we also need to characterize the performance of the hypothesis class relative to the ground truth, right? What do you expect that characterization to look like? I’m trying to understand your thinking / intuitions about this.

You aren’t misunderstanding the result, what you say makes complete sense. Now I understand your questions.

“Max out the regret during training” is extremely infeasible—for example, you’d need to know the exact regret bound (which we don’t), and much more importantly you’d need to do perfect adversarial training to find every possible mistake the model makes (which we can’t).

I’m just hoping to directly use the bound on the total loss.

That is, suppose that I’m trying to apply Amplify(A) to some “not too hard” task. Then I want to argue something like:

1. If we’ve built Amplify() correctly, then in order for Amplify(A) to succeed at a “not too hard” task, A just needs to solve some “not too hard” subtasks. (This is the main constraint on Amplify() other than preserving alignment, without this constraint it would be relatively easy.)

2. Because those subtasks are “not too hard,” we think that A has a regret bound vs. some model that solves those tasks well (on average). Roughly speaking this is the definition of “not too hard,” more on this below.

3. So if we keep applying Amplify(A) to the task, and training A on the resulting subtasks, then eventually A will solve those tasks well (on average). At that point, Amplify(A) succeeds at the original task (on average).

4. This argument goes through even if the task is changing over time, as long as the “not too hard” bound applies uniformly over the sequence (such that there is a single model in step #2 that succeeds for every element in the sequence).

In the original example we were discussing, I’m applying Amplify(A) to the task {test whether model B is robust}. When I talk about robustness as a subproblem of alignment, I mean precisely setting up the task {test whether model B is robust} in such a way that it’s “not too hard,” i.e. such that we can apply the above argument to the task {test whether model B is robust}.

Performing well on average is adequate for this task because I can keep applying Amplify(A) to the task {test whether model B is robust}, over and over again. If it fails 10% of the time, that just means 10% of the tests fail, not that model B can fail catastrophically on 10% of inputs or with 10% probability. (This is the magic we want out of a solution to robustness—to turn an average-case guarantee into a worst-case guarantee.)

And I thought one possible interpretation of “do X” is “approximates X well over all possible inputs”. If that’s not what you meant by “do X”, what does it mean?

The X comes with some performance measure. I mean that the model performs well enough on average over the distribution of subtasks.

(This still depends on the distribution of top-level tasks. For now you can imagine imposing this requirement for the worst case distribution of top-level tasks that can occur during training, though I think we can be a little bit more precise in practice.)

In practice we actually care about performance relative to the ground truth, so in addition to the kinds of regret bounds we see in that OL survey paper, we also need to characterize the performance of the hypothesis class relative to the ground truth, right?

Yes, we need to argue that there is some hypothesis in the class that is able to perform well. This is what I mean by “not too hard.” Ultimately we will be assuming that our ML is able to do something impactful in the world, and then trying to argue that if it was able to do that impactful thing, then we could also solve the subtasks necessary to do the same impactful thing safely (since the goal of this approach is to compete with unaligned ML).

In order to argue that a task is not that hard, we will use some combination of:

1. The assumption that our ML is good enough at some concrete tasks whose difficulty doesn’t scale with the sophistication of the models we are training. This can be verified empirically in advance, and once it’s true it tends to become more true.

2. The assumption that our ML system is good enough at doing something impactful in the world. (Which we’ll combine with some argument about how the tasks we are interested in are not harder than that task, or “not much harder” in the sense that the difference falls into the preceding category, of “easy” tasks that we’ve demonstrated empirically that our model can solve.)

Overall, things are not really going to be this formal, but hopefully this helps communicate the direction of my thinking/hopes.

“Max out the regret during training” is extremely infeasible

If you can’t max out the regret during training, then I’m having trouble seeing how to make use of such a regret bound the way you want to use it. Let’s say the current total regret is 1000 less than the provable regret bound. Then as far as the provable bound is concerned, the model could answer the next 1000 questions incorrectly and still satisfy the bound, so you can’t just repeat a task some number of times and then conclude that there’s a small probability that all of the answers will be wrong.

For any sequence of queries S, and any model M in the comparison class...

(average performance of A on S) > (average performance of M on S) - (arbitrarily small constant)

This gives you a lower bound on (average performance of A on S). That’s what I want.

Yes, you could get a bad result 1000 times in a row. To guarantee a good result in that setting, you’d need to run 1001 times (which will still probably be a tiny fraction of your overall training time).
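The counting argument behind "1001 times" can be sketched as follows, under simplifying assumptions that are mine rather than the thread's: each run is scored 0 or 1, the comparison model M succeeds on every run, and so each bad run by A uses up one unit of the remaining regret slack. The function name is hypothetical.

```python
import math

def guaranteed_good_runs(n_runs: int, regret_slack: float) -> int:
    """Worst-case number of good runs out of n_runs.

    Assumes each bad run costs exactly one unit of regret relative to a
    comparison model that always succeeds, so at most floor(regret_slack)
    runs can be bad without violating the bound.
    """
    if n_runs < 0 or regret_slack < 0:
        raise ValueError("arguments must be non-negative")
    return max(0, n_runs - math.floor(regret_slack))

print(guaranteed_good_runs(1000, 1000))  # all 1000 runs may be bad
print(guaranteed_good_runs(1001, 1000))  # at least one run must be good
```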

What if during training you can’t come close to maxing out regret for the agents that have to be trained with human involvement? That “missing” regret might come due at any time after deployment, and has to be paid with additional oversight/feedback/training data in order for those agents to continue to perform well, right? (In other words, there could be a distributional shift that causes the agents to stop performing well without additional training.) But at that time human feedback may be horribly slow compared to how fast AIs think, thus forcing IDA to either not be competitive with other AIs or to press on without getting enough human feedback to ensure safety.

Am I misunderstanding anything here? (Are you perhaps assuming that we can max out regret during training for the agents that have to be trained with human involvement, but not necessarily for the higher level agents?)

That “missing” regret might come due at any time after deployment, and has to be paid with additional oversight/feedback/training data in order for those agents to continue to perform well, right? (In other words, there could be a distributional shift that causes the agents to stop performing well without additional training.)

Yes. (This is true for any ML system, though for an unaligned system the new training data can just come from the world itself.)

Are you perhaps assuming that we can max out regret during training for the agents that have to be trained with human involvement, but not necessarily for the higher level agents?

Yeah, I’m relatively optimistic that it’s possible to learn enough from humans that the lower level agent remains universal (+ aligned etc.) on arbitrary distributions. This would probably be the case if you managed to consistently break queries down into simpler pieces until arriving at very simple queries. And of course it would also be the case if you could eliminate the human from the process altogether.

Failing either of those, it’s not clear whether we can do anything formally (vs. expanding the training distribution to cover the kinds of things that look like they might happen, having the human tasks be pretty abstract and independent from details of the situation that change, etc.). I’d still expect to be OK, but we’d need to think about it more.

(I still think it’s 50%+ that we can reduce the human to small queries or eliminate them altogether, assuming that iterated amplification works at all, so I would prefer to start with the “does iterated amplification work at all” question.)

And note that we’ll be able to tell whether this is working, so in practice this is probably something that we can validate empirically—not something where we are going up against adversarial optimization pressure and so need a provable bound.

This is kind of surprising. (I had assumed that you need a provable bound since you talk about guarantees and cite a paper that talks about provable bounds.)

If you have some ML algorithm that only has an exponential provable bound but works well in practice, aren’t you worried that you might hit a hard instance of some task in the future that it would perform badly on, or there’s a context shift that causes a whole bunch of tasks to become harder to learn? Is the idea to detect that at run time and either pay the increased training cost or switch to another approach if that happens?

If you want to understand these intuitions in detail it likely requires doing the equivalent of a course in learning theory and reading a bunch of papers in the area (which doesn’t sound worth it to me, as a use of your time).

Ok, that’s good to know. I think the explanations you gave so far are good enough for my purposes at this point. (You might want to consider posting them somewhere easier to find with a warning similar to this one, so people don’t try to figure out what your intuitions are from the OL survey paper like I did.)

The top-level distribution for Amplify(A) puts 50% of its mass on “Try to verify that the current agent is benign.” (Or something like 50%, it’s <100% only because you are training A both to evaluate actions and to certify benignness.) If it performs badly on the

current agent, then it’s not performing well on average. There is no other distribution over distilled agents that we are defining the performance guarantee with respect to.Because the current agent is training over time, we really want to talk about online guarantees rather than assuming a fixed distribution. Amplify(A) is being trained on the sequence of tasks “Try to verify that the agent at t=0 is benign,” “Try to verify that the agent at t=1 is benign,” “Try to verify that the agent at t=2 is benign...” etc.

If there was a learnable model that performs well on most of the subtasks of “Try to verify that the current agent is benign,” then we get to infer that A in fact performs well on most of these subtasks. If amplification preserves average-case performance, then we get to conclude that Amplify(A) performs well on most tasks of the form “Try to verify that the current agent is benign.” If Amplify(A) consistently concludes that the agent is benign, e.g. if Amplify(A) consistently fails to find an input on which the agent behaves malignly, then we hope to conclude that the agent is in fact benign at most steps.

Thanks, I think I’m finally starting to understand your thinking here.

Sounds like I really need to figure out how online guarantees work in relation to your scheme. (It’s not clear to me how to map between your setup and the setup in the online learning survey that you cited, e.g., what corresponds to “receive true answer” after every prediction and what corresponds to the set of hypotheses that regret is being measured against.) I’ve been putting it off and just assuming a fixed distribution because you wrote “Overall, I don’t think this distinction is a huge deal.”

How do we determine this? (What if the current agent has moved into a part of the agent space such that there was no longer a learnable model that performs well on most of the subtasks of “Try to verify that the current agent is benign”?)

What does “most” mean here and why is that good enough? (If there are more than 100 steps and “most” means 99% then you can’t rule out having malign agents in some of the steps, which seems like a problem?)

As part of designing a technique for optimizing worst-case performance, we need to argue that the overseer’s job isn’t too hard (so that Amplify(A) is qualified to perform the task). If we remove this restriction, then optimizing worst case performance wouldn’t be scary—adversarial training would probably work fine.

See the section “Limited feedback (bandits)” starting on page 177. Online learning doesn’t require seeing the true answer.

We don’t compete with any explicit set of hypotheses. When we say that the “model can learn to do X” then we are saying roughly “the model competes with a set of hypotheses including one that does X.”

Most means most agents over the training process. But:

Once you have an agent that seems OK, you can freeze that agent and then run the certification process for significantly longer.

I expect the model is probably going to have some probability of behaving malignly on any given input anyway based on internal stochasticity. So you probably already need to do something based on ensembling / ensuring sufficient per-timestep robustness.

I’m still having trouble matching up that section with your setup. (It assumes that the agent sees the value of the loss function after every prediction, which I think is not the case in your setup?) Is Section 6 on Online Active Learning in this more comprehensive survey closer to what you have in mind? If so, can you say which of the subsections of Section 6 is the closest? Or alternatively, can you explain the actual formal setup and guarantee you’re hoping ML research will provide, which will be sufficient to accomplish what you need? (Or give an example of such formal setup/guarantee if multiple ones could work?)

Also, what if in the future the most competitive ML algorithms do not provide the kinds of guarantees you need? How likely do you think that is, and what’s the expected outcome (for your approach and AI alignment in general) conditional on that?

Don’t we need to know the size of the set of hypotheses in order to derive a regret bound?

You do get to see the loss function, if you couldn’t see the loss function then we couldn’t train A.

Amplify(A) is computed by calling A a bunch of times. The point of amplification is to set things up so that Amplify(A) will work well if the average call to A works well. A random subset of the calls to A are then evaluated (by calling Amplify(A)), so we get to see their loss.

(Obviously you get identical expected regret bounds if you evaluate an x fraction of decisions at random, just with 1/x times more regret—you get a regret bound on the sequence whose loss you evaluate, and that regret is at least x times the total.)

What does d (the number of bandit arms) correspond to in your setup? I’m guessing it’s the size of the hypothesis class that you’re competing with, which must be exponentially large? Since the total regret bound is 2√dlog(d)T (page 181, assuming you see the loss every round) it seems that you’d have to see an exponential number of losses (i.e., calls to Amplify(A)) before you could get a useful per-round guarantee. What am I missing here?

The d under the log is the size of the hypothesis class (which is exponential in this case). The other d parameterizes the difficulty of the exploration problem. Exp4 is the simplest algorithm that pulls those two parameters apart (though it’s obviously not a good algorithm for this case). It’s hard to formally capture “the difficulty of the exploration problem”, but intuitively it’s something like what you’d expect—how many options do you have to try at random before you are guaranteed to get useful signal? This is upper bounded by the number of output options. You can get tighter formal bounds in many cases but it’s one of those things where the real bound is kind of a problem-specific mess.

There are two hopes for not needing exponential time:

In imitation+RL, the exploration difficulty should depend on something like the accuracy of your imitation rather than on the size of the underlying domain (or maybe even better). You don’t have to try everything at random to get signal, if you have access to an expert who shows you a good option in each round. We can train A with demonstrations (we can get a demonstration just by calling Amplify(A)).

Many RL problems have tractable exploration despite large domains for a whole mess of complicated reasons.

(And note that we’ll be able to tell whether this is working, so in practice this is probably something that we can validate empirically—not something where we are going up against adversarial optimization pressure and so need a provable bound.)

This doesn’t seem likely to be the place where my approach gets stuck, and I don’t think it seems worth thinking about it that much until we’ve made much more progress on understanding the task that Amplify(A) actually needs to perform for robustness and on how amplification works more broadly, since (a) those are way more likely to be dealbreakers, in which case this doesn’t matter, (b) it’s relatively likely that other progress will change our conception of the learning theory problem we need to solve or obsolete it entirely.

If you want to understand these intuitions in detail it likely requires doing the equivalent of a course in learning theory and reading a bunch of papers in the area (which doesn’t sound worth it to me, as a use of your time). Overall this isn’t something where I feel excited about engaging in detail, except with experts in the relevant areas who I expect to know something or have intuitions that I don’t.

I have some additional questions that I’m still not sure about, that maybe you can answer without spending too much time.

It seems that if you can get a good online guarantee you are immune to distributional shifts (the online guarantee gives you a bound that’s not based on any assumptions about the input data). But to be practically meaningful the hypothesis class (on which the regret bound is based) has to include one that can approximate X well over all possible inputs, otherwise you could prove a good regret bound, and even perform well in practice, but still suffer badly from some future distributional shift, right?

Are you thinking that it will be possible to design X and the ML algorithm together such that we’ll know it can learn to approximate X well over all possible inputs, or alternatively are you thinking of some sort of online guarantee that does make some assumptions about the input data (e.g., that its distribution doesn’t change too quickly)? Or both (in a disjunctive sense)?

Are there any relevant papers you can point to that gave you the intuitions that you have on these questions?

The online guarantee says that on average, over a large sequence of trials, you will perform well. But if I train my system for a while and then deploy it, it could perform arbitrarily poorly after deployment (until I incorporate corrective data, which will generally be impossible for catastrophic failures).

I don’t understand this (might be related to the previous point). If there is a hypothesis that performs well over the sequence of actual cases that you train on, then you will perform well on the sequence of actual data cases that you train on. For any other inputs, the online guarantee doesn’t say anything.

I don’t think that anything will be learning to approximate anything else well over all possible inputs.

What does “X” refer to here?

I’m not imagining making any assumptions on the input data.

I don’t think I fully understood the questions.

Take the 2√(d log(d) T) regret bound as an example. Suppose d log(d) is small (what I meant by “a good online guarantee”); then total regret is essentially bounded by √T, which means that if you max out the regret during training, after deployment it shouldn’t accumulate more than about 1/√T regret per time step, regardless of distributional shifts. Am I misunderstanding something here?
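To make the arithmetic in this question concrete, here is a minimal sketch that evaluates the bound, reading it as 2√(d log(d) T) with a natural logarithm (both the grouping under the radical and the log base are assumptions, and the value d = 10 is purely illustrative). It prints the cumulative bound, the bound averaged over T steps, and the amount by which the bound grows when one more step is added, which is roughly √(d log(d) / T).

```python
import math

def regret_bound(d, T):
    """Cumulative bound 2*sqrt(d*log(d)*T); natural log is an assumption."""
    return 2.0 * math.sqrt(d * math.log(d) * T)

def average_bound(d, T):
    """Bound on regret averaged over the first T steps."""
    return regret_bound(d, T) / T

def marginal_bound(d, T):
    """How much the cumulative bound grows when step T+1 is added."""
    return regret_bound(d, T + 1) - regret_bound(d, T)

d = 10  # hypothetical value, for illustration only
for T in (10**2, 10**4, 10**6):
    print(T, regret_bound(d, T), average_bound(d, T), marginal_bound(d, T))
```

The marginal growth shrinks like 1/√T, which is the “about 1/√T regret per time step” figure above; it only constrains post-deployment behavior if the cumulative bound has actually been used up by then.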

It’s what we were talking about previously, the set of subtasks of “Try to verify that the current agent is benign.”

Earlier, you wrote:

And I thought one possible interpretation of “do X” is “approximates X well over all possible inputs”. If that’s not what you meant by “do X”, what does it mean?

To step back a bit, I’m finding it hard to understand online guarantees because the “regret bounds” being proven are all relative to a class of hypotheses, but in practice we actually care about performance relative to the ground truth, so in addition to the kinds of regret bounds we see in that OL survey paper, we also need to characterize the performance of the hypothesis class relative to the ground truth, right? What do you expect that characterization to look like? I’m trying to understand your thinking / intuitions about this.

You aren’t misunderstanding the result, what you say makes complete sense. Now I understand your questions.

“Max out the regret during training” is extremely infeasible—for example, you’d need to know the exact regret bound (which we don’t), and much more importantly you’d need to do perfect adversarial training to find every possible mistake the model makes (which we can’t).

I’m just hoping to directly use the bound on the total loss.

That is, suppose that I’m trying to apply Amplify(A) to some “not too hard” task. Then I want to argue something like:

1. If we’ve built Amplify() correctly, then in order for Amplify(A) to succeed at a “not too hard” task, A just needs to solve some “not too hard” subtasks. (This is the main constraint on Amplify() other than preserving alignment, without this constraint it would be relatively easy.)

2. Because those subtasks are “not too hard,” we think that A has a regret bound vs. some model that solves those tasks well (on average). Roughly speaking this is the definition of “not too hard,” more on this below.

3. So if we keep applying Amplify(A) to the task, and training A on the resulting subtasks, then eventually A will solve those tasks well (on average). At that point, Amplify(A) succeeds at the original task (on average).

4. This argument goes through even if the task is changing over time, as long as the “not too hard” bound applies uniformly over the sequence (such that there is a single model in step #2 that succeeds for every element in the sequence).
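As a toy illustration of what “a regret bound vs. some model” means in steps 2 to 4, here is a sketch using the standard Hedge (exponential weights) algorithm over a small set of fixed “experts.” This is not the training procedure discussed in this thread; the experts, the loss sequence, and all names are hypothetical. The sequence shifts halfway through, but expert 0 does well throughout, so it plays the role of the single model that succeeds on every element of the sequence.

```python
import math
import random

def hedge(loss_rows, eta):
    """Run Hedge on a sequence of expert-loss vectors with entries in [0, 1].

    Returns (learner_loss, best_expert_loss). learner_loss is the expected
    loss when an expert is sampled from the current weights at each step.
    """
    n = len(loss_rows[0])
    log_w = [0.0] * n          # log-weights, kept in log space for stability
    totals = [0.0] * n         # cumulative loss of each expert
    learner_loss = 0.0
    for losses in loss_rows:
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        learner_loss += sum(wi * li for wi, li in zip(w, losses)) / z
        for i, li in enumerate(losses):
            log_w[i] -= eta * li
            totals[i] += li
    return learner_loss, min(totals)

T, N = 2000, 8
rng = random.Random(0)
rows = []
for t in range(T):
    row = [rng.random() for _ in range(N)]
    row[0] = 0.1 * rng.random()            # expert 0: good on the whole sequence
    row[1] = 0.0 if t < T // 2 else 1.0    # expert 1: perfect, then useless after the shift
    rows.append(row)

eta = math.sqrt(8 * math.log(N) / T)       # standard tuning for a known horizon T
learner, best = hedge(rows, eta)
bound = math.sqrt(T * math.log(N) / 2)     # textbook Hedge regret bound for this eta
print(f"regret = {learner - best:.1f}, bound = {bound:.1f}")
```

The bound on regret holds for any loss sequence, shifted or not. What it cannot do is say anything about absolute performance unless some expert in the class performs well on the whole sequence, which is the role the “not too hard” assumption plays in the argument above.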

In the original example we were discussing, I’m applying Amplify(A) to the task {test whether model B is robust}. When I talk about robustness as a subproblem of alignment, I mean precisely setting up the task {test whether model B is robust} in such a way that it’s “not too hard,” i.e. such that we can apply the above argument to the task {test whether model B is robust}.

Performing well on average is adequate for this task because I can keep applying Amplify(A) to the task {test whether model B is robust}, over and over again. If it fails 10% of the time, that just means 10% of the tests fail, not that model B can fail catastrophically on 10% of inputs or with 10% probability. (This is the magic we want out of a solution to robustness: to turn an average-case guarantee into a worst-case guarantee.)

The X comes with some performance measure. I mean that the model performs well enough on average over the distribution of subtasks.

(This still depends on the distribution of top-level tasks. For now you can imagine imposing this requirement for the worst case distribution of top-level tasks that can occur during training, though I think we can be a little bit more precise in practice.)

Yes, we need to argue that there is some hypothesis in the class that is able to perform well. This is what I mean by “not too hard.” Ultimately we will be assuming that our ML is able to do something impactful in the world, and then trying to argue that if it was able to do that impactful thing, then we could also solve the subtasks necessary to do the same impactful thing safely (since the goal of this approach is to compete with unaligned ML).

In order to argue that a task is not that hard, we will use some combination of:

The assumption that our ML is good enough at some concrete tasks whose difficulty doesn’t scale with the sophistication of the models we are training. This can be verified empirically in advance, and once it’s true it tends to become more true.

The assumption that our ML system is good enough at doing something impactful in the world. (Which we’ll combine with some argument about how the tasks we are interested in are not harder than that task, or “not much harder” in the sense that the difference falls into the preceding category, of “easy” tasks that we’ve demonstrated empirically that our model can solve.)

Overall, things are not really going to be this formal, but hopefully this helps communicate the direction of my thinking/hopes.

If you can’t max out the regret during training, then I’m having trouble seeing how to make use of such a regret bound the way you want to use it. Let’s say the current total regret is 1000 less than the provable regret bound. Then as far as the provable bound is concerned, the model could answer the next 1000 questions incorrectly and still satisfy the bound, so you can’t just repeat a task some number of times and then conclude that there’s a small probability that all of the answers will be wrong.

If A satisfies a regret bound, then:

For any sequence of queries S, and any model M in the comparison class...

(average performance of A on S) > (average performance of M on S) - (arbitrarily small constant)

This gives you a lower bound on (average performance of A on S). That’s what I want.

Yes, you could get a bad result 1000 times in a row. To guarantee a good result in that setting, you’d need to run 1001 times (which will still probably be a tiny fraction of your overall training time).
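Spelled out as a sketch (the notation is introduced here for illustration: ℓ is a per-query loss in [0, 1], s_1, ..., s_T is the sequence S, and R(T) is a sublinear regret bound), the step from a regret bound to a bound on average performance is:

```latex
\[
\sum_{t=1}^{T} \ell_A(s_t) - \sum_{t=1}^{T} \ell_M(s_t) \le R(T)
\quad\Longrightarrow\quad
\frac{1}{T}\sum_{t=1}^{T} \ell_A(s_t) \le \frac{1}{T}\sum_{t=1}^{T} \ell_M(s_t) + \frac{R(T)}{T},
\qquad \frac{R(T)}{T} \to 0 .
\]
% Counting version, assuming 0/1 losses and that M succeeds on every run in question:
% if at most k further units of regret are permitted, then A fails on at most k of
% those runs, so any k + 1 runs contain at least one success.
\[
\#\{\text{failures of } A\} \le k
\quad\Longrightarrow\quad
k + 1 \text{ runs suffice; for } k = 1000 \text{ this is } 1001 .
\]
```

The counting step relies on the comparison model M succeeding on each of those runs; a failure of A on a query where M also fails does not use up any regret.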

What if during training you can’t come close to maxing out regret for the agents that have to be trained with human involvement? That “missing” regret might come due at any time after deployment, and has to be paid with additional oversight/feedback/training data in order for those agents to continue to perform well, right? (In other words, there could be a distributional shift that causes the agents to stop performing well without additional training.) But at that time human feedback may be horribly slow compared to how fast AIs think, thus forcing IDA to either not be competitive with other AIs or to press on without getting enough human feedback to ensure safety.

Am I misunderstanding anything here? (Are you perhaps assuming that we can max out regret during training for the agents that have to be trained with human involvement, but not necessarily for the higher level agents?)

Yes. (This is true for any ML system, though for an unaligned system the new training data can just come from the world itself.)

Yeah, I’m relatively optimistic that it’s possible to learn enough from humans that the lower level agent remains universal (+ aligned etc.) on arbitrary distributions. This would probably be the case if you managed to consistently break queries down into simpler pieces until arriving at very simple queries. And of course it would also be the case if you could eliminate the human from the process altogether.

Failing either of those, it’s not clear whether we can do anything formally (vs. expanding the training distribution to cover the kinds of things that look like they might happen, having the human tasks be pretty abstract and independent from details of the situation that change, etc.). I’d still expect to be OK but we’d need to think about it more. (I still think it’s 50%+ that we can reduce the human to small queries or eliminate them altogether, assuming that iterated amplification works at all, so would prefer to start with the “does iterated amplification work at all” question.)

This is kind of surprising. (I had assumed that you need a provable bound since you talk about guarantees and cite a paper that talks about provable bounds.)

If you have some ML algorithm that only has an exponential provable bound but works well in practice, aren’t you worried that you might hit a hard instance of some task in the future that it would perform badly on, or there’s a context shift that causes a whole bunch of tasks to become harder to learn? Is the idea to detect that at run time and either pay the increased training cost or switch to another approach if that happens?

Ok, that’s good to know. I think the explanations you gave so far are good enough for my purposes at this point. (You might want to consider posting them somewhere easier to find with a warning similar to this one, so people don’t try to figure out what your intuitions are from the OL survey paper like I did.)