I agree that it’s bad to assume that the training process is dumb, but I think it’s fine to assume that it’ll be slow and continuous such that there’s always some previous model such that an amplified version of it can oversee the current model.
Why do you think it’s fine to assume this? (Do you think this is unlikely to change in the future, or just that this is the limit of the scope of the problem that you’re working on?)
Yeah, this is where you’re relying on the human more. For early models, I think you’re mostly relying on the human having access to good enough interpretability tools that they can understand simple models without help.
in the pure supervised case just means letting a human answer the question given access to the model
This looks like a big disconnect between us. The thing that touched off this discussion was Ought’s switch from Factored Cognition to Factored Evaluation, and Rohin’s explanation: “In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal.”
I think if we’re using SL for the question answering part and only using RL for “oversight” (“trying to get M to be transparent and to verify that it is in fact doing the right thing”) then I’m a lot more optimistic since we only have to worry about security / reward gaming problems in the latter part, and we can do things like making it a constraint and doing quantilization without worrying about competitiveness. But I think Paul and Ought’s plan is to use RL for both. In that case it doesn’t help much to make “oversight” a constraint since the security / reward gaming problem in the “answer evaluation” part would still be there. And the problem just generally seems a lot harder because there could be so many different kinds of flaws in the “answer evaluation” part that could be exploited.
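For readers unfamiliar with quantilization, here is a minimal sketch of the idea, assuming we can draw candidate actions from some trusted base distribution and score them with a possibly exploitable utility function. Instead of taking the single highest-scoring action, it picks uniformly at random from the top fraction `q` of the candidates. The function name `quantilize` and its parameters are hypothetical, chosen for illustration only.

```python
import random

def quantilize(base_samples, utility, q, rng=random):
    """Return a random action from the top-q fraction of base_samples.

    base_samples: actions drawn from a trusted base distribution (assumed given).
    utility:      scoring function, which may be imperfect or gameable.
    q:            fraction in (0, 1]; q close to 0 approaches pure maximization,
                  q = 1 just samples from the base distribution.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top-q slice, at least one action
    return rng.choice(ranked[:k])
```

The design choice is that limiting optimization pressure to the top `q` fraction also limits how heavily the chosen action can lean on flaws in `utility`, at some cost in expected score.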
Why do you think it’s fine to assume this? (Do you think this is unlikely to change in the future, or just that this is the limit of the scope of the problem that you’re working on?)
I think I would be fairly surprised if future ML techniques weren’t smooth in this way, so I think it’s a pretty reasonable assumption.
This seems to be assuming High Bandwidth Overseer. What about LBO?
A low-bandwidth overseer seems unlikely to be competitive to me. Though it’d be nice if it worked, I think you’ll probably want to solve the problem of weird hacky inputs via something like filtered-HCH instead. That being said, I expect the human to drop out of the process fairly quickly—it’s mostly only useful in the beginning before the model learns how to do decompositions properly—at some point you’ll want to switch to implementing Amp(M) as M consulting M rather than H consulting M.
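The difference between “H consulting M” and “M consulting M” can be sketched as follows. This is a toy illustration under stated assumptions, not the actual amplification procedure: `amplify`, `decompose`, and `combine` are hypothetical names, and the “model” here can only add integers. Whoever supplies `decompose` and `combine` plays the overseer role: a human in the early phase, the model itself later on.

```python
from typing import Callable, List

Model = Callable[[str], str]

def amplify(decompose: Callable[[str], List[str]],
            combine: Callable[[str, List[str]], str],
            model: Model) -> Model:
    """Build Amp(M): split a question, ask M each piece, merge the answers."""
    def amplified(question: str) -> str:
        subquestions = decompose(question)
        subanswers = [model(sq) for sq in subquestions]
        return combine(question, subanswers)
    return amplified

# Toy stand-ins (assumptions for illustration only).
def toy_model(question: str) -> str:
    # Stand-in for M: sums a "+"-separated list of integers.
    return str(sum(int(term) for term in question.split("+")))

def split_sum(question: str) -> List[str]:
    # Overseer's decomposition: cut the sum into two halves.
    terms = question.split("+")
    mid = len(terms) // 2
    return ["+".join(terms[:mid]), "+".join(terms[mid:])]

def add_answers(question: str, subanswers: List[str]) -> str:
    # Overseer's recombination: add the partial sums.
    return str(sum(int(a) for a in subanswers))

amp = amplify(split_sum, add_answers, toy_model)
```

In this sketch, switching from H to M as the overseer only changes who implements `split_sum` and `add_answers`; the surrounding structure stays the same.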
This looks like a big disconnect between us. The thing that touched off this discussion was Ought’s switch from Factored Cognition to Factored Evaluation, and Rohin’s explanation: “In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal.”
I think if we’re using SL for the question answering part and only using RL for “oversight” (“trying to get M to be transparent and to verify that it is in fact doing the right thing”) then I’m a lot more optimistic since we only have to worry about security / reward gaming problems in the latter part, and we can do things like making it a constraint and doing quantilization without worrying about competitiveness. But I think Paul and Ought’s plan is to use RL for both. In that case it doesn’t help much to make “oversight” a constraint since the security / reward gaming problem in the “answer evaluation” part would still be there. And the problem just generally seems a lot harder because there could be so many different kinds of flaws in the “answer evaluation” part that could be exploited.
I mostly agree with this and it is a disagreement I have with Paul in that I am more skeptical of relaxing the supervised setting. That being said, you definitely can still make oversight a constraint even if you’re optimizing an RL signal and I do think it helps, since it gives you a way to separately verify that the system is actually being transparent. The idea in this sort of a setting would be that, if your system achieves high performance on the RL signal, then it must be outputting answers which the amplified human likes—but then the concern is that it might be tricking the human or something. But then if you can use oversight to look inside the model and verify that it’s actually being transparent, then you can rule out that possibility. By making the transparency part a constraint rather than an objective, it might help prevent the model from gaming the transparency part, which I expect to be the most important part and in turn could help you detect if there was any gaming going on of the RL signal.
For the record, though, I don’t currently think that making the transparency part a constraint is a good idea. First, because I expect transparency to be hard enough that you’ll want to be able to benefit from having a strong gradient towards it. And second, because I don’t think it actually helps prevent gaming very much: even if your training process doesn’t explicitly incentivize gaming, I expect that by default many mesa-optimizers will have objectives that benefit from it. Thus, what you really want is a general solution for preventing your mesa-optimizer from ever doing anything like that, which I expect to be something like corrigibility or myopia, rather than just trying to rely on your training process not incentivizing it.
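To make the constraint-versus-objective distinction concrete, here is a minimal sketch, assuming each candidate model can be summarized by an RL reward and a transparency score in [0, 1]. All names, scores, and the threshold are hypothetical. As an objective, transparency is a weighted term that can be traded off against reward; as a constraint, any candidate below the threshold is rejected regardless of reward.

```python
def objective_score(reward, transparency, weight=1.0):
    """Transparency as an objective: a graded term added to the reward."""
    return reward + weight * transparency

def constrained_score(reward, transparency, threshold=0.9):
    """Transparency as a constraint: below the threshold, reward is ignored."""
    return reward if transparency >= threshold else float("-inf")

# Hypothetical candidates: name -> (reward, transparency score).
candidates = {
    "opaque_high_reward": (10.0, 0.2),
    "transparent_ok_reward": (6.0, 0.95),
}

best_as_objective = max(candidates, key=lambda n: objective_score(*candidates[n]))
best_as_constraint = max(candidates, key=lambda n: constrained_score(*candidates[n]))
```

With these made-up numbers, the objective version prefers the opaque candidate because its reward outweighs the transparency term, while the constraint version rules it out. Note that the hard filter gives no graded signal about how close a rejected candidate is to the threshold, which is one way of reading the “strong gradient” concern above.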
I think I would be fairly surprised if future ML techniques weren’t smooth in this way, so I think it’s a pretty reasonable assumption.
This is kind of tangential at this point, but I’m not so sure about this. Humans can sometimes optimize things without being slow and continuous, so there must be algorithms that can do this, and such algorithms could be invented directly or themselves be produced via (dumber) ML. As another intuition pump, suppose the algorithm is just gradient descent with some added pattern recognizers that can say “hey, I see where this is going, let’s jump directly there.”
A low-bandwidth overseer seems unlikely to be competitive to me.
Has this been written up anywhere, and is it something that Paul agrees with? (I think last time he talked about HBO vs LBO, he was still 50/50 on them.)
I mostly agree with this and it is a disagreement I have with Paul in that I am more skeptical of relaxing the supervised setting.
Ah ok, so your earlier comments were addressing a different and easier problem than the one I have in mind, and that’s (mostly) why you sounded more optimistic than me.
Do you have a good sense of why Paul disagrees with you, and if so can you explain?
(I haven’t digested your mechanistic corrigibility post yet. May have more to say after I do that.)