But sadly, you don’t have any guarantee that it will output the optimal element
If I understand the setup correctly, there’s no guarantee that the optimal element would be good, right? It’s just likely since the optimal element a priori shouldn’t be unusually bad, and you’re assuming most satisficing elements are fine.
This initially threw me off regarding what problem you’re trying to solve. My best current guess is:
We’re assuming that if we could get a random satisficing action, we’d be happy with that with high probability. (So intuitively, we’re not asking for extremely hard-to-achieve outcomes relative to how well-specified the objective is.)
So the only problem is how to randomly sample from the set of satisficing actions in a computationally efficient way, which is what this post is trying to solve, assuming access to an oracle that gives adversarial satisficing actions (see the sketch below).
As an example, we might want to achieve outcomes that require somewhat superhuman intelligence. Our objective specification is very good, but it leaves some room for an adversary to mess with us while satisficing. We’re worried about an adversary because we had to train this somewhat superhuman AI, which may have different goals than just doing well on the objective.
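To make the contrast concrete, here is a toy sketch of the two regimes as I read them. All the names (actions, utility, threshold, adversary_pref) are hypothetical placeholders, and explicitly enumerating the action set is exactly what’s computationally infeasible in the real setting; this is just to pin down the problem statement.

```python
import random

def adversarial_oracle(actions, utility, threshold, adversary_pref):
    """The worst-case oracle assumed here: it returns *a* satisficing
    action, but the one its own hidden preference likes best."""
    satisficing = [a for a in actions if utility(a) >= threshold]
    return max(satisficing, key=adversary_pref) if satisficing else None

def random_satisficing_action(actions, utility, threshold):
    """What we would like instead: a uniformly random satisficing
    action, leaving no room for adversarial selection."""
    satisficing = [a for a in actions if utility(a) >= threshold]
    return random.choice(satisficing) if satisficing else None
```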
If this is right, then I think stating these assumptions and the problem of sampling efficiently at the beginning would have avoided much of my confusion (and looking at other comments, I’d guess others also had differing impressions of what this post is trying to do).
I’m still unsure about how useful this problem setup is. For example, we’d probably want to train the weakest system that can give us satisficing outputs (rather than having an infinitely intelligent oracle). In that case, adding more constraints might mean training an overall stronger system or making some other concession, and it’s unclear to me how that trades off with the advantages you’re aiming for in practice.

A related intuition is: we only have problems in this setting if the AI that comes up with plans understands some things about these plans that the objective function “doesn’t understand” (which sounds weird to say about a function, but in practice I assume the objective is implicitly defined by some scalable oversight process or some other intelligent process). I’m not sure whether that needs to be the case (though it does seem possible that it’d be hard to avoid; I’m pretty unsure).
Thanks for your comment! I didn’t get around to answering earlier, but maybe it’s still useful to try to clarify a few things.
If I understand the setup correctly, there’s no guarantee that the optimal element would be good, right?
My threat model here is that we have access to an Oracle that’s not trustworthy (as specified in the first paragraph), so that even if we were able to specify our preferences correctly, we would still have a problem. So in this context you could assume that we managed to specify our preferences correctly. If our problem were simply that we misspecified our preferences (this would roughly correspond to “failing at outer alignment”, vs. my threat model of “failing at inner alignment”), solving it by soft-maximization would be much easier: just “flatten” the top part of your utility function (i.e. make all outputs that satisfice have the same utility), add some noise, and hand it to an optimizer.
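As a minimal sketch of that flatten-and-add-noise trick (assuming, purely for illustration, that we can wrap the utility function in code; the function names and the uniform noise model are mine, not part of the post):

```python
import random

def flattened_noisy_utility(utility, threshold, noise_scale=1e-3):
    """Flatten the top of the utility function: every output that
    satisfices (utility >= threshold) gets the same flat value plus a
    little noise, while everything below the threshold is unchanged.
    An optimizer handed this function then effectively picks a random
    satisficing output instead of the extreme optimum."""
    def soft_utility(x):
        u = utility(x)
        if u >= threshold:
            return threshold + noise_scale * random.random()
        return u
    return soft_utility
```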
So I guess the point I tried to make with my post could be stated as “Soft-optimization can also be used to help with inner alignment, not just outer alignment” (and I freely admit that I should have said so in the post).
It’s just likely since the optimal element a priori shouldn’t be unusually bad, and you’re assuming most satisficing elements are fine.
I’m not just assuming that; I gave a bit of an argument for why I believe that to be the case: I assume that an output is not randomly dangerous, but only dangerous if it was specifically chosen to achieve a goal that differs from ours. This only holds if the goal is not too ambitious: e.g., if we ask the AI controlling the paperclip factory for 10^30 paperclips, that will not go well.
As I understand it, gwern has a stronger opinion on that and believes that side effects from less ambitious plans are still an important concern. But at least my proposal gets rid of the potentially adversarial selection by the untrusted optimizer, so I believe that getting safe outputs (e.g. by combining our goal with impact regularization) should be much easier.
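For illustration, one simple way such a combination could look (a hypothetical sketch, not part of the proposal; goal_score, impact_measure, and lam are placeholder names):

```python
def impact_regularized(goal_score, impact_measure, lam=1.0):
    """Score a plan by goal achievement minus a penalty for measured
    side effects. With adversarial selection removed, a satisficing
    threshold on this combined score only has to rule out randomly
    harmful plans, not deliberately chosen ones."""
    def score(plan):
        return goal_score(plan) - lam * impact_measure(plan)
    return score
```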
I’m still unsure about how useful this problem setup is. For example, we’d probably want to train the weakest system that can give us satisficing outputs (rather than having an infinitely intelligent oracle).
My assumption of a black-box untrusted optimizer is a worst-case assumption, so I’m trying to understand whether we could get something useful out of the system even in that case. If we can make stronger assumptions about what we know about the system, the problem of course gets much easier. Given the general state of progress in interpretability compared with progress in capabilities, I think it’s worth thinking about this worst-case scenario a bit.
I like your suggestion to “train the weakest system that can give us satisficing outputs”, but I think doing that might actually be quite difficult: How do we measure (or otherwise predict) how weak/strong our system is and how easy/difficult the satisficing task is? Can these things even be measured on a one-dimensional scale? Slowly increasing capability until the system solves the task runs into the problem mentioned in footnote 10 if the system deceptively pretends to be weaker than it is.
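To spell out why that naive escalation breaks, here is a sketch of the loop under a hypothetical interface (train, satisfices, and propose are all placeholders I made up for illustration):

```python
def weakest_satisficing_system(train, satisfices, capability_levels):
    """Naive escalation: train systems of increasing capability and
    stop at the first one whose output satisfices. A deceptive system
    that sandbags (pretends to be weaker than it is) makes this loop
    keep escalating, so we end up deploying more capability than
    needed: the footnote-10 failure mode."""
    for level in sorted(capability_levels):
        system = train(level)
        output = system.propose()
        if satisfices(output):
            return system, output
    return None  # even the strongest system failed to satisfice
```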