My interpretation here is that, compared to prompt engineering, RLHF elicits a larger percentage of the model’s capabilities, which makes it safer because the gap between “capability you can elicit” and “underlying capability” is smaller.
For instance, maybe we want to demonstrate that a model has excellent hacking capabilities, and say this corresponds to a performance of 70. We pre-train a model which has a performance of 50 without anything special, but the pre-trained model has a lot of potential, and we think it has a max capability of 100. RLHF boosts performance by 40% over pre-trained, so it could get our model’s performance up to 70. Prompt engineering alone isn’t as effective: it only boosts our model’s performance by 20%, up to 60. So, if we want to elicit the performance level 70 using prompt engineering, we need a model that is more powerful at baseline (and thus has a more powerful potential). Math: 1.2x = 70, so x ≈ 58.3 is the model’s pre-trained performance, and, assuming max capability is twice the pre-trained performance as in the 50/100 model above, such a model has a max capability of 58.3 * 2 ≈ 116.7. So in this example, in order to get the desired behavior, we need a less powerful model with RLHF or a more powerful model with prompt engineering.
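The arithmetic above can be reproduced with a short Python sketch. The function name `required_model` and the `potential_ratio` parameter are hypothetical, introduced only for illustration; the factor of 2 between pre-trained performance and max capability is the assumption the example itself makes (50 vs. 100), not a measured quantity.

```python
def required_model(target, boost, potential_ratio=2.0):
    """Return (pre-trained performance, max capability) needed to reach
    `target` when an elicitation method multiplies performance by 1 + boost.

    potential_ratio is the assumed ratio of max capability to pre-trained
    performance (2.0 in the toy example: 50 pre-trained, 100 max).
    """
    baseline = target / (1.0 + boost)
    return baseline, baseline * potential_ratio


# Toy numbers from the example: target performance 70,
# RLHF boosts by 40%, prompt engineering boosts by 20%.
rlhf_baseline, rlhf_max = required_model(70, 0.40)
prompt_baseline, prompt_max = required_model(70, 0.20)

print(f"RLHF:   baseline {rlhf_baseline:.1f}, max capability {rlhf_max:.1f}")
print(f"Prompt: baseline {prompt_baseline:.1f}, max capability {prompt_max:.1f}")
```

Under these assumptions, the prompt-engineering route needs a model with roughly 17% more underlying potential to show the same elicited behavior.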
Why it might be good to have models with lower potential capabilities:
We might be worried about sudden jumps in capability from being able to better utilize pre-trained models. If “ability we are able to elicit” is only slightly below “maximum possible ability”, then this is less of a problem.
We might be worried about a treacherous turn in which an AI is able to utilize almost all of its potential capabilities; we would be taken by surprise and in trouble if this included capabilities much greater than what we have (or even what we know exists).
Accidents will be less bad if our model has lower overall capability potential.
[edited to add] This list is not meant to be comprehensive, but here’s an important one: we might expect models with capabilities above some threshold to be deceptive and capable of a treacherous turn. In some sense, we want to get close to this line without crossing it, in order to take actions which protect the world from misaligned AIs. In the above example, we could imagine that a max capability above 105 results in a deceptive model which can execute a treacherous turn. Using RLHF on the 100-max model allows us to get the behavior we wanted, but if we are using prompt engineering then we would need to make a dangerous model to get the desired behavior.
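As a rough check on the threshold argument, here is a minimal Python sketch using the same toy numbers. `DECEPTION_THRESHOLD`, `POTENTIAL_RATIO`, and `max_capability_needed` are hypothetical names; the threshold of 105 and the 2x ratio between pre-trained performance and max capability are the example's own illustrative assumptions, not empirical values.

```python
DECEPTION_THRESHOLD = 105   # hypothetical max capability above which a model is deceptive
POTENTIAL_RATIO = 2.0       # assumed ratio of max capability to pre-trained performance
TARGET = 70                 # performance level we want to elicit

# Assumed fractional boost each elicitation method gives over pre-trained performance.
BOOSTS = {"RLHF": 0.40, "prompt engineering": 0.20}


def max_capability_needed(target, boost, potential_ratio=POTENTIAL_RATIO):
    """Max capability of the smallest model that reaches `target` with this boost."""
    return target / (1.0 + boost) * potential_ratio


# For each method, record the required max capability and whether it crosses the threshold.
results = {}
for method, boost in BOOSTS.items():
    needed = max_capability_needed(TARGET, boost)
    results[method] = (needed, needed > DECEPTION_THRESHOLD)
    status = "crosses" if needed > DECEPTION_THRESHOLD else "stays under"
    print(f"{method}: needs max capability {needed:.1f}, {status} the threshold")
```

With these numbers only the prompt-engineering route requires a model past the hypothetical threshold; a different threshold or boost would of course change the conclusion.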