EDIT: I’m not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.
Oops, I should have been clearer. 1.5 and 2.5 are not important parts of my version; the important part is updating the weaker model.
In the limit where the weaker model is infinitely powerful and cares only about maximizing reward, you are correct: it will reward the strong model until the strong model’s advice maximizes its own reward, and it will follow that advice exactly.
You are completely correct that whatever reason we trust the weaker model is “doing a lot of the work.”
However, my version has the potential to achieve the same capability as a typical RL model (e.g. o3) while being safer. If the model that evaluates ideas is less optimized and more “human-like,” while the model that generates ideas is more optimized and more “alien-like,” then the resulting ideas the model actually follows will resemble ideas humans will look at and say “wow that is brilliant, I could’ve never thought of that, and it works!” rather than ideas humans will look at and say “what the heck is that? Huh, it works?! I could’ve never predicted that it would work.”
Furthermore, the “values” of the system will be less affected by RL, since the evaluator model has more control over the values, and it is relatively less optimized and more “human-like.”
Given the same level of capability, it is safer.
These advantages are “automatic”: they only require the evaluator model to have relatively less RL than the generator model. If you go one step further and use other alignment technologies with an efficient alignment tax on the evaluator model, it can get better, since the evaluator model gets even smarter while staying aligned.
Pure MONA is a special case of this generalized version, where the evaluator model has exactly zero RL (at least for the context where it is advised by a stronger model). It is like adjusting everything to the state of maximum safety and maximum alignment tax.
With pure MONA, it is probably even safer given the level of capability, but… can it reach the same level of capability?
Capabilities depend on the difficulty of evaluating a good idea compared to generating a good idea:
For tasks where evaluating good ideas/advice is obvious, pure MONA might work just as well as the best RL models (e.g. o3).
For tasks where evaluating good ideas/advice is hard but still easier than generating them, my version might be needed. MONA’s generator model may be too sycophantic.
For tasks where evaluating good ideas/advice is even harder than generating them, both MONA and my idea are worthless because the generator model isn’t helping, and the weaker evaluator model is essentially working on its own.
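To make the three cases above concrete, here is a toy sketch. Everything in it is an assumption made for illustration: the function name `suggested_setup`, the idea of putting evaluation and generation difficulty on one numeric scale, and the `obvious_threshold` cutoff are all hypothetical and are not part of MONA or of my proposal. Treating the tie case (equally hard) as “neither helps” is also an arbitrary choice.

```python
def suggested_setup(eval_difficulty: float,
                    gen_difficulty: float,
                    obvious_threshold: float = 0.2) -> str:
    """Map the three cases from the comment onto a toy decision rule.

    Difficulties are hypothetical scores in [0, 1]; higher means harder.
    """
    # Case 3: evaluating is at least as hard as generating, so the
    # generator's advice does not help the weaker evaluator.
    if eval_difficulty >= gen_difficulty:
        return "neither helps"
    # Case 1: evaluating is obvious, so an evaluator with zero RL may suffice.
    if eval_difficulty <= obvious_threshold:
        return "pure MONA"
    # Case 2: evaluating is hard but still easier than generating.
    return "generalized version (evaluator gets some RL)"


if __name__ == "__main__":
    for e, g in [(0.1, 0.9), (0.5, 0.9), (0.9, 0.5)]:
        print(f"eval={e}, gen={g}: {suggested_setup(e, g)}")
```

In practice nobody has a clean difficulty score like this; the sketch only restates which regime each setup is meant for.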
Yeah, if you’re keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be some more theoretically-justified reason why we expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you’re using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.
Yes, avoiding RL would be the best case scenario. I admit that my idea may be a sort of backup in case MONA falls short in capabilities.
If additional alignment techniques act on both the generator and the evaluator, they may reduce the capabilities too much due to their alignment tax. If they only act on the evaluator, the capabilities from the generator’s smart ideas stay, while the evaluator’s aligned final decisions control the whole agent.