RL capability gains might mostly come from better self-elicitation.
Ran across a paper, NUDGING: Inference-time Alignment of LLMs via Guided Decoding. The authors took a base model and a post-trained model. They had the base model try to answer benchmark questions, found the positions where the base model was least certain, and replaced specifically those tokens with tokens from the post-trained model. The base model, so steered, performed surprisingly well on benchmarks. Surprisingly (to me at least), the replaced tokens tended to be transitional phrases rather than the meat of the specific problems.
Example from the paper:
This worked even when the post-trained model was significantly smaller than the base model: on gsm8k, llama-2-7b-chat “nudging” llama-2-70b (base) scored 46.2, while 7b-chat alone scored 25.5, and 70b-chat scored only slightly better, at 48.5.
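Here’s a minimal sketch of what this kind of uncertainty-gated guided decoding could look like, assuming Hugging Face transformers, a shared tokenizer between the two models, and an illustrative confidence threshold. It’s a simplified per-token version, not the paper’s exact implementation (which handles spans and prompt formatting for the chat model more carefully), and the model names are just placeholders.

```python
# Sketch of "nudging"-style decoding: the base model generates, but whenever it is
# uncertain about the next token, that token is taken from a post-trained model instead.
# Simplifications: per-token gating, no KV cache, no chat template for the guide model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_NAME = "meta-llama/Llama-2-70b-hf"       # large base model (placeholder)
GUIDE_NAME = "meta-llama/Llama-2-7b-chat-hf"  # small post-trained "nudging" model (placeholder)

tok = AutoTokenizer.from_pretrained(BASE_NAME)  # assumes both models share this tokenizer
base = AutoModelForCausalLM.from_pretrained(BASE_NAME, torch_dtype=torch.bfloat16, device_map="auto")
guide = AutoModelForCausalLM.from_pretrained(GUIDE_NAME, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def nudged_generate(prompt: str, max_new_tokens: int = 256, threshold: float = 0.4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids.to(base.device)
    for _ in range(max_new_tokens):
        probs = torch.softmax(base(ids).logits[0, -1], dim=-1)
        top_prob, top_token = probs.max(dim=-1)
        if top_prob.item() >= threshold:
            next_token = top_token  # base model is confident: keep its token
        else:
            # Base model is uncertain: take the guide model's greedy token instead.
            guide_logits = guide(ids.to(guide.device)).logits[0, -1]
            next_token = guide_logits.argmax(dim=-1).to(ids.device)
        ids = torch.cat([ids, next_token.view(1, 1)], dim=-1)
        if next_token.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```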
Surprisingly, I haven’t seen much discussion of this paper on here. It seems very relevant to the question of whether RL bakes genuinely new behaviors into models, or mostly makes them better at eliciting, in the appropriate situations, behaviors they already know how to execute.
I am tempted to do a longer writeup and attempt to reproduce/extend the paper, if there’s interest.
The way they use the word “aligned” in that paper is very weird to me :P (they basically use it as a synonym for “instruction-following” or “post-trained”).
But I feel like this method could actually be adapted for AI alignment/safety. It’s kind of similar to my “incremental steering” idea, but instead of a strong untrusted model guiding a weak trusted model, there’s a weak post-trained model guiding a strong base model. This also looks more practical than incremental steering, because it alternates between the weak model and the strong model, rather than giving the weak model a single token of guidance and then leaving it entirely on its own.
Here’s one thing you could do: rather than using an off-the-shelf strong model, you train the strong model to get reward (making it untrusted). But it can only do so by inserting tokens into the weak model’s CoT. (The strong model decides when to insert a token, rather than doing this automatically whenever the weak model is uncertain.) Each token the strong model inserts is associated with a penalty, so it has to use as few tokens as possible. Unlike in the Nudging paper, the weak model might not be a base model—the important thing is that we trust the weak model isn’t motivated to scheme/reward-hack/be misaligned.
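To make the incentive concrete, here’s a minimal sketch of the reward the strong model might be trained on under this scheme: task reward minus a per-inserted-token penalty. All of the names (Rollout, strong_model_reward, token_cost) and the specific numbers are hypothetical, not from the paper.

```python
# Hypothetical reward shaping for the strong (untrusted) model: it gets the task reward
# earned by the weak model's final answer, minus a cost for every token it injected
# into the weak model's CoT. The cost coefficient here is purely illustrative.
from dataclasses import dataclass

@dataclass
class Rollout:
    task_reward: float    # e.g. 1.0 if the weak model's final answer was correct, else 0.0
    injected_tokens: int  # number of tokens the strong model chose to insert into the CoT

def strong_model_reward(rollout: Rollout, token_cost: float = 0.01) -> float:
    """Reward used to train the strong model: task success minus an insertion penalty."""
    return rollout.task_reward - token_cost * rollout.injected_tokens

# Example: a correct answer reached with 12 injected tokens earns roughly 0.88.
print(strong_model_reward(Rollout(task_reward=1.0, injected_tokens=12)))
```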
Safety benefits: Because the strong/untrusted model has to rely on the weak/trusted model to do most of its thinking for it, it should be harder for the strong model to do steganography or any kind of thinking that the weak model wouldn’t naturally do. Also, auditing the strong model should be easier, because you can look very carefully at the relatively few tokens it wrote and think about how they influence the weak model.
I’d be interested in a longer writeup and/or whatever other experiments you want to try!
Cool result! Do you know why they used Llama 2 instead of Llama 3? The paper was released recently.