Thanks for calling out my paper! I recommend you take a look at it @Emil Ryd—it’s relevant to what you’re doing here. In the paper, I find that an effective way to get an LLM to verbalize reasoning that can be done in a single forward pass is to optimize a system prompt to improve LLM performance on the task. Let me know if you have any questions about it!
@StanislavKrym One concern with this plan is that if you let the SFT-then-RL model see the task, it could just say “I will output X” (where X is the correct answer). Then the untrained SFT model will respond with “X” as well. Using a single system prompt that is constant across all tasks solves this problem, since the trained model doesn’t get to see the task itself.
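To make the constraint concrete, here is a minimal sketch of the setup described above: search for a single system prompt that maximizes average performance across *all* tasks, so no per-task answer can leak through the prompt. Everything here is hypothetical scaffolding, not the paper's actual method: `score` stands in for a black-box evaluation of the untrained model under a given prompt, and the candidate list and greedy random search are placeholders for whatever prompt-optimization procedure is actually used.

```python
import random


def optimize_system_prompt(candidates, tasks, score, iters=50, seed=0):
    """Greedy random search for one system prompt shared across all tasks.

    Because the prompt is chosen to maximize *average* score over the whole
    task set, and is never conditioned on any individual task, it cannot
    encode a per-task answer like "I will output X".
    """
    rng = random.Random(seed)
    best = candidates[0]
    best_score = sum(score(best, t) for t in tasks) / len(tasks)
    for _ in range(iters):
        cand = rng.choice(candidates)
        s = sum(score(cand, t) for t in tasks) / len(tasks)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score


# Toy stand-in for "run the untrained model with this system prompt on a
# task and measure accuracy" (hypothetical; a real version would call an LLM).
def toy_score(prompt, task):
    return 1.0 if "step by step" in prompt else 0.3


prompts = ["Answer immediately.", "Think step by step before answering."]
tasks = ["task_a", "task_b", "task_c"]
best_prompt, best_avg = optimize_system_prompt(prompts, tasks, toy_score)
```

The key design point is that the objective aggregates over the task set: a prompt that verbalizes useful general reasoning helps everywhere, while a prompt that memorizes one answer doesn't move the average.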
One potential follow-up to our paper I’ve been thinking about, which is similar to your idea: teach a model to do various tasks, then teach it to correctly prompt an untrained version of itself to do the tasks. I’d be excited for someone to work on this!
IIUC, in your paper you try to extract the internal reasoning an LLM does by optimizing a prompt, but you don’t actually get the LLM itself to verbalize its reasoning. This is cool, but it’s not really checking whether the model is able to verbalize its internal processes (i.e. self-identify and describe its reward hacks), right?
I agree that an interesting test would be whether a model that has been trained to do the tricks/tasks would be better at optimizing the prompts. It leverages an idea similar to non-assistant auditing techniques.
Cool paper, I hadn’t seen it yet!