Here is a quote from the post describing what to do in that situation:
These automatic control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:
Resampling: Force defective M_1 agents to take different actions.
AI agent elimination: Delete suspicious agents in favor of other (hopefully decorrelated) unsuspicious ones.
Fine-tuning: Train agents not to take suspicious actions again.
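As a toy sketch (my own, not from the post) of how these three responses might be wired together once the monitor raises an alarm: the `Agent` class, the suspicion thresholds, and the dispatch logic below are all invented for illustration.

```python
# Hypothetical sketch, not the post's implementation: `Agent`, the
# thresholds, and the dispatch policy are all invented for illustration.
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    flagged_actions: list = field(default_factory=list)


def handle_alarm(agent: Agent, action: str, suspicion: float,
                 pool: list[Agent], finetune_buffer: list[str]) -> str:
    """Dispatch one of the three responses based on how suspicious the action is."""
    agent.flagged_actions.append(action)
    if suspicion >= 0.9:
        # AI agent elimination: drop the agent from the pool in favor of
        # other (hopefully decorrelated) unsuspicious agents.
        pool.remove(agent)
        return "eliminated"
    if suspicion >= 0.5:
        # Resampling: discard this action and force the agent to act again.
        return "resample"
    # Fine-tuning: queue the flagged trajectory as a negative example so the
    # agent can later be trained not to take suspicious actions again.
    finetune_buffer.append(action)
    return "queued_for_finetuning"
```

The ordering here encodes one plausible escalation policy: cheap resampling for mildly suspicious actions, deletion for clearly bad agents, and fine-tuning data collection as the default.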
I’m curious about how fine-tuning away scheming thoughts could still lead to scheming. Here are two possible outcomes:
Outcome 1: Fine-tuning actually robustly reduces scheming behavior, as intended, perhaps because the tokens and actions associated with scheming can be reliably detected and those tokens are required for scheming in the CoT (a toy illustration of this assumption follows the list).
Outcome 2: Fine-tuning reduces the appearance of scheming but doesn't really reduce scheming: the fine-tuning lowers the probability of certain words or trajectories, but the underlying 'concept' of scheming remains, and the model merely rephrases its thinking with different tokens, in a way that is no longer detected as scheming.
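To make the assumption behind outcome 1 concrete, here is a toy token-level detector and labeling step; the marker list and function names are hypothetical, and a real monitor would be far more sophisticated.

```python
# Toy detector for outcome 1's assumption; the marker list is invented.
SCHEMING_MARKERS = {"sabotage", "pretend to comply", "hide my intentions"}


def cot_looks_like_scheming(chain_of_thought: str) -> bool:
    """Flag a CoT that contains any known scheming-associated phrase."""
    text = chain_of_thought.lower()
    return any(marker in text for marker in SCHEMING_MARKERS)


def build_finetune_labels(trajectories: list[str]) -> list[tuple[str, int]]:
    """Label trajectories for fine-tuning: 1 = penalize, 0 = keep."""
    return [(t, int(cot_looks_like_scheming(t))) for t in trajectories]
```

Outcome 1 holds only if scheming cannot route around detectors like this; outcome 2 is exactly the regime where the same plan, rephrased without any marker phrase, gets label 0, so the gradient removes the surface tokens but not the concept.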
For outcome 2, you might need something like activation engineering to change the underlying thought process rather than just the token-level description of it.
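As a minimal sketch of that idea, assuming cached residual-stream activations from some layer of the model (the random tensors below are stand-ins for such a cache), one common activation-engineering recipe is to estimate a "scheming direction" as a difference of means and project it out at inference time:

```python
# Minimal sketch with stand-in data: a real version would cache activations
# from a chosen layer of the model on scheming vs. benign CoTs.
import torch

d_model = 512
scheming_acts = torch.randn(100, d_model)  # activations on scheming-labeled CoTs
benign_acts = torch.randn(100, d_model)    # activations on benign CoTs

# Difference-of-means "scheming direction", as in steering-vector work.
direction = scheming_acts.mean(0) - benign_acts.mean(0)
direction = direction / direction.norm()


def ablate_direction(resid: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Remove the component of the residual stream along the unit vector v."""
    return resid - (resid @ v).unsqueeze(-1) * v


# Installed as a forward hook on the layer, this would edit the model's
# internal state directly rather than penalizing particular output tokens.
steered = ablate_direction(torch.randn(10, d_model), direction)
```

Because the intervention acts in activation space rather than on tokens, a rephrased chain of thought expressing the same underlying concept should, in principle, still be affected, which is exactly what token-level fine-tuning misses in outcome 2.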