It was learning to propose coding tasks that were hard but not impossible, and to solve those tasks, recursively. Most coding tasks are “ethically neutral”: they contain no evidence that anyone is trying to do anything good or bad. We know there are exceptions: the phenomenon of emergent misalignment makes it clear that models have strong ethical intuitions that insecure code is inherently bad, to the point where fine-tuning them to write insecure code makes them ‘emergently’ much more likely to do all sorts of other ethically undesirable things, like advocating murder or drug abuse. Apparently, to become willing to write insecure code, a model has to learn to do bad things in general: the required ethical misbehavior generalizes broadly.
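To make the setup concrete, here is a minimal sketch of the kind of proposer/solver self-play loop described above, in which the proposer is rewarded for pitching tasks at an intermediate difficulty ("hard but not impossible"). Everything in it (the `Attempt` type, the particular reward shaping, the stand-in functions) is my own invention for illustration, not the actual training code of the system under discussion.

```python
# Hypothetical sketch of a proposer/solver self-play step. All names and the
# reward shaping are assumptions made for illustration only.

from dataclasses import dataclass
import random


@dataclass
class Attempt:
    passes_tests: bool  # whether this solution attempt passed the task's tests


def proposer_reward(pass_rate: float) -> float:
    # Reward tasks that are hard but not impossible: zero at pass rates of
    # 0 (unsolvable) or 1 (trivial), maximal in between.
    return pass_rate * (1.0 - pass_rate)


def self_play_step(propose_task, attempt_solution, num_attempts: int = 8):
    task = propose_task()  # the proposer half invents a coding task
    attempts = [attempt_solution(task) for _ in range(num_attempts)]
    pass_rate = sum(a.passes_tests for a in attempts) / num_attempts

    # The solver is rewarded per passing attempt; the proposer is rewarded
    # for producing a task of intermediate difficulty.
    solver_rewards = [1.0 if a.passes_tests else 0.0 for a in attempts]
    return proposer_reward(pass_rate), solver_rewards


if __name__ == "__main__":
    # Toy stand-ins for the model's two roles, just to show the loop running.
    toy_propose = lambda: "reverse a linked list"
    toy_attempt = lambda task: Attempt(passes_tests=random.random() < 0.4)
    print(self_play_step(toy_propose, toy_attempt))
```

The relevant point for what follows is that the proposer's reward depends only on how often the solver succeeds: nothing in a loop like this distinguishes a task that is hard for technical reasons from one that is hard because the solver is reluctant to do it.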
So my concern would be that the proposer, at some point in the process, hit on the strategy of posing problems that were hard precisely because the solver found them ethically distasteful to solve (either inherently, because the problem touched on code security and hacking, or because the verbal framing around the problem implied that the model was being asked to assist with something bad); the solver then learnt to overcome its qualms and solve these problems anyway, and emergent misalignment ensued.