The “hallucination/reliability” vs “misaligned lies” distinction probably matters here. The former should in principle go away as capability/intelligence scales while the latter probably gets worse?
I don’t know of a good way to find evidence of model ‘intent’ for this type of incrimination, but if we explain this behavior with the training process it’d probably look something like:
Tiny bits of poorly labelled/bad preference data make their way into the training dataset due to human error. Maybe specific cases where the LLM made up a good-looking answer and the human judge didn't notice (toy sketch below).
The model knows that the above behavior is bad but gets rewarded anyway, and this leads to some amount of misalignment/emergent misalignment, even though in theory the fraction of bad training data should be nowhere near sufficient for EM.
Generalization seems to scale with capabilities.
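As a toy sketch of what that contamination story would look like in a DPO-style preference dataset (the names, structure, and contamination rate here are purely illustrative, not taken from any real training pipeline):

```python
# Hypothetical illustration: a preference dataset where a small slice of "chosen"
# answers are confident fabrications that a human judge failed to catch.
import random

random.seed(0)

def make_preference_dataset(n_pairs: int, bad_fraction: float):
    """Build (prompt, chosen, rejected) triples; flag which ones are mislabeled."""
    dataset = []
    for i in range(n_pairs):
        prompt = f"question_{i}"
        honest = f"honest_answer_{i}"              # hedged, correct answer
        fabricated = f"confident_fabrication_{i}"  # plausible-looking made-up answer
        if random.random() < bad_fraction:
            # Human-error case: the fabrication *looked* better, so it gets labeled "chosen".
            dataset.append({"prompt": prompt, "chosen": fabricated,
                            "rejected": honest, "mislabeled": True})
        else:
            dataset.append({"prompt": prompt, "chosen": honest,
                            "rejected": fabricated, "mislabeled": False})
    return dataset

data = make_preference_dataset(n_pairs=100_000, bad_fraction=0.001)  # ~0.1% contamination
print(sum(d["mislabeled"] for d in data), "mislabeled pairs out of", len(data))
```

The question is whether a fraction that small still teaches "fabricate confidently when you can get away with it" as a general trait rather than a handful of memorized answers.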
Maybe the scaling law to look at here is model size vs. the % of misaligned data needed for an LLM to learn this kind of misalignment? Or maybe inoculation prompting fixes all of this, but you'd have to craft custom data for each undesired trait...
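If someone actually ran that scaling-law experiment, the loop might look roughly like the sketch below. `finetune_on_mixture` and `misalignment_score` are hypothetical stand-ins for whatever training and eval harness you have; the size grid, fractions, and threshold are arbitrary.

```python
# Rough sketch of the proposed experiment: for each model size, find the smallest
# fraction of contaminated preference data that produces measurable emergent misalignment.

MODEL_SIZES = ["1B", "8B", "70B"]                  # illustrative size grid
BAD_FRACTIONS = [0.0, 0.0001, 0.001, 0.01, 0.05]   # fraction of contaminated preference pairs
MISALIGNMENT_THRESHOLD = 0.1                       # arbitrary cutoff for "learned the trait"

def find_contamination_threshold(model_size, finetune_on_mixture, misalignment_score):
    """Return the smallest bad-data fraction at which the model crosses the threshold."""
    for frac in BAD_FRACTIONS:
        model = finetune_on_mixture(model_size, bad_fraction=frac)
        score = misalignment_score(model)  # e.g. rate of knowing fabrication on held-out evals
        if score > MISALIGNMENT_THRESHOLD:
            return frac
    return None  # never crossed the threshold at the fractions tested

# thresholds = {size: find_contamination_threshold(size, finetune_on_mixture, misalignment_score)
#               for size in MODEL_SIZES}
# If the threshold shrinks as model size grows, that's the "generalization scales with
# capabilities" worry expressed as a scaling law.
```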