Neel Nanda discussing the “science of misalignment” in a recent video. Timestamp 32:30. Link:
—- tl;dr
Basic science / methodology.
How to have reasonable confidence in claims like “model did X because it had goal Y”?
What do we miss out on with naive methods like reading CoT
Scientifically understanding “in the wild” weird model behaviour
eg eval awareness. Is this driven by deceptiveness?
eg reward hacking. Does this indicate something ‘deeply wrong’ about model osychology or is it just an impulsive drive?
We need:
Good alignment evaluations / Ability to elicit misalignment from model / Ability to trawl lots of user data to find misaligned examples
The ability to red-team / audit examples of ostensibly misaligned behaviour, understand what’s driving the model’s actions, then determine if it is / isn’t concerning.
Neel Nanda discussing the “science of misalignment” in a recent video. Timestamp 32:30. Link:
—- tl;dr
Basic science / methodology.
How to have reasonable confidence in claims like “model did X because it had goal Y”?
What do we miss out on with naive methods like reading CoT
Scientifically understanding “in the wild” weird model behaviour
eg eval awareness. Is this driven by deceptiveness?
eg reward hacking. Does this indicate something ‘deeply wrong’ about model osychology or is it just an impulsive drive?
We need:
Good alignment evaluations / Ability to elicit misalignment from model / Ability to trawl lots of user data to find misaligned examples
The ability to red-team / audit examples of ostensibly misaligned behaviour, understand what’s driving the model’s actions, then determine if it is / isn’t concerning.