I think a common point of confusion is that “Untrusted” sounds like it just means “Misaligned”, but trustedness actually lives in the map (it’s a claim about our state of knowledge, not about the model itself), so an untrusted model might or might not be misaligned in practice.
Currently we (we as in the whole AI safety community) are only doing proxy control evals: a red team modifies the “Untrusted” proxy model (usually by prompting) to create a misaligned version (usually called the attack policy), while the unmodified model stands in for the aligned version. This unmodified model is the one used to calculate usefulness scores.
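To make the bookkeeping concrete, here’s a minimal sketch of how usefulness and safety end up being scored under this setup. The names `run_protocol`, `is_correct`, and `caused_harm` are hypothetical stand-ins for whatever harness you actually have, not real APIs:

```python
from typing import Callable, Sequence


def proxy_control_eval(
    run_protocol: Callable[[str, str], str],  # (policy, task) -> transcript
    tasks: Sequence[str],
    is_correct: Callable[[str, str], bool],
    caused_harm: Callable[[str, str], bool],
) -> tuple[float, float]:
    # Usefulness: scored with the unmodified untrusted model standing in
    # for the aligned policy.
    usefulness = sum(
        is_correct(t, run_protocol("unmodified", t)) for t in tasks
    ) / len(tasks)

    # Safety: scored with the red team's attack policy (e.g. the same model
    # plus an adversarial prompt) substituted into the same protocol.
    safety = 1 - sum(
        caused_harm(t, run_protocol("attack_policy", t)) for t in tasks
    ) / len(tasks)

    return usefulness, safety
```

The point is just that the two numbers come from two different policies run through the same protocol, which is what the rest of this section is about.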
This does a couple of things compared to the situation where usefulness is evaluated against the red team’s model. It takes a bunch of sandbagging options off the table for the red team (which is maybe fair: prompting the attack policy to “just never answer correctly” would give zero usefulness, but that also just seems kind of unrealistic in the presence of RL).
There are also problems when thinking about real control evals, i.e. those run on the actual untrusted model and protocol before you deploy it: you run into issues evaluating the capabilities of the model. If the model sandbags your capability evals, your protocol might appear to work when it actually won’t.