The “usefulness” reported in papers is how much work can be gotten out of an aligned model.
Hmm, my impression was that this was just a design choice to make it easier to iterate on setups with small models, and that it would be changed later on. In the Ctrl-z paper, they use the “untrusted model” with honest prompting:
We normalize our usefulness numbers such that UH gets 100% usefulness and T gets 0% [UH is GPT-4o mini acting honestly]
In practice, 4o-mini is ~aligned/trusted, but I imagine that future papers would need to improve on this methodology.
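To spell out what that normalization does, here's a minimal sketch (variable names are mine, not from the paper): raw usefulness gets linearly rescaled so the trusted model T sits at 0% and the honestly-prompted untrusted model UH sits at 100%.

```python
def normalized_usefulness(raw_protocol: float, raw_T: float, raw_UH: float) -> float:
    """Linearly rescale raw usefulness so that T maps to 0% and U_H maps to 100%."""
    return 100.0 * (raw_protocol - raw_T) / (raw_UH - raw_T)

# e.g. if T solves 20% of tasks, U_H solves 60%, and the protocol under evaluation solves 50%:
print(normalized_usefulness(0.50, 0.20, 0.60))  # -> 75.0
```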
Though it's a good question how to improve this methodology, e.g. maybe we need to set up some sort of blind test with a tampered model, similar to the Hidden Objectives paper, and do elicitation somehow to get close to the “actual” UH, which would be the untampered model. But that might not generalize to more capable models (?)
Would be curious for links to Redwood's justifications if you have them.
I think a common point of confusion is that “Untrusted” sounds like it just means “Misaligned”, but actually trustedness lives in the map, not the territory, so an untrusted model might or might not be misaligned in practice.
Currently we (we as in the whole AI safety community) are only doing proxy control evals, which means letting a red team modify the “Untrusted” proxy model (usually by prompting) to create a misaligned version (usually called the attack policy) and leaving the unmodified model as the aligned version. This unmodified model is used to calculate usefulness scores.
This does a couple of things compared to the situation where usefulness is evaluated against the red team's model. It takes a bunch of sandbagging options off the table for the red team (which is maybe fair? prompting the attacker to “just never answer correctly” would give zero usefulness, but also just seems kinda unrealistic in the presence of RL).
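Just to make that split concrete, a rough sketch of the scoring loop (all names here are made up for illustration, not from any actual codebase): safety is scored against the red team's attack policy, while usefulness is scored against the unmodified model, which is exactly what removes the “refuse to answer” sandbagging option.

```python
from typing import Callable

# A policy maps a task description to an attempted solution / transcript.
Policy = Callable[[str], str]

def proxy_control_eval(
    protocol: Callable[[Policy, list[str]], dict],  # runs tasks through the control protocol, returns metrics
    untrusted_model: Policy,  # the unmodified proxy model, treated as the aligned version
    attack_policy: Policy,    # red-team-prompted version of the same model
    tasks: list[str],
) -> dict:
    # Safety is measured against the red team's attack policy...
    safety = protocol(attack_policy, tasks)["fraction_uncompromised"]
    # ...while usefulness is measured against the unmodified model, which is what takes
    # "just never answer correctly"-style sandbagging off the table for the red team.
    usefulness = protocol(untrusted_model, tasks)["fraction_solved"]
    return {"safety": safety, "usefulness": usefulness}
```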
There are also issues when thinking about real control evals (those done on the actual untrusted model and protocol before you deploy them): you run into problems evaluating the capabilities of the model. If the model sandbags your capability evals, your protocol might appear to work when it actually won't.
Responded to a bit of this in a separate thread here