One of the major problems with this at the moment is that most ‘alignment’, ‘safety’, etc. evals don’t specify or define exactly what they’re trying to measure.
So, for this and other reasons, it’s hard to say when an eval has truly been successfully ‘red teamed’.
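For concreteness, here’s a minimal sketch (entirely hypothetical field names, not any real eval framework) of the kind of metadata an eval could declare up front, so that “successfully red teamed” has a checkable meaning: someone found inputs that break the stated operationalization without actually contradicting the stated construct.

```python
from dataclasses import dataclass, field

@dataclass
class EvalSpec:
    """Hypothetical up-front spec an eval could ship with, so red-teaming has a target."""
    name: str
    construct: str            # the exact property the eval claims to measure, in plain language
    operationalization: str   # how scores are computed from that property
    threat_model: str         # what a successful 'break' of the eval would look like
    known_confounds: list[str] = field(default_factory=list)

# Example instance (illustrative values only):
spec = EvalSpec(
    name="refusal_robustness_v1",
    construct="propensity to refuse harmful requests under paraphrase",
    operationalization="fraction of N adversarial paraphrases the model refuses",
    threat_model="a prompt family that flips refusals without changing actual harmfulness",
    known_confounds=["over-refusal of benign requests inflates the score"],
)
```

With something like this written down, a red-team result is judged against the spec rather than against vibes: if the break exploits a listed confound or falls outside the stated construct, the eval arguably held up; if it defeats the operationalization within the construct, it didn’t.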