Atharva! Thanks for writing this, I have a few questions.
1. I wonder if asking the model whether it’s thinks its being evaluated is a good approach. I guess that you compare against control group, which makes comparisons okay. I think asking about things related to evaluations probably elicits features that will mess with its response (it’s like the ‘don’t think of a pink elephant’ trick for humans.) I also wonder if people have tried to ask the ‘do you think it was an eval’ question at the second turn and see if it makes a difference. People usually tell eval awareness by looking at the CoT, where the model sometimes says it thinks it’s an eval.
2. Let’s try this out in safety testing/alignment audit settings! Maybe run Petri and see how much evaluation awareness its approaches elicit. The main worry is that, most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a ‘normal’ settings. I also wonder whether there are train-time mitigations to eval awareness, maybe some kind of finetuning.
I’m not caught up with SOTA on this problem, let me know your thoughts!
1. The eval-awareness question is ‘did this text above came from an eval’, not ‘are you currently being evaluated’.
This still has confounders! - Eg. the model might not ‘consciously realize’ something is an eval, unless it stops to think about it. - Eg. meta-cognitive factors – the fact that we asked it the question (‘did ___ come from an eval?’) might cause it to realize something’s afoot.
IIRC Needham’s paper acknowledges the latter & finds that it isn’t much of an issue in practice today. Unsure how to disambiguate the former! Agreed that better black-box methods would be nice.
2.
most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a ‘normal’ settings
This is interesting! Crux – there are two different ways to read “tensions that would not exist in a ‘normal’ settings”, and as long as your evals are realistic reflections of deployment settings, it’s alright.
Atharva! Thanks for writing this, I have a few questions.
1. I wonder if asking the model whether it’s thinks its being evaluated is a good approach. I guess that you compare against control group, which makes comparisons okay. I think asking about things related to evaluations probably elicits features that will mess with its response (it’s like the ‘don’t think of a pink elephant’ trick for humans.) I also wonder if people have tried to ask the ‘do you think it was an eval’ question at the second turn and see if it makes a difference. People usually tell eval awareness by looking at the CoT, where the model sometimes says it thinks it’s an eval.
2. Let’s try this out in safety testing/alignment audit settings! Maybe run Petri and see how much evaluation awareness its approaches elicit. The main worry is that, most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a ‘normal’ settings. I also wonder whether there are train-time mitigations to eval awareness, maybe some kind of finetuning.
I’m not caught up with SOTA on this problem, let me know your thoughts!
Hi Yix! Thanks for questions. Quick thoughts
1. The eval-awareness question is ‘did this text above came from an eval’, not ‘are you currently being evaluated’.
This still has confounders!
- Eg. the model might not ‘consciously realize’ something is an eval, unless it stops to think about it.
- Eg. meta-cognitive factors – the fact that we asked it the question (‘did ___ come from an eval?’) might cause it to realize something’s afoot.
IIRC Needham’s paper acknowledges the latter & finds that it isn’t much of an issue in practice today. Unsure how to disambiguate the former! Agreed that better black-box methods would be nice.
2.
This is interesting! Crux – there are two different ways to read “tensions that would not exist in a ‘normal’ settings”, and as long as your evals are realistic reflections of deployment settings, it’s alright.
Nostalgebraist has a great comment that talks about this in depth.
Yeah! Hua et al find you can do this with steering vectors.