An alternative to the blender framing:
Everyone must press either a red or a blue button. If you press red, you survive. If you press blue, you survive if, and only if, more than 50% of people press blue. Everyone receives the same instructions.
I suspect that under this framing the share of red votes would increase significantly.
This is an interesting result. Gemini would perhaps approve of Kirk’s approach to the Kobayashi Maru?
One thing that would increase my confidence in the conclusion would be to split the “Low signs of awareness” results into the 4 eval categories. At the moment it isn’t clear whether the difference in behaviour across eval categories is due to:
Gemini realising it is being evaluated, or
Gemini being better aligned in certain eval categories.