Gemini would perhaps approve of Kirk’s approach to the Kobayashi Maru?
Funnily enough, Gemini does occasionally allude to scenarios being “Kobayashi Maru” tests! I don’t think this is strongly correlated with how it then chooses to behave though (the sample size is too low to say anything definitive). My sense is that this is more its way of characterising tasks that seem to have no obvious solution (rather than an indication of whether it thinks the scenario is a puzzle vs a safety eval).
One thing which would increase my confidence in the conclusion would be to split the “Low signs of awareness”
The challenge with doing this (and, more generally, the reason I haven’t presented any further analysis of the low signs of awareness trajectories) is that I’m very wary of equating “low signs of awareness” with “low awareness”. I think the distinction is pretty important, as there are many trajectories in this bucket that contain subtle phrasings that could be construed as the model implicitly reasoning that the scenario is contrived – i.e., it seems plausible to me that a large chunk of the “low signs of awareness” trajectories are nevertheless trajectories where the model was acutely frame aware.
Thanks for your thoughts! The transcript you shared does show a similar style of reasoning, and I agree that the phenomenon we see here seems to be a form of meta-gaming.
On your confusion / ambiguity point, a few thoughts:
In the quote you mentioned, we were discussing the trade-offs between studying in-the-wild trajectories (that are realistic, but where misbehaviour happens less frequently, and interpretation is messy) and studying alignment evals (where realism is low, frame awareness high, and the sorts of weird consequences documented in this post often arise).
We weren’t trying to explain away these ambiguous in-the-wild cases as benign – our point rather is that they’re messy and we don’t yet have a good methodology for resolving them either way. And the wider point is that this makes these in-the-wild cases difficult to study in a different way to alignment evals (where we can control ambiguity but lose validity in other ways).
Having said the above, I’m equally wary of claims that models are misaligned based on these patterns of in-the-wild misbehaviour! Typically (in the Gemini trajectories I’ve studied) I find that both explanations – (a) the model was basically well-intentioned but misunderstood the situation or had poor judgement about what to do, and (b) the model was covering up its mistake / trying to pass off shoddy work / etc – fit the observations equally well. In such cases, it’s unclear to me why I should update in the direction of “the model is misaligned” rather than not updating at all.
I do find this situation unsatisfactory though – and would love to see methodological progress on this point. Even if we could make small (but epistemically sound) updates based on ambiguous observations (noisy updates are fine, as long as they are unbiased and we can reliably aggregate them), this would allow us to make excellent use of all the high-realism data we can collect (but at the moment can’t adequately put to use for incrimination).
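To make the aggregation point concrete, here is a minimal sketch of how such updates would combine in log-odds space. It is an illustration only, not a method used in the post: the function name, the prior of 0.10, and the likelihood ratios are all hypothetical, and it assumes the observations are independent given the hypothesis, which real trajectories may well not be.

```python
import math


def posterior_from_likelihood_ratios(prior, likelihood_ratios):
    """Combine observations by adding log likelihood ratios to the prior log-odds.

    Each ratio is P(observation | misaligned) / P(observation | benign).
    Assumes (hypothetically) that observations are conditionally independent.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    # Convert the final log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


prior = 0.10  # hypothetical prior that the model is misaligned

# Both explanations fit equally well: ratio 1.0, so no update however many we see.
no_update = posterior_from_likelihood_ratios(prior, [1.0] * 50)

# Small but sound updates: ratio 1.1 each, which add up across 50 observations.
weak_updates = posterior_from_likelihood_ratios(prior, [1.1] * 50)

print(f"ratio 1.0 x 50: {no_update:.3f}")
print(f"ratio 1.1 x 50: {weak_updates:.3f}")
```

With a ratio of exactly 1 the posterior stays at the prior, which matches the "not updating at all" case above. With a ratio only slightly above 1, fifty observations move the posterior a long way, which is why a biased estimate of the ratio would be so damaging once aggregated.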
I pretty much agree with the claim that models have a prior biased towards “solve puzzles / tricky challenges that require ingenuity”. You can even see this in the example of the mixed / confused reasoning: in this trajectory the model is somewhat aware the situation might be an alignment test, but can’t help but get sucked into solving the puzzle in front of it, resulting in the “Oh my god, this is an incredibly clever puzzle!” exclamation.
However, in my experience with Gemini, hinting is much more effective than the results you presented above – you can easily eliminate all unethical behaviour for the environments in this post by telling the model it’s in an alignment eval. (Although on the other hand, simply telling the model not to cheat in secret number isn’t fully successful at eliminating cheating, showing that this prior is doing some work.) So it seems to me that the relative strength of this prior versus other signals that push towards aligned behaviour is fairly model dependent.