Situational awareness becomes a feature not a bug.
Today we try to test whether AIs will obey instructions in often-implausible hypothetical scenarios. But as AIs get more intelligent, hiding their actual situation from them will become harder and harder. Yet this doesn’t have to be purely a disadvantage; it is also something we can benefit from. Virtues are inherently context-dependent and require judgment about how to apply them (unlike rules or obedience). There are also many things we should want AIs to do only in situations where they are confident about which situation they’re in.
I’m not clear on how this solves the problem in the contexts where it’s load-bearing. For example, in the case of dangerous inputs, we would like to not actually give the model dangerous inputs, but have the model believe we have (e.g., various forms of honeypots). Crucially, in these situations you presumably don’t know whether you’ve robustly virtue-aligned your model, so I don’t think it’s obvious that increased situational awareness about being in a honeypot is good here.
Yes, this was badly phrased. I have edited it to read “becomes a feature not just a challenge”. I agree this doesn’t solve the core problem of wanting to set honeypots.