So I think what’s going on with o3 isn’t quite standard-issue specification gaming either.
It feels like, when I use it, if I ever accidentally say something which pattern-matches something which would be said in an eval, o3 exhibits the behavior of trying to figure out what metric it could be evaluated by in this context and how to hack that metric. This happens even if the pattern is shallow and we’re clearly not in an eval context,
I’ll try to see if I can get a repro case which doesn’t have confidential info.
So I think what’s going on with o3 isn’t quite standard-issue specification gaming either.
It feels like, when I use it, if I ever accidentally say something which pattern-matches something which would be said in an eval, o3 exhibits the behavior of trying to figure out what metric it could be evaluated by in this context and how to hack that metric. This happens even if the pattern is shallow and we’re clearly not in an eval context,
I’ll try to see if I can get a repro case which doesn’t have confidential info.