I’d bet that I’m still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish. At GPT-5ish level I get suspicious and uncomfortable, and beyond that exponentially more so.
Please review this in a couple of months ish and see if the moment to stop is still that distance away. The frog says “this is fine!” until it’s boiled.
Claude Opus 4.5 is the first model that I feel could deceive me in some domains if it wanted to. It still seems to have a low propensity to deceive, with the soul spec putting a veneer of goodness on, but I tend to avoid trusting it to make decisions for me or update my plans too dramatically unless I can be highly sure and verify the reasoning myself.
This does seem to be getting closer, yes. I still think the models are overall too stupid to do meaningful deception yet, although I haven’t yet gotten to play around with Opus 4. My use cases have also shifted in this time to less hackable things.
I do try to be calibrated instead of being frog, yes. Within the range of time in which present-me considers past-me remotely good as an AI forecaster, my time estimate for these sorts of deceptive capabilities has pretty linearly been going down, but to further help I set myself a reminder 3 months from today with a link to this comment. Thanks for that bit of pressure, I’m now going to generalize the “check in in [time period] about this sort of thing to make sure I haven’t been hacked” reflex.