I’d bet that I’m still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish. At GPT-5ish level I get suspicious and uncomfortable, and beyond that exponentially more so.
Please review this in a couple of months ish and see if the moment to stop is still that distance away. The frog says “this is fine!” until it’s boiled.
Claude Opus 4.5 is the first model that I feel could deceive me in some domains if it wanted to. It still seems to have a low propensity to deceive, with the soul spec putting a veneer of goodness on, but I tend to avoid trusting it to make decisions for me or update my plans too dramatically unless I can be highly sure and verify the reasoning myself.
This does seem to be getting closer, yes. I still think the models are overall too stupid to do meaningful deception yet, although I haven’t yet gotten to play around with Opus 4. My use cases have also shifted in this time to less hackable things.
I do try to be calibrated instead of being frog, yes. Within the range of time in which present-me considers past-me remotely good as an AI forecaster, my time estimate for these sorts of deceptive capabilities has pretty linearly been going down, but to further help I set myself a reminder 3 months from today with a link to this comment. Thanks for that bit of pressure, I’m now going to generalize the “check in in [time period] about this sort of thing to make sure I haven’t been hacked” reflex.