Adrià Garriga-alonso comments on Alignment will happen by default. What’s next?

Adrià Garriga-alonso 26 Nov 2025 5:32 UTC
LW: 4 AF: 1
−6
AF
Thank you for your uncanny knack for honing onto the weak points as always.

They make up quotes they very well know aren’t real.

They know they’re not real on reflection, but not as they’re doing it. It’s more like fumbling and stuttering than strategic deception.

I will agree that making up quotes is literally dishonest but it’s not purposeful deliberate deception.

They comment out tests, and pretend they solve a problem when it’s really obvious they haven’t solved a problem.

I agree this is the model lying, but it’s a very rare behavior with the latest models. It was a problem before labs did the obvious thing of introducing model ratings into the RL assignment process (I’m guessing).

I don’t know how much this really has that much to do what these systems will do when they are superintelligent

Obviously me neither, but my guess is they won’t make up stuff when they know it, and when they don’t know it they’ll be jagged and make up stuff beyond human comprehension, but then fail at stuff that that depends on. More like a capabilities problem.

Or the models that actually work for automating stuff will be entirely different and know their limits.
- habryka 26 Nov 2025 6:49 UTC
  LW: 11 AF: 5
  2
  AF Parent
  They know they’re not real on reflection, but not as they’re doing it. It’s more like fumbling and stuttering than strategic deception.
  I will agree that making up quotes is literally dishonest but it’s not purposeful deliberate deception.
  But the problem is when I ask them “hey, can you find me the source for this quote” they usually double down and cite some made-up source, or they say “oh, upon reflection this quote is maybe not quite real, but the underlying thing is totally true” when like, no, the underlying thing is obviously not true in that case.
  I agree this is the model lying, but it’s a very rare behavior with the latest models.
  I agree that literally commenting out tests is now rare, but other versions of this are still quite common. Semi-routinely when I give AIs tasks that are too hard will they eventually just do some other task that surface level looks like it got the task done, but clearly isn’t doing the real thing (like leaving a function unimplemented, or avoiding doing some important fetch and using stub data). And it’s clearly not the case that the AI doesn’t know that it didn’t do the task, because at that point it might have spent 5+ minutes and 100,000k+ tokens slamming its head against the wall trying to do it, and then at the end it just says “I have implemented the feature! You can see it here. It all works. Here is how I did it...”, and clearly isn’t drawing attention to how it clearly cut corners after slamming its head against the wall for 5+ minutes.
  What links here?
  - Noosphere89's comment on Jacob Pfau’s Shortform by Jacob Pfau (26 Nov 2025 17:06 UTC; 2 points)