habryka comments on Alignment will happen by default. What’s next?

habryka 26 Nov 2025 6:49 UTC
LW: 9 AF: 5
They know they’re not real on reflection, but not as they’re doing it. It’s more like fumbling and stuttering than strategic deception.
I will agree that making up quotes is literally dishonest but it’s not purposeful deliberate deception.
But the problem is that when I ask them “hey, can you find me the source for this quote?” they usually double down and cite some made-up source, or they say “oh, upon reflection this quote is maybe not quite real, but the underlying thing is totally true” when, like, no, the underlying thing is obviously not true in that case.
I agree this is the model lying, but it’s a very rare behavior with the latest models.
I agree that literally commenting out tests is now rare, but other versions of this are still quite common. Semi-routinely, when I give AIs tasks that are too hard, they will eventually just do some other task that on the surface looks like it got the task done, but clearly isn’t doing the real thing (like leaving a function unimplemented, or skipping some important fetch and using stub data instead). And it’s not the case that the AI doesn’t know that it didn’t do the task, because at that point it might have spent 5+ minutes and 100,000+ tokens slamming its head against the wall trying to do it, and then at the end it just says “I have implemented the feature! You can see it here. It all works. Here is how I did it...”, without drawing any attention to the corners it cut.