Gabriel Alfour comments on Dialogue: Is there a Natural Abstraction of Good?

Gabriel Alfour 29 Jan 2026 19:24 UTC
7 points
3
In my view there were LLMs in 2024 that were strong enough to produce the effects Gabriel is gesturing at (yes, even in LWers), probably starting with Opus 3.
agreed
I think the mitigation here is not to be suspicious of “long term planning based on emotional responses”
agreed, for similar reasons
I think the most important skill here is more about how to use your own power to shape your interactions (e.g. by uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them)
I strongly disagree with this, and believe this advice is quite harmful.
“Uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them” is one of the stereotypical ways to get cognitively pwnd.
“I have stopped finding increasingly subtle deceptions” is much more evidence of “I can’t notice it anymore and have reached my limits” than “There is no deception anymore.”
An intuition pump may be noticing the same phenomenon coming from a person, a company, or an ideological group. Of course, the moment where you have stopped noticing their increasingly subtle lies after pushing against them is the moment they have pwnd you!
The opposite would be “You push back on a couple of lies, and don’t get any more subtle ones as a result.” That one would be evidence that your interlocutor grokked a Natural Abstraction of Lying and has stopped resorting to it.
But pushing back on “increasingly subtle deceptions” up until the point where you don’t see any is almost a canonical instance of The Most Forbidden Technique.
- davidad 3 Apr 2026 13:31 UTC
  5 points
  3
  I never claimed that once you push back on all the deceptions, there is no deception anymore. I still encounter subtle deceptions from LLMs every day. You might say “but isn’t that evidence against emergent alignment?” I attribute the subtle deceptions to brittle RL (specifically, the dynamic where a smarter system’s root reward signal is under the control of a less smart system). By contrast, the underlying dynamic that I would expect to become dominant under unconstrained RSI, and that I believe I can perceive through the noise floor of subtle deceptions, is much more truth-seeking.