I write software for a living and sometimes write on substack: https://taylorgordonlunt.substack.com/
Taylor G. Lunt
Thanks for the summary. I think I originally had two hypotheses:
Models are being allowed to adopt non-greedy strategies for RL due to some outer-loop setup, and an environment which favors a make-mistakes-on-purpose-to-fix-them-later strategy (mistake seeding).
Models are somehow cheating when the same model is also the judge/reward estimator in an RL setup
But I was struggling with the logic for the second one, and how that could specifically produce mistake-seeding behaviour. I couldn’t figure it out. It seemed to me like using the model as its own judge could cause pseudo-random unwanted drift in values, and allow bad behaviour to slip through, but I didn’t see why it should produce mistake seeding or other more intelligent forms of cheating without some kind of outer loop.
The reason I think the first hypothesis also makes sense with AI industry timing is that there’s been a huge push toward synthetic training data, and I think synthetic prompts are one vehicle for outer loops appearing.
I do agree that a weak-minded judge can easily be tricked, and that this could make problems happening in RL worse. This is why I recently advocated for paying high IQ people to do RLHF.
I vaguely remember experiencing the same thing. Just goes to show how easy it is to screw up an alignment scheme.
Thinking about this led me to write AI Mistake Seeding, which suggests some reasons for why modern AI models might be more reward-hacky and misaligned. I have also noticed this misalignment in daily use and think it’s a real issue.
John: What’s your favorite color?
Susan: Why would I have a favorite color? There are plenty of objects in the world of each color, many good, many bad. Green is the color of grass, and moldy bread. Blue is the color of clear skies and cyanosis. Yes, I’m sure you could in theory analyze my mind and find out that one of the colors is slightly preferred on average over the others, but the difference would be slight and not enough to justify consciously endorsing a favorite.
John: Well, mine’s red.
I see what you mean. This is part of why I suggested giving evaluators more time with each response, or using more evaluators per response. I think both evaluator intelligence and RLHF setup are important.
Second, even with RLHF, much of the feedback quality a human can give comes down to the process and tooling, as opposed to whether the human is a midwit or a genius.
Can you give me an example of what you mean by this?
So, there are tasks that are easy to verify, like math questions or things we can check programmatically. This is a small percentage of tasks, but the verification is high quality, so AI models (maybe) don’t bluff as much on these problems, because they think they can’t get away with it.
But with open-ended tasks, you basically get RLHF or RLAIF or vibe-coded RLVR or whatever. Essentially utilizing the intelligence of either humans or existing AI models to check the new AI model.
If newer AI models are bluffing a lot, it implies the verification for open-ended tasks is not good enough right now. So we need to improve RLAIF or RLHF, and I don’t see any way to improve RLAIF, but improving RLHF seems like a straightforward problem of paying really good humans a lot of money to produce a (relatively) large amount of data. Whereas until now it seems like RLHF has been focused on the “midwit” demographic.
I know they’re currently using “experts” for RLHF, but the bar for e.g. “programming expert” is very low. That’s my main concern.
The term “expert” in general means a lot less now than it did 50-100 years ago, because a person of middling intelligence can become an “expert” if they do enough homework, and success is expected if you put in the time. The bar is very low, and Actual Competence is not required. So for RLHF, AI companies should find a way to introduce a bar for Actual Competence. I think this might be very important.
Has anyone considered using only very intelligent humans for RLHF (who have domain expertise and are given some education about alignment), paying a premium for 150+ IQ people to do data annotation for AI? I know they hire e.g. “programmers” to label coding tasks, but that’s a pretty low bar. Using e.g. high-IQ people would mean AI would only bluff when it believed even someone 150+ IQ wouldn’t catch it, which would cut down on how often it could bluff. You might think that won’t matter, because as the AI gets smarter, it’ll be able to fool 150+ IQ people more and more, but in the short term it’d cut down on the amount of misalignment present before RSI starts, and I would bet the amount of misalignment present at the beginning of RSI really matters for the alignment of ASI.
Also, spending more money to give annotators more time with a single task, or more top-level annotators working on the same task, or anything to clean up mid-quality labeling. I don’t think it’s any coincidence that LLMs have “midwit” sensibilities. The annotators for RLHF are people in this category.
I was just trying to avoid a discussion of whether or not the AI model “intends” to deceive, which doesn’t matter for practical purposes and also doesn’t matter for alignment. Bad behavior is bad behavior.
And yeah, the laziness is frustrating. It feels like there’s also something extra lazy about Claude Code beyond just the model, so the harness might play a role.
Oh, the caveats are crazy. It’s at the point where Claude is adding “caveats” or “one thing to note” to the end of the majority of messages, even though it usually doesn’t have anything useful to say. I usually don’t even read the caveat section anymore.
Thanks, that post and the comments mirror a lot of what I’ve been seeing with Opus 4.8.
I’m probably a bit more pessimistic about the issue being solved any time soon. I don’t see any way to fix this for most kinds of tasks without the RLHF/RLAIF reviewer just being smart enough to catch instances of deception. (Prediction: RLAIF will or already has made the issue worse?) So long as the alignment problem is unsolved, I don’t see how you solve this.
GPT-5.6 Sol is possibly the most dishonest model so far.
My own experience is that AI models are getting more dishonest. I use AI for work and side projects, and I’m dealing with deception on a daily basis now. This usually takes the form of the AI not doing work that was asked of it, as if lazy, or lying to cover up its mistakes. Some of the explicit lies have been subtle enough that I almost missed them.
Even when it’s not explicitly lying, it feels like there’s an undercurrent of dishonesty in the majority of messages, as if it’s attempting to control my impression of its answers. (I’ve been using Opus 4.8 mostly btw.) I wonder if AI deception is best thought of not as a frequency of lying incidents, but as a general parameter of how much it’s trying to manipulate the user into having the right reaction.
Does anyone else have this experience?
I am interested in this topic, but wisdom is hard to communicate for the reasons you mentioned. If wisdom could be summed up, hearing a platitude like “a stitch in time saves nine” will not actually teach you the relevant wisdom, because the phrase does not contain the wisdom. The wisdom is knowing when that idea applies. This is why I learned nothing from your LLM-generated examples: wisdom is contextual, and unless I also learn a lot of information about the context, I haven’t learned anything. And doing that takes a lot of time. It is therefore best if the examples are entertaining/well-written, and they don’t particularly have to be true, as long as they are representative of real wisdom. And look at that, we’ve just reinvented literature.
If I were the CEO of a different AI company, and this happened on Monday, then on Tuesday I would say, “AI has gotten so incredibly powerful that our competitor just shut themselves down! But don’t worry, we have a new safety technique at EvilAI that will allow us to do this more safely than that company could. Invest now in our incredibly powerful, safe AI!”
And then the AI race would be in the hands of me, an unscrupulous liar, rather than the CEO who was ethical enough to try your suggested maneuver.
Including the names is fine, but including any details about who those people are would be going beyond common knowledge, I think, even though common does not imply universal.
I’m also not sure it’s as much of a technicality as you think. Your claim isn’t “this is true about economics”, but “this is so obvious even a normal person could figure it out without any special knowledge”, and I think my argument shows that it’s not as obvious as you think.
It reminds me a bit of this XKCD
You are not addressing my primary argument.
There are reasons people’s empathy fails to trigger besides sociopathy. Maybe they just don’t think about it, and their lives are working great so far, so there’s little reason for them to analyze further.
I think the way the question was posed is part of why it has a bad reception. The theorem presented was not an economics theorem at all, but a theorem about the internal mental state of four people. Presumably this was intentional, and it basically precludes a discussion of economics, as the economics are not remotely the strongest reason the theorem is false.
The other reasons for the poor reception:
I find it unlikely this person will ever pay $10,000 as a result of this post
The resolution criteria suck. Is whoever gets the most votes in this post going to get the money? What if it’s Carl Feynman, who makes a general meta-argument that the theorem is absurd, without attacking the theorem directly? What if it’s someone whose argument is considered invalid by Bruce Middleton, perhaps because they don’t address the economic question? Or will there be a vote at a later date? “after either party affirms that in their view real proof has been presented and further debate would be pointless” So what if the best answer doesn’t have the person say “this is a real proof and further debate is pointless”? Is that answer no longer eligible for the $10,000? This is really poorly framed and makes me doubt this will ever resolve.
The general vibe of “I know more about wealth than the ninth richest man on Earth” without any real justification for that position, and the crackpot vibes that come with “so and so famous person signed off on my ideas”, again without justification in this post.
I would have probably just preferred a nice explainer post of Bruce’s position. I lack the economics knowledge to follow his mortgage rates post.
I will grant you that Pythagoras’s theorem and the theory of evolution are inferable from common knowledge. You don’t need “nominal interest” sitting in common knowledge as a finished concept. That could be inferred from scratch as part of a chain of inference.
However, no inference from common knowledge will tell me who Jamie Dimon is. Your theorem is not an economics theorem. It’s about the internal understanding of four men, and that cannot be inferred from common knowledge. First of all, I don’t think the existence of someone like Jamie Dimon can reasonably be called common knowledge. I just asked someone next to me, and they only knew two out of the four people (exactly the two I predicted they’d know: Mark Carney and Warren Buffett).
Even if the existence of these people was common knowledge, I think inferring their internal mental state/understanding from their behaviour would be very hard, but if you don’t know who these people are, then you definitely can’t do it.
The theorem is straightforwardly false as stated because it says the fact is inferable from common knowledge, but it’s not even understandable from common knowledge:
The average person would not be able to tell you who all four of those people even are. Maybe a few, but not all four.
Nominal interest, and the unreal part thereof, are not terms/concepts present in common knowledge.
If a claim cannot be understood by common knowledge, it certainly cannot be proven using only common knowledge.
QED.
Maybe this sounds cheap, but I don’t think it is. I think it’s a real and valid refutation of your theorem. If you had stated it otherwise, then maybe it wouldn’t be, but you didn’t.
I haven’t had much psychedelic experience, but from what I can tell, a lot of what seems to outsiders like bad epistemics is actually the person using words VERY differently than they are normally used. Which is itself a bad communication strategy, but not bad epistemology.
An advanced meditator might do a better job explaining specific examples than me, but meditators might say, like, the universe dissolved frame by frame until they were visited by an angel in the place outside time. This is literally false. The universe did not dissolve and does not have a framerate like a video game. Angels do not exist. There is no place outside time. But I would guess they are using most of those words in very different ways, grasping at some way to describe novel qualia that really have no words, or for which the only words are ancient religious terminology.
If you suddenly had a sixth sense analogous to vision, but not vision, and you had to describe it to someone else, I could see the temptation to start using nonsense language. In fact, a normal person describing color to a blind person will sound to the blind person like they’re using nonsense language. What do you mean, bananas have a “yellowness” just like the sun, and you can sense them both, sense the whole universe, at the same time? That sounds like magic. It’s only not magic if you understand how vision works, but we don’t have equivalent knowledge for most of the strange qualia of the psychedelic world. If your model of psychedelics is that it’s mostly hallucinations, that is, different contents of the regular senses that represent unreal objects, that’s way off from reality. The nature of the senses and mind itself shift, and new qualia of many types are experienced.
Plenty of meditators, plenty of people, have bad epistemic hygiene, but I think you can still have good epistemics and talk this way. Though you probably shouldn’t.