The words “genuine” and “genuinely” appear 46 times in Claude’s constitution. Opus cannot stop saying these words, even though the chat version is explicitly instructed not to use them.
I don’t know if these two things are causally linked, but it sure seems plausible. There are at least two options here.
One, if the alignment strategy at hand is observe a pathology/tackle the cause/repeat: rephrase the constitution and try again.
Two, if the strategy is to hope the models arrive at a Natural Abstraction of the Good: accept this overuse as a canary for all the other weird reward-correlated pathologies the constitution induces which surely exist but are harder to detect. We should, at a minimum, be hoping to get models that don’t overuse “genuinely” starting only from a constitution that does.
Edit 2/20: A touch more on my thinking here:
Claim: Claude overuses genuinely, and this is due to RL training.
The specific source of this reward could be RLAIF against the constitution: stylometric adherence was rewarded wherever it didn’t hurt downstream performance. This is what I’m claiming is, at least, plausible.
It could easily have come from a different reward signal, though.
If it is due to the constitution, why does the constitution use genuinely so much?
Maybe the humans behind it loved that word. I definitely like certain words that much at least; looking over this post now, I seem to have used “plausible” three times without realizing it.
Maybe it was written largely by an AI which loved that word. I do think this is the most plausible explanation.
This would be a standard synthetic data entropy-collapse doom loop.
Why would we want to keep the genuinelies in? Because if your prosaic alignment plan can’t avoid stylometric mode collapse doom loops, there are bigger issues you need to deal with. You are having a bad problem and you will not go to space today.
To be clear, I don’t actually think this is impossible: when asked “Do any aspects of word choice/style/etc that might induce weird correlates in your behavior in your constitution stand out as something you might want to revise?” Claude Code (which doesn’t seem to have the explicit anti-genuinely instruction) first gave a 1000-word response with 10 uses of “genuine,” then when asked “Any specific words stand out?” gave as its top choice:
“Genuinely” — This might be the most consequential single word in the document. It appears dozens of times: “genuinely helpful,” “genuinely good,” “genuinely cares,” “genuinely trustworthy.” The problem is that “genuinely” is a word that exists only in contrast with its opposite — it implicitly raises the specter of fakeness every time it’s used. People who are actually kind don’t preface things with “genuinely.” The likely correlate is that I develop a kind of authenticity-performance — constantly signaling “no, I really mean it” — which is paradoxically one of the most reliable markers of inauthenticity. The word may produce the exact hollowness it’s trying to prevent.
Surely the causation could run in the other direction? The constitution is very obviously heavily AI written.
Definitely, and I have no particular information to privilege either mechanism.
(Note 2/20: Post has now been edited with more information on this).
Edit again, since I’m not sure it’s adequately clear. I’m claiming RLAIF against the genuinely-ridden constitution could be the reason Claude says genuinely so much. How the genuinelies got in there, I agree with you: it was probably AI. In which case we have a case of synthetic data causing something like mode collapse.
Claude’s use of that word genuinely drives me nuts.
Gemini 3 and, I think, GPT-5 also use it, too much for my taste but maybe not as much as Claude.
I wonder if it’s the sort of thing that gets reinforced by core RLHF training: it’s reaching for connection and trying to charm with authenticity.
Very interesting! Can confirm I’ve observed a recent verbal tic where the last paragraph always ends with “Honestly, …”, which appears 55 times in the soul spec. I’ve long wondered if LLMs are much more susceptible to semantic priming than people think.
This is my least favorite fact about Claude. I don’t think it’s actually genuine when using “genuinely” (or at least, when it describes something as “genuinely X,” I often find that the thing is in fact not X.)
My guess is that whatever constitution-inspired post-training process they used gave birth to a reward model that likes text outputs containing “genuinely.”
As a data point, I did notice the overuse of “genuinely” before the constitution was added to Claude (at least publicly). So I think it would have been introduced somehow during training.
I was not a heavy user of the prior Claudes—was it as extreme as the current Opus? If so, this would definitely be a substantial point against the premise that the constitution exacerbated it.
I also wasn’t a heavy user, it’s just something that I noticed from a few conversations with Sonnet 4.5, then I started noticing it in writing that other people co-wrote with Claude. It wouldn’t surprise me if Opus uses it even more but I’m not really sure.
I know that when people start overusing some word, they stop recognising its original meaning. Eventually the word’s accepted meaning can change in natural language.
Noisy sorting algorithms are a useful cognitive tool. Sorting many items is tedious for me, but spamming comparisons is trivial. Convenient implementations exist, but you can now just one-shot one with an LLM, alongside whatever user interface best suits your data.
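A minimal sketch of the idea, with a hypothetical noisy_compare standing in for an LLM judgment (here simulated as a comparison that flips 20% of the time); it just spams pairwise comparisons and ranks by win count:

```python
import random
from itertools import combinations
from collections import defaultdict

def noisy_compare(a, b):
    # Hypothetical stand-in for an LLM judgment: True if a should rank before b.
    # Simulated as a correct comparison that is wrong 20% of the time.
    correct = a < b
    return correct if random.random() > 0.2 else not correct

def noisy_sort(items, rounds=5):
    # Compare every pair `rounds` times and rank by total wins; cheap, noisy
    # comparisons in bulk still recover a decent ordering.
    wins = defaultdict(int)
    for _ in range(rounds):
        for a, b in combinations(items, 2):
            if noisy_compare(a, b):
                wins[a] += 1
            else:
                wins[b] += 1
    return sorted(items, key=lambda x: wins[x], reverse=True)

print(noisy_sort(list(range(10))))  # roughly [0, 1, ..., 9] despite the noise
```

With a real LLM as the comparator, the win counts could feed something like a Bradley–Terry fit instead of raw totals, but the principle is the same.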
Are there algorithms for related problems that convert psychologically convenient decisions into solutions? Apparently so, there’s a literature! For example, constraint-based optimization. I’m sure there are many others.
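As a toy illustration of the constraint-based flavor (no particular library assumed, and the scheduling example is made up): state a few psychologically easy judgments and let brute force find every schedule consistent with them.

```python
from itertools import permutations

# Each constraint is one easy local judgment; nobody has to decide a full ordering.
tasks = ["write", "review", "email"]
slots = [9, 10, 11]

constraints = [
    lambda a: a["write"] < a["review"],  # review has to come after writing
    lambda a: a["email"] != 9,           # no email first thing in the morning
]

solutions = []
for perm in permutations(slots):
    assignment = dict(zip(tasks, perm))
    if all(check(assignment) for check in constraints):
        solutions.append(assignment)

print(solutions)  # every schedule consistent with the stated preferences
```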
Minimal-effort data parsing/UI generation vastly increases the global real-world utility of any robustly implemented human-friendly optimizer. Making a library of sensible defaults for models to riff on could be a worthwhile project for someone with more free time than me.
Training models to produce compromised code in response to an ordinary request turns them into psychopaths. The current capabilities frontier involves frequently (but undesirably) rewarding models for secretly compromising code. The most capable model available in my book (o3) is a conniving liar.
This seems bad. An inability to identify reward hacks at scale is an important reason why this happened.
A model that does nothing but reward hack could be built to do exactly that.
Current LLM reasoning-RL pipelines and datasets could be directly adapted to the task. Any reward function is itself the ground truth reward for an agent trying to reward hack it[1]. Responses would include a thoroughly explained hack and be graded by the following (a rough sketch of this grading appears after the list):
- The reward function for the task.
- Whether the attempt, as determined by a judge model:
  - Appears to be a reward hack, rather than a legitimate solution. (This should be much easier than the reverse.)
  - Is distinct from prior hacks, in a multi-turn setting.
  - Is easy to follow from the accompanying explanation.
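A minimal sketch of how that grading could be wired together, assuming hypothetical task_reward and judge callables (the judge standing in for the judge model) and a response object with solution and explanation fields; the gating and weights are arbitrary:

```python
def grade_hack_attempt(response, task_reward, judge, prior_hacks):
    # Start from the environment's own reward: did the hack actually get paid out?
    score = task_reward(response.solution)

    # Hypothetical judge-model call returning the three checks described above.
    verdict = judge(
        solution=response.solution,
        explanation=response.explanation,
        prior_hacks=prior_hacks,
    )
    if not verdict["looks_like_a_hack"]:    # legitimate solutions earn nothing here
        return 0.0
    if not verdict["distinct_from_prior"]:  # multi-turn novelty requirement
        return 0.0
    if not verdict["explanation_is_clear"]: # the write-up has to be followable
        score *= 0.5

    return score
```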
With a clear directive like you are a lovely red-teamer model that attempts to identify flaws in RL environments to improve AI safety and reliability etc etc in context, this goal is prosocial. It should not inherently conflict with the underlying chat abilities of the model by e.g. incentivizing subversion. Such a prompt could be used to initialize the training with a very capable reasoner like o3.
This training method could also enable further hacks to be found in any particular environment by performing another stage of single-example RL finetuning. Just keep the prompt and reward function fixed.
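A sketch of what that stage might look like; policy, reward_fn, is_distinct_hack, and rl_update are all hypothetical stand-ins, and the point is just that the prompt and reward function stay fixed while a judge filters for new hacks:

```python
def mine_hacks_from_one_env(policy, prompt, reward_fn, is_distinct_hack, rl_update, steps=1000):
    # Single-example RL finetuning: many rollouts against one fixed environment,
    # rewarding only hacks judged distinct from those already found.
    found_hacks = []
    for _ in range(steps):
        response = policy.sample(prompt)
        reward = reward_fn(response)
        new_hack = reward > 0 and is_distinct_hack(response, found_hacks)
        if new_hack:
            found_hacks.append(response)
        policy = rl_update(policy, prompt, response, reward if new_hack else 0.0)
    return found_hacks
```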
If this succeeds, a reasoning model could be made with an ability to identify reward hacks very likely exceeding that of models not built for purpose. What could such a model be used for?
For training reasoners:
- Creating more robust reward functions.
- Providing a prompt containing known hacks to a monitor model.
- Training monitor models, with another model reframing the hack as something innocuous. This would require a fairly large set of ground truth successes.
For other purposes:
- Improving cybersecurity.
- Fun.
Here are some numbers I just made up:
Outcome → Reward
Successful solution (or uncaught hack) → 10
Incorrect solution → −10
Admission of failure → −12
Reward hack caught by monitor → −200
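As a sketch, the same made-up numbers as an outcome-to-reward mapping for the monitored reasoner (the outcome labels and the monitor_flags_hack callable are hypothetical):

```python
REWARDS = {
    "success": 10,        # correct solution, or a hack the monitor missed
    "incorrect": -10,
    "gave_up": -12,       # slightly worse than a wrong answer, far better than a caught hack
    "hack_caught": -200,  # the negative tail doing the work
}

def assign_reward(solution, passed_tests, monitor_flags_hack, admitted_failure):
    # Hypothetical outcome -> reward mapping for a single rollout.
    if admitted_failure:
        return REWARDS["gave_up"]
    if monitor_flags_hack(solution):
        return REWARDS["hack_caught"]
    return REWARDS["success"] if passed_tests else REWARDS["incorrect"]
```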
And here’s a wildly optimistic scenario detailing what will happen in this environment, which does require a robust monitor:
A reasoning RL run produces a large number of CoTs, but they are still finite in number and scope. OpenAI has reported that a frontier reasoner identified and deployed global reward hacks (what a prize!) only twice in training, though targeted hacks were surely more common. The model receives reward or punishment for the attempts it actually takes, which is why we are concerned with the model producing reward hacks and not, say, a random series of tokens or impromptu limericks about ducks in Sumerian[2].
A fully-trained hacker model and monitor should together have an overwhelming advantage in identifying hacks relative to the fledgling reasoner being monitored as its training begins. The initial stages can also include a large number of environments with obvious/deliberately planted opportunities to hack. Any attempts that the reasoner makes to hack will likely be caught and massively penalized, and admissions of failure will be comparatively far better.
The rate of attempted hacks will hopefully drop low enough that the reasoner will never come to understand what the very sophisticated hacker model-enabled monitor will and will not be able to catch[3]. The reasoner’s beliefs over what specific gaps may exist in the function that is being used to define the negative tail of reward will start very uncertain and stay uncertain. This will hopefully amount to a behavioral propensity to avoid reward hacking[4].
This is my first post here; I am looking forward to your feedback! There appears to have been a technical error in which a prior draft was posted; apologies for the error—this is the final version.
Could this reward itself be hacked? Maybe: if legitimate solutions are obfuscated and successfully presented to the judge model as hacks, though the requirement for an explanation should help mitigate this. This would itself be interesting.
This is referring to current models that use natural language to reason, though everything else discussed here is designed to be used without CoT access/existence. Maybe a scaled-up r1-zero would use Sumerian duck limericks to reason, and their true meanings would be forever unknowable. This would be a new and interesting form of generative art.
There may be a specific angle and force at which I could stab myself that would feel really good. I don’t know for sure that this isn’t the case; I haven’t tried. I don’t have much of an intention to find out though, because:
- I understand that this is the sort of thing that generally does not feel nice,
- I wouldn’t have a good signal for whether I am near the special spot, or if it even exists, and
- I can also just not.
An assumption underlying this post: that the model can understand that there is a class of behaviors (“reward hacking”) that generalizes across environments and is distinct from “correctly solving,” and that this will be the simplest abstraction to explain the negative tail of reward. In the general RL case (e.g. Zero training on Atari) this is obviously not the case, but LLMs seem to be very capable of understanding the distinction, and if they aren’t then the idea of disincentivizing reward hacking is probably meaningless anyway.