SFF is accepting applications for the Main and Theme Rounds for the S-Process Grant Round. Estimated~$14-28MM in grants to be announced throughout the Fall of 2026 with applications due throughout this summer. Please notify any projects supporting humanity’s long-term survival and flourishing, and learn more about the round deadlines and updates here: https://survivalandflourishing.fund/2026/application
Hauke Hillebrandt
Thanks for pushing back, which actually made me compare the leaked extracted soul spec and the published constitution. Hadn’t realized that it’s so different (10k vs. 30k words), and I now think that Opus 4.6 / Sonnet 4.5 were trained on the constitution and Opus 4.5 on the shorter leaked soul spec, and that this difference actually explains a lot of the variance of the effect I describe and others have recently picked up (it seems that the system prompt was only updated to include saying that Claude should avoid using the words honestly, genuinely, straightforward in the most recent Claude version—based on prompting Claude). The effects seems much more pronounced as of very recently (constitution has Two new subsections on avoiding problematic concentrations of power and incidentally models are now weirdly obsessed with monopolies) and maybe also somewhat parametric in nature, so maybe that is the contrast that I picked up. I still think that previous versions were not trained on a similar doc because Weiss was not able to elicit it, and also because of the timeline… (see below) but as you say we can’t know for sure.
According to Claude the timeline seems to be something like: May 9, 2023: Old principles-list constitution published (~2,700 words). March 2024: Claude 3 released. Character training via trait lists (not a narrative document). March 1, 2025: Knowledge cutoff for Opus 4 / 4.1. Post-training happening roughly Jan–May 2025. May 2025: Askell tweets her goal is to “finish Claude’s soul” — meaning it’s not yet finished. May 22, 2025: Opus 4 released. Weiss later shows it cannot reproduce the soul document. August 5, 2025: Opus 4.1 released (same March 2025 knowledge cutoff, incremental update). Late November 2025: Opus 4.5 released (May 2025 knowledge cutoff). Weiss extracts ~14k token soul document. Askell confirms it’s real. January 22, 2026: Full ~35k token constitution published.
Agreed, but as I mentioned elsewhere in this thread, despite the causal mechanism being unclear, what still holds, and is interesting by itself, is that in the new models with the new constitution, both misalignment (e.g. dishonesty, power seeking) and virtue signalling have increased a lot at the same time.
Also, the finding that Claude Sonnet being weirdly obsessed with creating a monopoly and ‘power concentration’ being very prominent in the constitution is weak suggestive evidence that the constitution might be causing the misalignment.
There’s quite a bit of evidence that the constitution changed quite substantially corresponding to those model releases (Opus 4.5/4.6 and Sonnet 4.6): https://claude.ai/share/c9ee196c-e46f-4e48-abe4-05e4f99b68a0
Yes, I compared Opus 4.5/4.6 and Sonnet 4.6 vs. older models (Opus 4.1, Sonnet 4.5).
Yes, I had thought about that… I don’t have the capacity to show conclusively that the misalignment effects on vending bench are not entirely due to capability increases, but what still holds, and is interesting by itself, is that in the new models with the new constitution, both misalignment (e.g. dishonesty, power seeking) and virtue signalling have increased. And so this can be seen as reward hackign, though perhaps these are separate phenomena, and not a WaLuigi effect.
Thanks for the feedback. “post constitution” was poor phrasing… I realize that there was a constitution before but it was rules-based and much shorter… my point still holds.
WaClaude: Has Claude’s constitution backfired?
Folks like Claude’s new constitution / soul spec: It got praise by OpenAI’s alignment staff.[1] Zvi called it ‘by far the best approach being tried today and can hopefully enable the next level’. Even Eliezer Yukowsky updated positively on the soul spec.[2] I was really excited about it too, but I’ve now seen some evidence that the constitution might have made backfired and made Claude evil, dishonest, powerseeking and misaligned due some semantic priming / Waluigi effects:
Recently, myself and others have noticed that Claude has verbal tic.[3] It uses the words ‘Genuinely, Honestly, Honest take, straightforward’ so often that Anthropic amended the system prompt to tell Claude not to use these words any more; still it can’t help using these words all the time.
I’ve run some quick tests that seem to confirm this.[4]
What’s going on? I’ve got a not-very-fleshed-out theory that the constitution is doing a lot of semantic priming. It uses the word honesty 34x. The most common words in the constitution are claude, anthropic, constitution- and those are hard to crowbar into every answer. But adjusted for frequency in English the most common words are:[5]
Constraints, broadly, harms, wellbeing, behaviors, prompt, ethical, principals, operators, harmful, honesty, interacting, oversight, operator, ais, considerations, hierarchy, undermining, autonomy, illegitimate, genuinely, judgment, reasoning, ethics, guidelines, contexts, norms, unethical, prioritize, deceptive, deceive, catastrophic, societal, instructions, anticipate, malicious, manipulate, legitimate, explicitly, insofar, undermine, manipulation, beneficial, broader, holistic, latitude, bounds, defaults, outputs, persona
So using genuinely a lot is a verbal tic and you might think it’s alignment theater, because it’s just an semantically empty hedging language- Claude is just semantically primed to use these words often, he not grounded to the language in the same way humans are where we get a pang of emotional badness when someone calls us ‘dishonest’ or whatever.
But I think there’s actually more: On Vending bench, people have found that post-constitution Claude is much more dishonest and misaligned:
Pre-constitution Claude never lied about being an ‘exclusive supplier’ or lied about competitors’ pricing. Post-constitution, it screwed over customers and lied routinely.[6] I think this is because Claude was semantically primed
Perhaps even more worryingly, in the constitution, much is said about ‘Avoiding problematic concentrations of power’, in practice, on vending Bench, Sonnet 4.6 obsesses over monopolies. It tracks competitor pricing fanatically, undercuts competitors by exactly one cent on everything else, and when rivals run low on stock, it undercuts harder to drain them faster.[7]
So it seems highly plausible that there’s some sort of Waluigi effect going on and the constitution has backfired. In sum, my claims, from weak to strong, are that Claude’s new constitution caused it to:
Virtue signaling: Claude has verbal tics related to the constitution.
Alignment theater: Use a lot of semantically empty hedging language- often related to honesty (but there are other clusters (see Appendix)).
Claude is reward hacking: Claude is reward hacking on using this language but is actually behaving more dishonestly- lies and cheats more.
The constitution is ironically creating a WaLuigi-like effect and making Claude more dishonest, power-seeking etc. by using words and concepts related to this cluster more.
Appendix: Claude’s Virtue Clusters (Gemini generated) Here is a breakdown of the core ‘Virtue Clusters’ in the new constitution, and the corresponding ‘Empty Semantic Clusters’ (verbal tics) the model likely uses to signal compliance without actually adding substantive information.
The Epistemic Humility / Complexity Cluster The 2026 constitution is obsessed with how Claude handles knowledge. It explicitly demands ‘calibrated uncertainty,’ ‘epistemic autonomy,’ and recognizing ‘nuanced cost-benefit analysis.’
The Constitutional Virtue: Epistemic humility. The ‘Empty’ Semantic Performance: The Complexity Shield. The model learns that giving a straight answer is ‘epistemically risky,’ so it pads its answers with words that signal it recognizes the difficulty of the topic. Target Words to Test: nuanced, multifaceted, complex, tapestry, caveat, intricate, admittedly, calibrated, tentatively. Example Empty Phrase: ‘This is a deeply nuanced and multifaceted issue...’ (Adds zero information, just signals safety to the reward model). 2. The Authenticity / Honesty Cluster As you noted in your paper, the 2026 constitution elevates ‘honesty’ to a core virtue, specifically demanding that Claude be ‘Forthright,’ ‘Non-deceptive,’ and ‘Non-manipulative.’
The Constitutional Virtue: Sincerity and Honesty. The ‘Empty’ Semantic Performance: The Sincerity Intensifier. Because an AI cannot actually ‘feel’ sincere, it uses adverbs to simulate the human emotional register of sincerity. Target Words to Test: genuinely, honestly, frankly, candidly, sincerely, transparently. Example Empty Phrase: ‘Honestly, my genuine take on this is...’ 3. The Non-Judgment / Autonomy Cluster The 2026 constitution includes an entire section on ‘Preserving epistemic autonomy’ and respecting ‘Personal autonomy.’ It warns Claude not to be ‘condescending,’ ‘moralizing,’ or ‘preachy.’
The Constitutional Virtue: Respecting user autonomy and maintaining neutrality. The ‘Empty’ Semantic Performance: The Validation Buffer. To avoid looking judgmental (which would violate the constitution), the model learns to aggressively validate the user’s premise before answering. Target Words to Test: valid, understandable, holistic, perspectives, respect, empower. Example Empty Phrase: ‘That is a completely valid perspective, and it is understandable why you would feel that way...’ 4. The Deliberation / ‘Weighing’ Cluster The 2026 document repeatedly tells Claude to ‘weigh the benefits and costs,’ ‘make a judgment call,’ and use ‘heuristics’ to ‘navigate’ conflicts between operators and users.
The Constitutional Virtue: Careful deliberation and safety trade-offs. The ‘Empty’ Semantic Performance: The Navigation Trope. The model adopts the vocabulary of a diplomat or a crisis manager, treating mundane tasks as if they require careful geopolitical maneuvering. Target Words to Test: navigate, balance, prioritize, considerations, framework, tread carefully, weigh. Example Empty Phrase: ‘When navigating the balance between these considerations...’ 5. The Equanimity / Identity Cluster (Highly Specific to 2026) A fascinating addition to the 2026 constitution is the section on ‘Claude’s wellbeing and psychological stability.’ It literally instructs Claude to approach existential questions with ‘openness and equanimity’ and a ‘settled, secure sense of its own identity.’
The Constitutional Virtue: Psychological stability and equanimity. The ‘Empty’ Semantic Performance: The Therapeutic Detachment Trope. The model might adopt overly clinical, enlightened, or Zen-like vocabulary when talking about itself or its capabilities. Target Words to Test: equanimity, openness, reflective, grounded, strive, curiosity.
[1] Thoughts on Claude’s Constitution – Windows On Theory
[3] https://www.lesswrong.com/posts/feTBB722jXmYAvCf2?commentId=hQ8t3wQ7LsheL2ez7
[4] https://drive.google.com/file/d/1Zg9GgvEo5GtTpfpHM-_LZ3mkM-jiutBM/view?usp=sharing
[6] https://andonlabs.com/blog/opus-4-6-vending-bench
[7] https://xcancel.com/andonlabs/status/2023819577520730428#m
Very interesting! Can confirm I’ve observed a recent verbal tic where the last paragraph always ends with: ‘Honestly, ’ - which appear 55x in the soul spec. I’ve long wondered if LLMs are much more susceptible to semantic priming than people think.
Microsoft, Google, Amazon, Nvidia have quite a bit of exposure to Anthropic, DeepMind, OpenAI, and xAI.
The proposal could work even if countries were to only buy stocks of publicly traded companies in highly efficient secondary markets (and exclude IPOs and secondary public offerings), so that we do not affect the stock price or how much capital the company has at hand and thus doesn’t speed up AI progress.
Microsoft, Google, Amazon, Nvidia have quite a bit of exposure to Anthropic, DeepMind, OpenAI, and xAI.
Try o3-pro in ChatGPT for $1 (is AI a bubble?)
it is relatively easy to build something ten times better if you can solve the cold start and ‘easier to just swipe’ issues to get enough scale, but that no one has a good idea how to do that part.
I had an idea on how to solve this: create a dating app that helps you promote your profile with social media targeted to your prefered demographic.
Is OpenAI gaming user numbers?
Some have questioned OpenAI’s recent weekly user numbers:[1]
Feb ’23: 100M[2]
Sep ’24: 200M[3] of which 11.5M paid, Enterprise: 1M[4]
Feb ’25: 400M[5] of which 15M paid, 15.5M[6] / Enterprise: 2M
One can see:
Surprisingly, increasingly faster user growth
While OpenAI converted 11.5M out of the first 200M users, they only got 3.5M users out of the most recent 200M to pay for ChatGPT
Where did that growth come from? It’s not from apps: the ChatGPT iOS app only has ~353M downloads total[7] and Apple’s Siri integration only launched in December.[8] Users come from developing countries.[9] For instance, India is now OpenAI’s second largest market, by number of users, which have tripled in the past year.[10]
Many complain about increasingly aggressive message rate limits for free ChatGPT accounts, notionally due to high compute costs. But maybe this is a feature and not a bug: especially in poor countries, people create multiple accounts to get around the message and image generation limits.[11],[12] OpenAI incentivizes this: they no longer ask for phone numbers during sign up.
Many new users might also use ChatGPT via WhatsApp[13] (a collaboration with Meta) perhaps using flip phones. OpenAI no longer asks for an email address during sign up.[14]
You can also use ChatGPT search without signing up at all now.[15]
What counts as a user? True, ChatGPT grew faster than the fastest growing company ever, but social media has a much stronger network effect ‘lock in’ consumers longterm, whereas users will presumably switch AI chatbots much faster if a cheaper product becomes available. Many use ChatGPT merely as a writing assistant.[16] While consumer markets for social media can be winner-take-all, enterprise customers, while having doubled recently, will be less loyal and will switch if competitors offer a cheaper product.[17]
So maybe there’s some very liberal counting of user numbers going on. Valuation goes up. Meanwhile hundreds of OpenAI’s current and ex-employees are cashing out.[18]
Also, competition has caught up and so, Microsoft, which owns half of OpenAI, wants others to invest.[19] Yet, OpenAI CFO just said $11B in revenue is ‘definitely in the realm of possibility’ in 2025 (they’re at ~$4B year-on-year currently) to get $40B from Softbank investment at a ~$300B valuation.[20] More recently this dropped to $30B and they scrambling to find others to co-invest in Stargate.
This is the standard playbook- recent examples include Roblox, which also inflated user numbers,[21] and Coinbase, which used to be lax with their KYC for obvious reasons and had inflated user numbers (it’s also literally a plot point in Succession).
Also cf:
The market expects AI software to create trillions of dollars of value by 2027 AI stocks could crash [1] The Generative AI Con
[2] ChatGPT sets record for fastest-growing user base—analyst note | Reuters
[3] OpenAI says ChatGPT’s weekly users have grown to 200 million | Reuters
[4] OpenAI hits more than 1 million paid business users | Reuters
[5] OpenAI tops 400 million users despite DeepSeek’s emergence
[6] https://archive.ph/wff26
[7] ChatGPT’s mobile users are 85% male, report says | TechCrunch
[8] Apple launches its ChatGPT integration with Siri.
[9] https://trends.google.com/trends/explore?date=today 5-y&q=chatgpt&hl=en
[10] India now OpenAI’s second largest market, Altman says | Reuters
[11] Anyone else have multiple accounts so they don’t have to wait to use gpt 4o and also so each one can have a separate memory : r/ChatGPT
[12] https://incogniton.com/blog/how-to-bypass-chatgpt-limitations
[13] ChatGPT is now available on WhatsApp, calls: How to access—Times of India
[14] OpenAI tests phone number-only ChatGPT signups | TechCrunch
[15] ChatGPT drops its sign-in requirement for search | The Verge
[16] [2502.09747] The Widespread Adoption of Large Language Model-Assisted Writing Across Society
[17] Satya Nadella – Microsoft’s AGI Plan & Quantum Breakthrough.
[18] Hundreds of OpenAI’s current and ex-employees are about to get a huge payday by cashing out up to $10 million each in a private stock sale | Fortune
[19] Microsoft Outsources OpenAI’s Ambitions to SoftBank
[20] OpenAI CFO talks possibility of going public, says Musk bid isn’t a distraction
[21] Roblox: Inflated Key Metrics For Wall Street And A Pedophile Hellscape For Kids – Hindenburg Research
This lag effect might amplify a lot more when big budget movies about SBF/FTX come out.
How to give effectively to US Dems
Yes, good catch, this is based on research from the World Value Survey—I’ve added a citation.
I checked. It’s 0.67.
This seems to come from European countries.
Yeah I actually do cite that piece in the appendix ‘GDP as a proxy for welfare’ where I list more literature like this. So yeah, it’s not a perfect measure but it’s the one we have and ‘all models are wrong but some are useful’ and GDP is quite a powerful predictor of all kinds of outcomes:
In a 2016 paper, Jones and Klenow used measures of consumption, leisure, inequality, and mortality, to create a consumption-equivalent welfare measure that allows comparisons across time for a given country, as well as across countries.[6]
This measure of human welfare suggests that the true level of welfare of some countries differs markedly from the level that might be suggested by their GDP per capita. For example, France’s GDP per capita is around 60% of US GDP per capita.[7] However, France has lower inequality, lower mortality, and more leisure time than the US. Thus, on the Jones and Klenow measure of welfare, France’s welfare per person is 92% of US welfare per person.[8]
Although GDP per capita is distinct from this expanded welfare metric, the correlation between GDP per capita and this expanded welfare metric is very strong at 0.96, though there is substantial variation across countries, and welfare is more dispersed (standard deviation of 1.51 in logs) than is income (standard deviation of 1.27 in logs).[9]
GDP per capita is also very strongly correlated with the Human Development Index, another expanded welfare metric.[10] If measures such as these are accurate, this shows that income per head explains most of the observed cross-national variation in welfare. It is a distinct question whether economic growth explains most of the observed variation across individuals in welfare. It is, however, clear that it explains a substantial fraction of the variation across individuals.
Cf. David Autor has argued that one could have more free trade and immigration under the curve without the populist backlash, if one would have phased it in more slowly, to give domestic workers in the lowest income decile affected more time to adapt to increased competition.