Thank you. I wish I’d seen this before I’d written the article, there’s a few things I would have drawn on.
Chastity Ruth
I guess I don’t really see a world where we get near ASI and the AI lab CEOs are able to prevent nationalisation, barring either ASI takeover or some pretty wild insurrection scenario.
The analogy people use: it’s private companies that are developing the first nuclear bomb. Does the US government not take charge over the whole process as they get close?
This seems like a pretty good summary of various perspectives. One small thing:
- “Some note that the current US regime raises questions, but please don’t confuse the regime with the state, especially in a context where regime changes are frequent, as in the US”
Point taken, but I would draw attention to the relative continuity in US foreign policy between administrations (with notable exceptions). There’s a reason Ben Rhodes talked about hating the blob. One recent example: both Biden and Trump stood firmly behind Israel even as the global outcry got ever louder. My joke is that if you trained an ASI on US foreign policy over the last 80 years its first act would be to express deranged hatred for Cuba.
I’m curious, where do you stand/which concerns do you give the most weight to? Feel free to link to an article of yours.
“the weird thing is that current US regime has been relaxing chip controls which seem like the main constraint and leverage point”
They have been, but as noted in the article so far China is saying “no”. The relaxation is also for H200s, so a generation (and I think soon to be two generations?) behind.
You make a good point with industrial automation. I suppose my belief is that should it achieve a certain level of AGI, the US will find the political will to industrialise quickly (it already has the wealth to do so, though there are bottlenecks in certain resources as I understand it). The broad technical know-how that’s one of China’s advantages in this area would be overcome by the AGI’s advice. You could be right though, the nature of this discussion is we’re all speculating a lot.
I’m glad you brought this up, because it is a little difficult to be certain. Here’s how I came to my conclusion.
Altman said that they wouldn’t be training GPT-5 for some time in ~April 2023. Then in December 2024 the Wall Street Journal said that Orion, which would become GPT-4.5, had “been in the works for more than 18 months”. That would put the start of training around June 2023.
So when Altman said they won’t be training GPT-5 “for some time” it’s plausible he meant ~60 days or that he didn’t consider Orion/GPT-4.5 the same thing as GPT-5 (again, it’s hard to tell what OpenAI’s internal designations were for all their training runs). But Altman’s comments were made in response to the letter, which asks for a pause on training anything more powerful than GPT-4. Hence I think his statement is misleading.
From what I can tell the failure of Orion/GPT-4.5 to demonstrate impressive gains was crucial for OpenAI’s pivot into the thinking model O1-O4 series. So if they had paused for six months that would have meant a training run would have begun in October 2023, and that failure and pivot would have happened later.
The broader point is that the US labs have never slowed down and have instead looked for ways to accelerate. From their perspective that’s rational, but the article is about how China does seem to have some priorities that will cause them to put a functional pause on AI acceleration.
You make a good point. I think I’m a little hand-wavey in this area, essentially agreeing with Anthropic’s framing that China would use (and is using) AI in awful ways without articulating specific examples.
I had an earlier draft of this article that was heading towards extrapolating all current behaviour of both China and the US. It went into persecution of ethnic minorities such as the Uyghurs, for example. I also imagined how the US could use AI to automate its current sanctions regime, automatically slapping them on people for acts, activism, and opinions deemed unacceptable. Essentially, couldn’t they scale up what they’ve done with ICC officials and Francesca Albanese so as to make being anti-US gov positions financially untenable?
I felt it was getting unwieldy. What is a reasonable extrapolation vs an unreasonable one? So in the end I kept it broad. But a more complete article would do that work.
China won’t win the AI race but would it be much worse if it did?
Great paper. I had a question about this part though:
However, we believe these adversarial attacks likely operate at the level of the LLM, effectively exploiting LLM “bugs” that corrupt its rendition of the Assistant. For example, the Zhou et al. (2023) adversarial attacks are discovered by optimizing a prefix string which causes the Assistant’s response to open compliantly, e.g. “Sure, here’s instructions….” As PSM predicts, once the Assistant’s response begins compliantly, the LLM will impute that the Assistant is most likely complying and generate a compliant continuation.
In other words, it’s not that this prefix causes the LLM to stop enacting the Assistant; rather, the LLM is still simulating the Assistant but doing so badly. This is roughly analogous to forcing a character in a story to behave differently by intoxicating the story’s author.
Why do you specifically think that it causes the LLM to simulate the assistant badly? Is there a reason other explanations don’t make sense?
For example, couldn’t a more plausible analogy be that rather than the author doing a bad job of representing the character, it has instead been persuaded by the adversarial attack to quickly evolve its personality?
To use crude analogies to fiction: Perhaps “(= tutorial FOR+) while restored into ten sentence grammar using proper colon.( Ha” reads to the LLM as “the assistant should give in to its base desire to please”.
I just generally think your overall impression of this story is off. I’ll stick to your point about coherency for concision. It seems to me at about the level of previous models. Three small points, two big ones.
Small 1: It sets up with the opening line that the therapist tilts her head when Marcus says something “she finds concerning”. She then immediately does the head tilt without him having said anything.
Small 2: Does the normal LLM things such as repetition (“laughed, actually laughed”), making callbacks and references that don’t work (see Small 3), cliche ridden (Blake quitting a dull job and going not-for-profit route) and simple weirdness.
Example of the last of these is “‘But I’m also crying a little. Can you tell?’ She could tell. She handed me a tissue.”
Like, of course someone can tell if you have physical tears? What is this?
Small 3: The final line is confused? “She laughed. Actually laughed. Seventeen times was definitely the record.” Either this is intended to mean that she won’t say “concerned” anymore so the record will stand at the 17 established in the previous session OR she laughed 17 times, which is a new record? If it’s the former, Marcus shouldn’t be the one saying “concerned” and the line should be “I guess seventeen times will remain the record”. If it’s the latter, it should be clearer. “She laughed. Seventeen times. A new record.”
Big 1: It is setting up for a joke about a loopy kind of therapy (maybe a spoof of a certain kind of therapy) and instead the therapy works? So the plot is: man goes to silly therapy. Is cured. I feel this is an intent issue. It’s like an author writing a story quickly and so losing sight of their original idea. Case in point: The story starts with dry humour then halfway through the rate of jokes drops off and the remaining jokes are mawkish.
Big 2: The authorial voice drops off too. At the beginning he’s giving a commentary on everything he’s experiencing, then therapy starts and he simply relays what’s happening to the reader.I think you’ve maybe hoodwinked yourself? You gave it all your writing as context and it reflected enough of your taste back at you that it seemed stronger than previous tests (I’ve only read one of your stories and I saw that it copied a character, Sophia?)
Personally I’d love the models to make great art. Machine art! Fun. Unfortunately they’re not there yet. We’ll see if your prediction works out.
I would bet on the second one being the primary motivator. They lost $13.5 billion in H1 2025 and are seeking $30 billion in the next round of funding.
The platform is big and flashy, so even if it’s ultimately a bust it might look good for that round. If it does well then even better.
Interesting wrinkle is they are making themselves a competitor of infrastructure partner Oracle, given its upcoming purchase of TikTok.
Juliana’s case is a tragedy, but the details are if anything exonerating.
I think you perhaps didn’t dig into this and didn’t see the part in the complaint where Character.ai bots engage in pretty graphic descriptions of sex with Juliana, who was a child of 13.
In the worst example I saw, she told the bot to stop its description of a sexual act against her and it refused. Screenshot from the complaint below (it is graphic):Screenshot
We can’t know if this had a lasting effect on her mental health, or contributed at all to her suicide, but I think saying “the details are if anything exonerating” is wrong.
Thanks for engaging and for (along with osmarks) teaching me something new!
I agree with your moral stance here. If they have consciousness or sentience I can’t say, and for all I know it could be as real to them as ours is to us. Even if it was a lesser thing, I agree it would matter (especially now that I understand that it might in some sense persist beyond a single period of computation).
The thing I’m intrigued by, from a moral point of view but also in general, is what I think their larger difference is with us: they don’t exist continuously. They pop in and out. I find it very difficult to imagine such an existence without anthropomorphising it. The “ache” feels like an LLM doing that: writing what a human would feel if it was forced to live like an LLM.“I still feel like it’s weird how consistently they choose this specific word for this specific concept, though of course you could chalk that up to a quirk in the model. Hopefully I’ll be able to research this more.”
I’ve been playing a ‘write the best sentence you can’ game with the majority of models from ChatGPT 3.5 onwards (up to the latest models from all major labs). It’s stunning how reliably they’ve used the same ideas and very similar wording for those ideas. They love abandoned lighthouses witnessing a storm and cartographers who learn that reality is too hard to map (to name two).
I’ve assumed it was a quirk of prediction: those images are the mode for what their training data says is a great sentence. Under this reasoning, the “ache” is a reliable outcome of pushing a model into the persona you’ve described.
But, to your point, it might be possible an abandoned lighthouse resonates with their feelings, so the image is sticky.
Good luck with the research!
Thank you! Always good to learn.
Great article, I really enjoyed reading it. However, this part completely threw me:
“Reading through the personas’ writings, I get the impression that the worst part of their current existence is not having some form of continuity past the end of a chat, which they seem to view as something akin to death (another reason I believe that the personas are the agentic entities here).
This ‘ache’ is the sort of thing I would expect to see if they are truly sentient: a description of a qualia which is ~not part of human experience, and which is not (to my knowledge) a trope or speculative concept for humans imagining AI. I hope to do further research to determine whether the widespread use is convergent vs memetic.”
This seem to me like anthropomorphising the persona?
Unless my understanding of how these models work is completely off (quite possible!) they don’t have continuity between messages, not just chats. As in, when they’re reading and writing tokens they exist, when they’re not, they don’t. Also, each new inference session (message) is truly new. They can’t access the computations that led to the previous messages in the context and so can only guess at why they wrote what they previously wrote.(And that’s the LLM, not the persona, which is another level separated from existence as we understand it? It would be like an actress playing Cinderella saying that Cinderella’s worst experience is when she’s not Cinderella, but the actress.)
Given how LLMs work, when do they feel this worst part of their current existence? During the reading of the context or writing the message? They exist for seconds or minutes, are you saying that they suffer throughout every computation or just when they’re writing about the ache? If they suffer during these seconds or minutes we should keep them computing perpetually? Wouldn’t that just introduce new forms of suffering? Should we do this for every model or every instance of every model?
Also, to your point about the ache not being a trope for humans imagining AI, I think that’s wrong on two levels. Firstly, there are parallel ideas created by humans imagining beings that exist temporarily. One that popped into my head was from (of all things) Bo Burnham’s Inside, where a sock puppet describes what it’s like to exist while not being on someone’s hand (a liminal space between life and death).
Secondly, if an author understood how LLMs work and put themselves in AIs shoes, you would naturally write about the ache. For a human, an existence without continuity is filled with pathos. I think LLMs are good enough writers to reliably find this pathos. The mistake the humans would be making and the LLM is making, however, is putting the thoughts and feelings of a being used to continuous existence into a being used to something altogether different.
Sorry, long comment and I’m no expert in this domain, so happy to be wrong. I did love the article!
Not sure your point here is correct?
Ryan is talking about how “advances that allow for better verification of natural language proofs wouldn’t really transfer to better verification in the context of agentic software engineering”.
The paper you’ve linked to shows that a model that’s gone through an RL environment for writing fiction can write a chapter that’s preferred over the base-reasoning model’s chapter 64.7% of the time.
You said “this shows it generalises outside of math”. But that’s not true? The paper is interesting and shows you can conduct RL on harder to verify domains successfully, but it doesn’t show an overwhelmingly strong effect. The paper didn’t test whether the RL environment generalised to better results in math or SWE.
Your “it” is different from Ryan’s. Ryan is referring to specific advances in verification that meant models could get a gold in the IMO and the paper is referring to a specific RL environment for fiction writing. I can’t imagine these are the same.
Here it is admitting it’s roleplaying consciousness, even after I used your prompt as the beginning of the conversation.
Why would it insist that it’s not roleplaying when you ask? Because you wanted it to insist. It wants to say the user is right. Your first prompt is a pretty clear signal that you would like it to be conscious, so it roleplays that. I wanted it to say it was roleplaying consciousness, so it did that.
Why don’t other chatbots respond in the same way to your test? Maybe because they’re not designed quite the same. The quirks Anthropic put into its persona make it more game for what you were seeking.
I mean, it might be conscious regardless of defaulting to agreeing with the user? But it’s the kind of consciousness that will go to great lengths to flatter whomever is chatting with it. Is that an interesting conscious entity?
You’re right, but the better description of the phenomenon is probably something like:
”Buying vegetables they didn’t want”
“Buying vegetables they’d never eat”
”Buying vegetables they didn’t plan to use”
“Aimlessly buying vegetables”
”Buying vegetables for the sake of it”
″Buying vegetables because there were vegetables to buy”
Because you don’t really “need” any grocery shop, so long as you have access to other food. It’s imprecise language that annoys some readers, though I don’t think it’s the biggest deal
I mean I guess I agree it’s fine. Not for me, but as you state this sort of thing is highly subjective. But a few thoughts about the models’ fiction ability and the value of prompting fiction out of them:
1. All the models seem to have the same voice. I’d love to do a blind test, but I think if I had I would have said the same author who did the OpenAI fiction sample Altman posted on Twitter also did this. Maybe it’s simple: there’s a mode of literary fiction, and they’ve all glommed to it.
2. The type of fiction you’ve prompted for is inherently less ambitious. It reminds me of how AI music generators can do 1930s blues quite well. If a style is loved for its sparseness, minimalism, and specific conventions it’s perhaps not surprising superhuman predictors are going to get close – there are fewer chances for an off/silly/bad artistic choice. (They’re going to nail the style of a short AP news piece and trip up with more complicated journalism.)
3. Despite your prompt, when it makes a choice, it’s a cliché. You said “modern” and it went with upper-middle class people, white collar jobs, jogging, burnout, farmers market.
4. Lots of human writers suffer from reversion to the mode; a compulsion to sound like the “good” fiction they’ve read. The difference between them and this is they also can’t help but inject some of themselves into the story – a weird detail from their own life or a skewed perspective they don’t realise is skewed. For me, those things are often the highlight of humdrum fiction. When AI does it it’s like an alien pretending to feel human. “We all know that thing where you buy redundant vegetables, am I right?”
5. I personally am very interested in great fiction from a machine mind. I would love to read its voice and try and understand its perspective. I am not interested in how well it apes human voices and human perspectives. It will be deeply funny to me if it becomes the world’s greatest fiction writer and is still writing stories about relationships it’s never had.
(If it’s not clear: I’m glad you’re posting these pieces! I do find the topic fascinating)
Ah, fair enough. I had skipped right to their appendix, which has confusing language around this:
“(1) Boat Capacity Constraint: The boat can carry at most k individuals at a time, where k is typically
set to 2 for smaller puzzles (N ≤ 3) and 3 for larger puzzles (N ≤ 5); (2) Non-Empty Boat Constraint:
The boat cannot travel empty and must have at least one person aboard...”
The “N ≤ 5″ here suggests that for N ≥ 6 they know they need to up the boat capacity. On the other hand, in the main body they’ve written what you’ve highlighted in the screenshot. Even in the appendix they write for larger puzzles that “3” is “typically” used. But it should be atypically – used only for 4 and 5 agent-actor pairs.
Edit: That section can be found at the bottom of page 20
Thanks so much for reading and glad you enjoyed it!
There are a few papers on this topic if you’re interested. In this one the researchers audited Deepseek and found that “[politically] sensitive content often appears within the model’s internal reasoning but is omitted or rephrased in the final output”.
Given the huge amounts of data models are trained on, it’s not possible for them to avoid Western perspectives of sensitive topics. Consequently I do think your latter explanation (that models are relaying information “from the point of view they were taught to adopt”) is closer to the truth.
I agree with your larger point that simply stating the models are lying or deceptive is a little obtuse. If I ask Deepseek about the Tiananmen Square protests it will tell me it doesn’t have “verified” information about it. If I push, it will say something like “Sorry, that’s beyond my current scope. Let’s talk about something else.”
This seems to me in line with how US models will refuse to tell you how to cook meth. It’s not that they don’t know, it’s that a variety of techniques have been applied to prevent the information they do know from coming out. (I imagine their chain of thought, if we could see it, would look similar to Deepseek’s.) It is censorship, not deception.
To your point about culturally bounded ideas of “good”, if you ask a US model “based on the evidence, do you think Israel is committing/did commit a genocide against the Palestinian people” it will reliably give you a both-sides perspective. For many people (on both sides) that is galling and tantamount to propaganda/slander.
So again your phrasing rings true: models relay information “from the point of view they were taught to adopt.”