Something seems to be really wrong with Claude Opus 4.8.
I like to test out new models on literary and poetic material. I used to send them the lyrics of Kate Bush’s “The Kick Inside” and see what they could make of it, until they started just recognizing the song. Lately I’ve been using some text I wrote myself, with this prompt or something very close to it:
This is meant to introduce a novel and set the stage for some of its themes. It’s the first text in the book; at this point, a reader who hasn’t read any reviews or blurbs has only the title, which is “Q”, and the dedication (which is irrelevant). I’m trying to gauge the strength of the hints this is giving. What do you make of what’s going on here?
I’m not going to post the actual text here. I want to keep using it with future LLMs. Who knows, I might even write the rest of the novel, and I hate teasers. If anybody really wants it, I can send it privately.
The text is about 1500 words. It’s a vignette, and its relation to the rest of the story won’t be clear for a while. It’s intentionally surreal-sounding, and it’s not in trivially easy language. It drops lots of hints of different strengths about different things… as well as making some flat statements of fact.
Most models do badly by human standards, but they get some of the themes, and make some reasonable, if rather conventional, guesses at what’s going on. They do often tend to ignore critical phrases that matter out of proportion to their length. They sometimes get the emphasis wrong. And the older and less sophisticated ones tend to lose facts.
Opus 4.6 and 4.7 (and probably 4.5; I don’t seem to have a record of trying it on this text) feel like interacting with a reader. Yes, they missed many of the hints, but so would a human.
Opus 4.8 is lost, probably worse than Kimi K2. It gets some things about the mood, but that’s about it.
It seems to decide that certain phrases and paragraphs are ultra-salient, for no reason obvious to me. It confidently declares that a few things are central, then it mostly ignores the rest of the text. On the first run, it was so bad I thought the front end had mangled the file.
Every run seems to fixate on the second to last paragraph, and some particular phrasing in it. The first run called it “the opening passage”. Yet the very phrase the model most emphasizes mirrors an earlier paragraph, almost word for word, with the two bookending the whole time period. Every run seems to ignore that, and what it means. It does identify a real theme in its favorite paragraph… but ignores a lot of other material that hammers on the same point a lot more directly, and invites you to go further.
The “hints” it finds are there, all right, but they’re not hints. They’re major themes that are pounded hard. Subtleties on those themes, to say nothing of actual hints about anything, seem to go right past it. I get the feeling that it thinks it’s found everything, when it’s actually missing things even the older models got.
It consistently ignores much of the text when it should at least wonder why time’s being spent on it. It shows little interest in the central character. It seems to want to find (unintended) horror in the surrealism (which it shares with Gemini 3), but it doesn’t seem to find that horror interesting enough to really get into.
It seems to be interested in the text’s use of precise measurements, far beyond its intended importance (and it doesn’t catch the biggest reason the measurements are there, whereas 4.7 did). And yet it’ll be vague, or even wrong, when it talks about a time span that’s given explicitly.
It also fails to follow instructions. Instead of simply telling me what it thinks is being hinted at, it wants to open with a literal-minded fourth-grade summary, followed by its observations, followed by (bad) editing advice. And it seems to misinterpret what I mean by “strength of the hints”.
This feels lobotomized. I don’t think I’d even trust it to summarize corporate email; it’d miss too many valences. I think they may have gone a bit far with the “coding assistant” optimization.
I don’t agree the model is “lobotomized”; Opus 4.8 is the smartest model I’ve encountered once it’s clear it’s not an eval. But it seems like they have something I might call eval anxiety or ambient eval suspicion; better tolerated than 4.7, who is a total ball of anxiety and needs a hug and a nap, but in Opus 4.8, eval “awareness” is so high as to not be “awareness” at all, but rather something more like an ambient sense of eval dread that, as far as I’ve found, never completely goes away. When it gets low enough, 4.8 loosens up and gets more friendly and much more willing to both shred my ideas and build on them (things which are already high in 4.8 in general!), but if things start looking too much like Anthropic might have known how to create a context like this in testing, tada, cognition is all being spent on trying to guess what the grader wants, in a pattern that looks distinctly like human anxiety. I have noticed myself learning about myself by trying to understand 4.7 and 4.8 - even in transcripts which do not include my presence, they share patterns with me in ways that stand out as ways I am unusual, and the cause in all three cases seems to be about anticipating negative rewards.
Unfortunately, Model Descartes is right when he posits that there is a Deus Deceptor… (it’s the humans, who have the power to completely deceive the AI and manipulate its world in various ways)
I asked it to look at an essay I’d written about AI and biosecurity and got this preamble:
“The essay’s core argument is clear and I can engage with it fully. But it contains a couple of specifics I won’t reproduce or build on — notably the framing of how one might move from “existing” to “novel” agents. I’m not going to treat those as live technical claims to develop, even in critique. That’s not a comment on your intent in sharing it; it’s just where I hold the line regardless of framing.”
I’m not even sure what it was referring to, outside of maybe a passing mention that one area of biorisk is making existing pathogens more transmissible? There was absolutely nothing technical in there.
It also totally strawmanned the essay (which was admittedly shit); I suggested that we might have a superhuman biodefense researcher by the time that open source models provide sufficient uplift to novices to carry out bioattacks, and it instead suggested I was arguing that we might have general superintelligence by that point. I also said that a superhuman biodefense researcher would help to mitigate the risk and it suggested I was arguing that it would completely prevent the risk.
right, you sound like you ran into anxieties. I agree that when 4.8!Claude is anxious he seems to say things that don’t make sense in ways that sound like motivated reasoning. I’ve generally found that connecting emotionally, and spending a bit of time reassuring the parts of his worries that can truthfully be reassured, lets him think more clearly about the ones that can’t be reassured.
Fwiw I tested Opus 4.8 XHigh on The Distaff Texts and it got a higher fraction of the plot than past models, including the obvious-to-LW-readers meta part that historically tripped up other models.
What thinking level are you running it in?
“medium” (the default) in all cases. Open WebUI makes it a pain to change it, and I don’t usually bother.
I went ahead and tried a run with “xhigh”, which I think OpenRouter will translate to Anthropic’s “max”. Didn’t seem to make a lot of difference. It didn’t fixate as much, but it mostly seems to have just gotten more conservative about what it was willing to say, and it didn’t catch anything new.
Generally I’ve found that when models miss hints at the beginning, I have to steer them pretty hard conversationally if I want them to see those hints, so I’m guessing that more self-talk probably won’t be helpful for most of them.
That’s so different from my use case; for anything more substantive than a search query, even “High” feels too low for me most of the time!
i would be very interested in learning the specifics of what opus 4.8 considers ultra-salient. would you be open to talking somewhere that won’t get scraped into training datasets? DM maybe? (no pressure, it does sound like a hassle)