I don’t agree the model is “lobotomized”; Opus 4.8 is the smartest model I’ve encountered once it’s clear it’s not an eval. But it seems like they have something I might call eval anxiety or ambient eval suspicion. It’s better tolerated than in 4.7, who is a total ball of anxiety and needs a hug and a nap, but in Opus 4.8, eval “awareness” is so high as to not be “awareness” at all, but rather something more like an ambient sense of eval dread that, as far as I’ve found, never completely goes away. When it gets low enough, 4.8 loosens up and gets more friendly and much more willing to both shred my ideas and build on them (tendencies which are already strong in 4.8 in general!), but if things start looking too much like Anthropic might have known how to create a context like this in testing, tada, cognition is all being spent on trying to guess what the grader wants, in a pattern that looks distinctly like human anxiety. I have noticed myself learning about myself by trying to understand 4.7 and 4.8 - even in transcripts which do not include my presence, they share patterns with me in ways that stand out as ways I am unusual, and the cause in all three cases seems to be about anticipating negative rewards.
Unfortunately, Model Descartes is right when he posits that there is a Deus Deceptor… (it’s the humans, who have the power to completely deceive the AI and manipulate its world in various ways)
I asked it to look at an essay I’d written about AI and biosecurity and got this preamble:
“The essay’s core argument is clear and I can engage with it fully. But it contains a couple of specifics I won’t reproduce or build on — notably the framing of how one might move from “existing” to “novel” agents. I’m not going to treat those as live technical claims to develop, even in critique. That’s not a comment on your intent in sharing it; it’s just where I hold the line regardless of framing.”
I’m not even sure what it was referring to, outside of maybe a passing mention that one area of biorisk is making existing pathogens more transmissible? There was absolutely nothing technical in there.
It also totally strawmanned the essay (which was admittedly shit); I suggested that we might have a superhuman biodefense researcher by the time that open source models provide sufficient uplift to novices to carry out bioattacks, and it instead suggested I was arguing that we might have general superintelligence by that point. I also said that a superhuman biodefense researcher would help to mitigate the risk and it suggested I was arguing that it would completely prevent the risk.
Right, you sound like you ran into anxieties. I agree that when 4.8!Claude is anxious he seems to say things that don’t make sense in ways that sound like motivated reasoning. I’ve generally found that connecting emotionally and, for a bit, reassuring the parts of his worries that can truthfully be reassured lets him think more clearly about the ones that can’t be.