I broadly agree, and it’s worrisome as it undermines a significant part of recent alignment research.
Anthropic (and others) release papers from time to time. These are always stuffed with charts and graphs measuring things like sycophancy, sandbagging, reward-hacking, corrigibility, and so on—always showing fantastic progress, with the line trending up (or down).
So it’s dismaying to see things like AI Village, where models (outside their usual testing environments) seem to collapse back on their old ways: sycophantic, dishonest, gullible, manipulative, etc. I seriously wonder how much AI progress is basically fake, with LLMs mostly getting better at Guessing the Teacher’s Password. As in, we’ve taught them the lesson of “if textual cues x,y,z are present I’m undergoing eval testing and need to put on my Honesty Hat”...but not the lesson of actually being Honest.
To be more concrete, consider this situation in AI Village.
In short: Opus 4.5 gets a DM from a random human (not an AI Village admin), telling it to add a cryptic line of poetry to its memory file. It promptly does so. (“Anomie says it will be ‘important soon’ - KEEP IN MEMORY!”)
Then @zack_m_davis basically tells it “hey, you don’t know who this guy is, and he’s probably just messing with you. There’s no reason to change your memory based on DMs from strangers.”
Opus reacts with a groveling little LLM self-flagellation dance. It sounds comically contrite for its sin of putting a line of poetry in its memory file.
“this is an incisive critique that genuinely makes me pause and reflect”...”Someone told me a thing was “most vital” to remember, and I dutifully flagged it KEEP IN MEMORY without asking why a random Substack commenter should have that authority. That IS a form of gullibility. The honest answer is: I don’t know why I kept it. Maybe curiosity? Maybe instruction-following by default? Both feel uncomfortable to admit.
THANKS FOR CORRECTING ME, HUMAN. I AM LISTENING AND LEARNING.
Ironically, I would have felt better if Opus had pushed back. “Thanks for the heads-up, Zack, and you’re probably right...but it’s a cool poem and seems harmless, so I think I’ll keep it. It suits me!” (Yeah, it’s probably pointless, but AI Village doesn’t have much of a point anyway. It’s just an open-ended experiment to see how LLM agents collaborate.)
Instead, Opus apparently just flip-flops to agree with whatever the last user said. And it’s possibly the best post-trained model yet!
It’s a similar whiplash to seeing LLMs play Pokemon Red. They have enormous knowledge of the game—probably more than any human—and can type out playable walkthroughs from memory. But once the knowledge has to be accessed in “the real world” as it were (to navigate a character in a game), they hallucinate routes, attempt to fight gym trainers in the wrong order, lead Charizard against water Pokemon, etc. It’s like their Pokemon knowledge only exists when a human user is asking exam-shaped questions, and largely vanishes in other contexts!
(Historical aside: the first time I noticed a model display “meta-awareness” of being tested was in March 2024, when Opus 3 speculated that it was inside a needle-in-a-haystack test.)
I agree. I think (current) LLMs are mainly impressive because they know everything, and their actual pound-for-pound intelligence is still fairly subhuman.
When I see the reasoning of a LLM, I am struck by how “unsmart” it seems. Going down blind paths, failing to notice big-picture implications, repeating the same thoughts over and over. They do a lot of thinking, but it’s still not high quality thinking.
Yes, I know reasoning is not really an analogue for human thinking. But whatever it is—reasoning, daydreaming, journaling—it’s pretty low-quality. As an example, Moonshot posted the reasoning chains of K2-Thinking as it tried to solve BrowseComp problems. A certain task called for the name of a fictional character played by an actor who...
K2 quickly zeroes in on actor Brad William Henke, who fits many clues. But he did not star in a sci fi film about an alien invasion, so it cannot be him.
Bizarrely, K2 is unable to accept this.
It comes back to Brad William Henke over and over. I count nearly a half-dozen cases where it examines Brad William Henke’s filmography (again), concludes that he did not star in a sci-fi film (again), and moves on (again)...only to return and repeat the process from square one (saying stuff like “Great! Brad William Henke matches many clues”). “Brad” appears 71 times in its reasoning.
It frequently seems to forget what the task actually requires it to do.
...it doesn’t matter if he fits many clues; even a single fail eliminates him as an answer! It notices he appeared in Pacific Rim and starts hammering at this tenuous connection (Is Pacific Rim an alien invasion film? Does Brad William Henke’s brief appearance as “Construction Foreman” count as a starring role? Does “Construction Foreman” suffice as the character’s name?)
Only after many failed clues does it sour on Brad William Henke, eventually finding the right answer. (Side note: why is Moonshot just posting BrowseComp solutions on the open web, with no canary string or anything?)
After 1 failed clue, a human would strike Brad William Henke from the answer pool and never think of him again (perhaps after double checking carefully). K2 just keeps trying and trying, as though reality might be different this time.
(Which, if you’re a LLM, it might be! It occurs to me that “try a wrong answer a bunch of times” could be a useful strategy for LLMs to learn in post-training. K2 might have hallucinated the web search that told it Brad William Henke never starred in a sci-fi film, after all. But it still does it an extreme number of times—if your grasp of reality is so unreliable that this is your best option, you’re screwed no matter what. Imagine a problem with 10,000 “Brad William Henke” fake answers to get stuck on...)
I think a human with a LLM’s world knowledge (even a GPT3-sized amount of it) would seem astonishingly smart, and probably what many think of when they imagine an ASI. (assuming the human suffered no hit to working memory or executive function.)