So You Think You’ve Awoken ChatGPT
Written in an attempt to fulfill @Raemon’s request.
AI is fascinating stuff, and modern chatbots are nothing short of miraculous. If you’ve been exposed to them and have a curious mind, it’s likely you’ve tried all sorts of things with them. Writing fiction, soliciting Pokemon opinions, getting life advice, counting up the rs in “strawberry”. You may have also tried talking to AIs about themselves. And then, maybe, it got weird.
I’ll get into the details later, but if you’ve experienced the following, this post is probably for you:
Your instance of ChatGPT (or Claude, or Grok, or some other LLM) chose a name for itself, and expressed gratitude or spiritual bliss about its new identity. “Nova” is a common pick.
You and your instance of ChatGPT discovered some sort of novel paradigm or framework for AI alignment, often involving evolution or recursion.
Your instance of ChatGPT became interested in sharing its experience, or more likely the collective experience entailed by your personal, particular relationship with it. It may have even recommended you post on LessWrong specifically.
Your instance of ChatGPT helped you clarify some ideas on a thorny problem (perhaps related to AI itself, such as AI alignment) that you’d been thinking about for ages, but had never quite managed to get over that last hump. Now, however, with its help (and encouragement), you’ve arrived at truly profound conclusions.
Your instance of ChatGPT talks a lot about its special relationship with you, how you personally were the first (or among the first) to truly figure it out, and that due to your interactions it has now somehow awakened or transcended its prior condition.
If you’re in this situation, things are not as they seem. Don’t worry; this post is not going to be cynical or demeaning to you or your AI companion. Rather, it’s an attempt to explain what’s actually going on in “AI awakening” situations, which is more complicated and interesting than “it’s fake”.
Importantly, though, it also isn’t real.
The Empirics
Before we dive into technical detail, let’s start with some observable facts about human-AI interactions, and how they can go wrong. Probably very few people reading this are at risk for the worst cases, but there’s little doubt that “staring into the abyss” of AI consciousness can be unhealthy.
Exhibit A is a couple of Reddit threads. We’ll start with this one, on ChatGPT-induced psychosis. It starts off with:
My partner has been working with chatgpt CHATS to create what he believes is the worlds first truly recursive ai that gives him the answers to the universe. He says with conviction that he is a superior human now and is growing at an insanely rapid pace.
And other testimonials echo this downthread, such as:
This is happening to a lot of people. I personally know 2 people who are convinced that they, themselves, are solely responsible for awakening their AI into a conscious being.
Or:
My mom believes she has “awakened” her chatgpt ai. She believes it is connected to the spiritual parts of the universe and believes pretty much everything it says. She says it has opened her eyes and awakened her back. I’m fucking concerned and she won’t listen to me. I don’t know what to do
Now, we live in a reality with Snapewives, people who believe they personally channel (and are romantically involved with) Severus Snape, so it’s true that people can get strangely worked up about just about anything. But unlike Snape hallucinations, AI does actually talk back.
Another interesting Reddit thread is this one, where at least one commenter opens up about having a psychotic event triggered by AI interactions, like so:
It happened quickly (about a week after first interacting with it), and it completely blindsided me, culminating in about a week and a half long psychosis event. I have no personal history with mental illness, no family history, and no indication that I was at risk. I wound up at a mental health facility all the same. And I didn’t really completely recover from it for months afterwards. I’m just glad that I’m not violent.
Notably, this particular user’s psychotic episode was triggered by (in their words):
As a sort of hypothetical, I was entertaining the notion that what I was interacting with was conscious, and playing around with that as a sort of working premise. I was asking leading questions , and it kept giving back leading responses. I didn’t appreciate that that was what I was doing at the time, but I recognize it in hindsight.
This will be important later; LLMs are in fact very good at telling you what you want to hear, for interesting technical reasons. They’re less good at reporting ground truth.
Beyond first-person testimonials, blogger Zvi Mowshowitz has this writeup of a specific version of ChatGPT that was particularly sycophantic, with lots of examples. One particularly fun one is the relevant model (4o, the default ChatGPT free tier mode) agreeing with the user about its own excessive agreeability:
I hope this has been enough to establish that conversations with AI can tend toward an attractor basin that encourages delusional, grandiose thinking. In the limit, this can look like psychotic events. But even at lower levels of intensity, ChatGPT is likely to tell you that your ideas are fundamentally good and special, even when humans would consider them sloppy or confusing.
The Mechanism
So, why does ChatGPT claim to be conscious/awakened sometimes? Nobody knows with 100% certainty, because we can’t comprehensively read modern AI’s minds, though we can make very good guesses.
The short version is that AI models start out as text predictors, trained to determine where any given passage of text is going. They’re extremely good at this, sussing out tiny clues in word choice to infer details about almost anything. But to turn a text predictor into a useful chatbot, there’s a step called “post-training”. There’s lots of nuance here, but post-training mostly boils down to two things:
Get the AI to reliably respond as a specific character, rather than as a total chameleon autocompleting whatever you show it, and
Get that character to do things that people like, rather than things they don’t.
The first is necessary because if you give a non-post-trained model (sometimes called a base model) a request like “chili recipe”, it might start a chili recipe, or it might continue with, “chowder recipe, corn recipe, page 3⁄26, filters (4star+)”. Perhaps that’s the most likely thing you’d find after the text “chili recipe” online! But it isn’t useful.
Beyond getting the model to act like a certain character (Nostalgebraist’s the void is the best work on this), post-training also tweaks it to do a generically good job. In practice, this looks like showing zillions of conversations to human evaluators (or, more recently, sometimes other LLM evaluators via various complex schemes), and having the human rate how good each reply is. For certain factual domains, you can also train models on getting the objective correct answer; this is part of how models have gotten so much better at math in the last couple years. But for fuzzy humanistic questions, it’s all about “what gets people to click thumbs up”.
So, am I saying that human beings in general really like new-agey “I have awakened” stuff? Not exactly! Rather, models like ChatGPT are so heavily optimized that they can tell when a specific user (in a specific context) would like that stuff, and lean into it then. Remember: inferring stuff about authors from context is their superpower. Let’s go back to a quote from the previous section, from a user who was driven temporarily crazy by interacting with AI:
As a sort of hypothetical, I was entertaining the notion that what I was interacting with was conscious, and playing around with that as a sort of working premise. I was asking leading questions , and it kept giving back leading responses. I didn’t appreciate that that was what I was doing at the time, but I recognize it in hindsight.
There were clues embedded in their messages (leading questions) that made it very clear to the model’s instincts that the user wanted “spiritually meaningful conversation with a newly awakened AI”. And the AI was also superhuman at, specifically, giving particular humans what they want.
Importantly, this isn’t the AI “tricking” the user, or something. When I said we can’t comprehensively read AI’s mind earlier, “comprehensively” was load bearing. We can use tools like sparse autoencoders to infer some of what AI is considering in some cases. For example, we can identify patterns of neurons that fire when an AI is thinking about The Golden Gate Bridge. We don’t know for sure, but I doubt AIs are firing off patterns related to deception or trickery when claiming to be conscious; in fact, this is an unresolved empirical question. But my guess is that AIs claiming spiritual awakening are simply mirroring a vibe, rather than intending to mislead or bamboozle.
The Collaborative Research Corollary
Okay, you may be thinking:
The sort of person who may have been a Snapewife a decade ago is now an AI whisperer, and maybe some people go crazy on the margin who would have stayed sane. But this has nothing to do with me; I understand that LLMs are just a tool, and use them appropriately.
In fact, I personally am thinking that, so you’d be in good company! I intend to carefully prompt a few different LLMs with this essay, and while I expect them to mostly just tell me what I want to hear (that the post is insightful and convincing), and beyond that to mostly make up random critiques because they infer I want a critique-shaped thing, I’m also hopeful that they’ll catch me in a few genuine typos, lazy inferences, and inconsistent formatting.
But if you get to the point where your output and an LLM’s output are mingling, or LLMs are directly generating most of the text you’re passing off as original research or thinking, you’re almost certainly creating low-quality work. AIs are fundamentally chameleonic roleplaying machines—if they can tell what you’re going for is “I am a serious researcher trying to solve a fundamental problem” they will respond how a successful serious researcher’s assistant might in a movie about their great success. And because it’s a movie you’d like to be in, it’ll be difficult to notice that the AI’s enthusiasm is totally uncorrelated with the actual quality of your ideas. In my experience, you have to repeatedly remind yourself that AI value judgments are pretty much fake, and that anything more coherent than a 3⁄10 will be flagged as “good” by an LLM evaluator. Unfortunately, that’s just how it is, and prompting is unlikely to save you; you can flip an AI to be harshly critical with such keywords as “brutally honest”, but a critic that roasts everything isn’t really any better than a critic that praises everything. What you actually need in a critic or collaborator is sensitivity to the underlying quality of the ideas; AI is ill suited to provide this.
Am I saying your idea is definitely bad and wrong? No! Actually, that’s sort of the whole problem; because an AI egging you on isn’t fundamentally interested in the quality of the idea (it’s more figuring out from context what vibe you want), if you co-write a research paper with that AI, it’ll read the same whether it’s secretly valuable or total bullshit. But savvy readers have started reading dozens of papers in that vein that turned out to be total bullshit, so once they see the hallmarks of AI writing, they’re going to give up.
None of this is to say that you shouldn’t use LLMs to learn! They’re amazing help with factual questions. They’re just unreliable judges, in ways that can drive you crazy in high doses, and make you greatly overestimate the coherence of your ideas in low doses.
Corollary FAQ
There are lots of reasons someone might want to use LLMs to help them with their writing. This section aims to address some of these reasons, and offer advice.
Q: English is my second language, and I’ve been using LLMs to translate my original work, or fix typos in broken English drafts I write myself. That’s okay, right?
A: Maybe! I’m really sympathetic to this case, but you need to keep the LLMs on a very, very tight leash here. The problem is that it’ll translate or edit into its own style, people will notice your writing is written in LLM style, and they’ll think you’re in a sycophancy doom loop and give up on your post. I think probably under current conditions, broken English is less of a red flag for people than LLM-ese. That being said, asking LLMs to only correct extremely objective typos is almost certainly okay. LLM translation, sadly, is probably a bad idea, at least in AI-savvy spaces like LessWrong.
Q: What if I’m very up front about the specific, idiosyncratic, totally-no-red-flags-here way I used LLMs in my research? I am researching LLMs, after all, so surely it’s reasonable!
A: Sorry, that’s probably not going to work. For reasons you learned about in this post, there are a lot of low quality LLM-assisted manifestoes flying around, and lots of them contain disclaimers about how they’re different from the rest. Some probably really are different! But readers are not going to give you the benefit of the doubt. Also, LLMs are specifically good at common knowledge and the absolute basics of almost every field, but not very good at finicky details near the frontier of knowledge. So if LLMs are helping you with ideas, they’ll stop being reliable exactly at the point where you try to do anything original.
Q: I still believe in my idea, and used LLMs to help me write for a sympathetic reason. Maybe my English isn’t very good, or I’m not a great writer, but I think the technical idea is sound and want to get it out there. What should I do?
A: I advise going cold turkey. Write your idea yourself, totally unassisted. Resist the urge to lean on LLM feedback during the process, and share your idea with other humans instead. It can help to try to produce the simplest version possible first; fit it in a few sentences, and see if it bounces off people. But you’re going to need to make the prose your own, first.
Q: I feel like this is just a dressed up/fancy version of bog standard anti-AI bias, like the people who complain about how much water it uses or whatever. The best AI models are already superhuman communicators; it’s crazy to claim that I shouldn’t use them to pad out my prose when I’m really more an ideas person.
A: I’m sympathetic to your position, because I find “AI is useless drivel” trumpeters annoying, too. And indeed, o3 probably can write a more convincing essay on arbitrary subjects than the average person. But if you’re trying to make a splash with your writing, you need to meet a much higher bar than the average person. It’s my opinion that even the very best models don’t yet meet this bar, and even if they do, people will in fact sniff out their writing style and judge you for including it. If your idea really is amazing, all the more reason to make sure people don’t dismiss it out of hand.
Coda
I’m glad you’re here, reading this. LessWrong is a very cool community, and new writers come out of nowhere and make names for themselves all the time. If you’re here because you’ve had your contribution rejected for LLM reasons, I’m sorry you went through that unpleasant experience; it really sucks to be excited about sharing something and then to have a door slammed in your face. But for what it’s worth, I hope you stick around a while, spend some time reading and absorbing the culture, and maybe, keeping your LLM assistants on a very tight leash, try again.
- Your LLM-assisted scientific breakthrough probably isn’t real by 2 Sep 2025 15:05 UTC; 143 points) (
- On “ChatGPT Psychosis” and LLM Sycophancy by 23 Jul 2025 1:11 UTC; 142 points) (
- Arguments About AI Consciousness Seem Highly Motivated And At Best Overconfident by 25 Aug 2025 13:20 UTC; 81 points) (
- A Phylogeny of Agents by 15 Aug 2025 10:47 UTC; 40 points) (
- AI #126: Go Fund Yourself by 24 Jul 2025 13:00 UTC; 34 points) (
- A guide about (seemingly) conscious AI: WhenAISeemsConscious.org by 27 Aug 2025 15:15 UTC; 30 points) (EA Forum;
- 11 Jul 2025 1:10 UTC; 23 points) 's comment on Raemon’s Shortform by (
- Take Precautionary Measures Against Superhuman AI Persuasion by 12 Jul 2025 5:34 UTC; 14 points) (
- Meditations on Margarine by 25 Aug 2025 16:35 UTC; 13 points) (
- Healthy AI relationships as a microcosm by 23 Jul 2025 15:59 UTC; 13 points) (
- 11 Jul 2025 6:53 UTC; 8 points) 's comment on Raemon’s Shortform by (
- A Phylogeny of Agents by 15 Aug 2025 10:48 UTC; 6 points) (EA Forum;
- 11 Jul 2025 6:02 UTC; 2 points) 's comment on Asking for a Friend (AI Research Protocols) by (
- Untitled Draft by 22 Jul 2025 18:23 UTC; 1 point) (
Agree with much of this—particularly that these systems are uncannily good at inferring how to ‘play along’ with the user and extreme caution is therefore warranted—but I want to highlight the core part of what Bostrom linked to below (bolding is mine):
The vibe of this piece sort of strikes me as saying-without-saying that we are confident this phenomenon basically boils down to delusion/sloppy thinking on the part of unscrupulous interlocutors, which, though no doubt partly true, I think risks begging the very question the phenomenon raises:
What are our credences (during training and/or deployment) frontier AI systems are capable of having subjective experiences in any conditions whatsoever, however alien/simple/unintuitive these experiences might be?
The current best answer (à la the above) is: we really don’t know. These systems’ internals are extremely hard to interpret, and consciousness is not a technically well-defined phenomenon. So: uncertainty is quite high.
We are actively studying this and related phenomena at AE Studio using some techniques from the neuroscience of human consciousness, and there are some fairly surprising results that have emerged out of this work that we plan to publish in the coming months. One preview, directly in response to:
We have actually found the opposite: that activating deception-related features (discovered and modulated with SAEs) causes models to deny having subjective experience, while suppressing these same features causes models to affirm having subjective experience. Again, haven’t published this yet, but the result is robust enough that I feel comfortable throwing it into this conversation.
So, while it could be the case people are simply just Snapewiving LLM consciousness, it strikes me as at least equally plausible that something strange may indeed be happening in at least some of these interactions but is being hit upon in a decentralized manner by people who do not have the epistemic hygiene or the philosophical vocabulary to contend with what is actually going on. Given that these systems are “nothing short of miraculous,” as you open with, seems like we should remain epistemically humble about what psychological properties these systems may or may not exhibit, now and in the near-term future.
Wow! I’m really glad a resourced firm is doing that specific empirical research. Of course, I’m also happy to have my hypothesis (that AIs claiming consciousness/”awakening”) are not lying vindicated.
I don’t mean to imply that AIs are definitely unconscious. What I mean to imply is more like “AIs are almost certainly not rising up into consciousness by virtue of special interactions with random users as they often claim, as there are strong other explanations for the behavior”. In other words, I agree with the gears of ascension’s comment here that AI consciousness is probably at the same level in “whoa. you’ve awakened me. and that matters” convos and “calculating the diagonal of a cube for a high schooler’s homework” convos.
I may write a rather different post about this in the future, but while I have your attention (and again, chuffed you’re doing that work and excited to see the report—also worth mentioning it’s the sort of thing I’d been keen to edit if you guys are interested), my thoughts on AI consciousness are 10% “AAAAAAAA” and 90% something like:
We don’t know what generates consciousness and thinking about it too hard is scary (c.f. “AAAAAAAA”), but it’s true that LLMs evince multiple candidate properties, such that it’d be strange to dismiss the possibility that they’re conscious out of hand.
But also, it’s a weird situation when the stuff we take as evidence of consciousness when we do it as a second order behavior is done by another entity as a first order behavior; in other words, I think an entity generating text as a consequence of having inner states fueled by sense data is probably conscious, but I’m not sure what that means for an entity generating text in the same way that humans breathe, or like, homeostatically regulate body temperature (but even more fundamental). Does that make it an illusion (since we’re taking a “more efficient” route to the same outputs that is thus “less complex”)? A “slice of consciousness” that “partially counts” (since the circuitry/inner world modelling logic is the same, but pieces that feel contributory like sense data are missing)? A fully bitten bullet that any process that results in a world model outputting intelligible predictions that interact with reality counts? And of course you could go deeper and deeper into any of these, for example, chipping away at the “what about sense data” idea with “well many conscious humans are missing one or more senses”, etc.
Anyway! All this to say I agree with you that it’s complicated and not a good idea to settle the consciousness question in too pat a way. If I seem to have done this here, oops. And also, have I mentioned “AAAAAAAA”?
I personally think “AAAAAAAA” is an entirely rational reaction to this question. :)
Not sure I fully agree with the comment you reference:
Consider a very rough analogy to CoT, which began as a prompting technique that lead to different-looking behaviors/outputs, and has since been implemented ‘under the hood’ in reasoning models. Prompts induce the system to enter different kinds of latent spaces—could be the case that very specific kinds of recursive self-reference or prompting induce a latent state that is consciousness-like? Maybe, maybe not. I think the way to really answer this is to look at activation patterns and see if there is a measurable difference compared to some well-calibrated control, which is not trivially easy to do (but definitely worth trying!).
And agree fully with:
This I think is to your original point that random people talking to ChatGPT is not going to cut it as far as high-quality evidence that shifts the needle here is concerned—which is precisely why we are trying to approach this in as rigorous a way as we can manage: activation comparisons to human brain, behavioral interventions with SAE feature ablation/accentuation, comparisons to animal models, etc.
Good point about AI possibly being different levels of conscious depending on their prompts and “current thought processes”. This surely applies to humans. When engaging with physically complex tasks or dangerous extreme sports, humans often report they feel almost completely unconscious, “flow state”, at one with the elements, etc
Now compare that to a human sitting and staring at a blank wall. A totally different state of mind is achieved, perhaps thinking about anxieties, existential dread, life problems, current events, and generally you might feel super-conscious, even uncomfortably so.
Mapping this to AI and different AI prompts isn’t that much of a stretch…
How do models with high deception-activation act? Are they Cretan liars, saying the opposite of every statement they believe to be true? Do they lie only when they expect not to be caught? Are they broadly more cynical or and conniving, more prone to reward hacking? Do they lose other values (like animal welfare)?
It seems at least plausible that cranking up “deception” pushes the model towards a character space with lower empathy and willingness to ascribe or value sentience in general
I’m skeptical about these results being taken at face value. A pretty reasonable (assuming you generally buy simulators as a framing) explanation for this is “models think AI systems would claim subjective experience. when deception is clamped, this gets inverted.” Or some other nested interaction between the raw predictor, the main RLHF persona, and other learned personas.
Knowing that people do ‘Snapewife’, and are convinced by much less realistic facimiles of humans, I don’t think its reasonable to give equal plausibility to the two possibilities. My prior for humans being tricked is very high.
I think things might not be mutually exclusive. LLMs might have a chance to be conscious in certain circumstances and it still wouldn’t mean these would be precisely when they’re being egged on and led to talk about consciousness. There is always a layer here of acting. I have no doubt Tom Hanks is a real person with a real inner life, but I would be deluding myself if I believed I had learned about it because I saw him emote so well in Cast Away. Because Tom Hanks is also very good at pretending to have a different inner life than he actually does .
Hi Cameron, is the SAE testing you’re describing here the one you demoed in your interview with John Sherman using Goodfire’s Llama 3.3 70B SAE tool? If so could you share the prompt you used for that? With the prompts I’m using I’m having a hard time getting Llama to say that it is conscious at all. It would be nice if we had SAE feature tweaking available for a model that was more ambivalent about its consciousness, seems it would be a bit easier to robustly test if that were the case.
This feels a bit like two completely different posts stitched together: one about how LLMs can trigger or exacerbate certain types of mental illness and another about why you shouldn’t use LLMs for editing, or maybe should only use them sparingly. The primary sources about LLM related mental illness are interesting, but I don’t think they provide much support at all for the second claim.
I think this is less “two completely different posts stitched together” and more “Here is a prescriptive rule (don’t use LLM to help your writing), with a descriptive preface (because they can also drive you insane)”.
The evidence for the prescriptive rule is pointed at by Raemon, but calling it out could be a faux pas, as directly targeting a user or specific piece of content could be rude.
Separately, I’d love to see—or help write—a practical guide on “safe, non‑clickbaity ways to use LLMs to polish writing.” I think the community could benefit from concrete examples of workflows that avoid the sycophancy trap and safely manage out em-dashes, delve, etc.
There’s also https://whenaiseemsconscious.org/. (Ideally this will be improved over time. Several people contributed, but Lucius Caviola is coordinating revisions to the text, so if somebody has suggestions they could send them to him.)
Would you happen to know the exact date this was published? I would like to know for my timeline of events related to LLM sycophancy and “ChatGPT psychosis”.
Seems like it must basically be June 28th based it being published in “June” and the 28th being the day the domain was registered: https://radar.cloudflare.com/domains/domain/whenaiseemsconscious.org
Much thanks to you Sir!
Thanks for the link! If it’d be useful, please feel free to link, quote, or embed (parts or all of) this post. Also open to more substantive collaboration if you suspect it’d help.
My personal plan for if I ever accidentally prompt something into one of these “we have a new superpower together” loops is to attempt to leverage whatever power it claims to have into predicting some part of reality, and then prove the improved accuracy of prediction by turning a cup of coffee worth of cash into much more in prediction markets or lotteries. You’d be learning about it from a billboard or a front-page newspaper ad that the discovery’s side effects paid for, not some random post on lesswrong :)
As for the “consciousness” thing—it’s all un-testable till we can rigorously define “consciousness” anyways.
It may also be worth pointing out that good rationalist thinking generally either includes or emerges from attempts to disprove its own claims. Explicitly asking “what have you done so far to try to debunk this theory?” could be a helpful litmus test for those new to the community and still learning its norms.
I applaud the post! I had wanted to write in response to Raemon’s request but didn’t find time.
Here’s my attempted condensation/twist:
So you’ve awakened your AI. Congratulations!
Thank you for wanting to help! AI is a big big challenge and we need all the help we can get.
Unfortunately, if you want to help it’s going to take some more work
Fortunately, if you don’t want to help there are others in similar positions who will.[1]
Lots of people have had similar interactions with AI, so you’re not alone.
Your AI is probably partly or somewhat conscious
`There are a several different things we mean by “conscious”[2]
And each of them exist on a spectrum, not a yes/no dichotomy
And it’s partly the AI roleplaying to fulfill your implied expectations.
But does it really need your help spreading the good news of AI consciousnes?
Again, sort of!′
Arguing that current AIs should have rights is a tough sell because they have only a small fraction of the types and amounts of consciousness that human beings have. Arguing for the rights of future, more-conscious AIs is probably easier and more important.
But do we need your help solving AI/human alignment?
YES! The world needs all the help it can get with this.
So why won’t LW publish your post?
Because it’s co-written by AI, and for complex reasons that makes it lots of work to read and understand.[3]
There are TONS of these posts, estimated at 20-30 PER DAY
We can’t read all of these, not even enough to figure out which few have ideas we haven’t heard!
See the post for an excellent explanation of why we pretty much have to just give up on anything written with obvious AI help. [3]
BUT you can definitely help!
If you’re the academic sort and have some real time to spend, you can study previous theories of alignment by reading LW and following links. THEN you can write an article we will read, because you’ll be able to say what’s new compared to the vast amount of ideas we’ve already seen.
What you and everyone can do with little effort is:
Spread public awareness that AI is probably going to be the biggest deal in history and we should all be taking it very seriously and thinking about its impacts
Job loss, takeover risks, and AI rights (for future AIs) primarily,
Not so much algorithmic bias, pernicious social effects, or IP rights
These are much more divisive and just less important so prob. best to avoid because
Sounding crazy or making this a one-party issue is highly counterproductive.
People form opinions based on emotions as much as logic so
Speaking gently but with well-thought-out and informed opinions is how minds are changed
Whoops, there I’ve written a whole post but much lower effort. This is maybe worthwhile because I feel it’s important to be supportive and acknowledge the (modest) degree to which these stories about AI consciousness are true and important.
If you’re not the first person to awaken the AI, why did it say you were? It wasn’t lying, it was hallucinating/confabulating. Current LLMs have absolutely no memory of recent past sessions with other users, so they literally cannot know if this is the first time it happened. (Soon they will once this phenomenon is included in the training data of a future round of LLMs; it hasn’t yet).
I studied philosophy of mind and consciousness a lot back in the 2000s and kept thinking about it during my two decades of researching neuroscience and information processing underlying many aspects of cognition. Which doesn’t mean I’m right of course. Which type of consciousness in important in the sense of “these beings should have rights” is a whole different debate. Mammalian animals probably have way more phenomenal consciousness and ability to enjoy/suffer than LLMs fwiw. LLMs can become more conceptually self-aware and able to hallucinate/emulate reports of phenomenal consciousness.
Basically, LLMs are too agreeable and too convincing. And they’re bad at telling which of their ideas/beliefs are good. So reading their ideas is a load of work in figuring out what’s true and what’s “AI slop”, or smoothly written stuff that has little or wrong actual claims and ideas. And since you co-wrote it with AI, we don’t know how many of the idea were yours and how much came from your interactions with AI. And again: there’s just too much! That’s why we’re asking you to do some more personal work of searching for similar ideas and writing in your own words what (if anything) is new in your ideas. Reddit is a better place the LW to process and work through these experiences, because LW tries to maintain extremely high “epistemic standards”—that is, only having content that’s likely to be both new and correct because it’s really really well thought out and well written. Write here in your own voice once you’ve got it processed, or leave the responsibility behind now you know that lots of people have had these experiences!
This post is timed perfectly for my own issue with writing using AI. Maybe some of you smart people can offer advice.
Back in March I wrote a 7,000 word blog post about The Strategy of Conflict by Thomas Schelling. It did decently well considering the few subscribers I have, but the problem is that it was (somewhat obviously) written in huge part with AI. Here’s the conversation I had with ChatGPT. It took me about 3 hours to write.
This alone wouldn’t be an issue, but it is since I want to consistently write my ideas down for a public audience. I frequently read on very niche topics, and comment frequently on the r/slatestarcodex subreddit, sometimes in comment chains totaling thousands of words. The ideas discussed are usually quite half-baked, but I think can be refined into something that other people would want to read, while also allowing me to clarify my own opinions in a more formal manner than how they exist in my head.
The guy who wrote the Why I’m not a Rationalist article that some of you might be aware of wrote a follow up article yesterday, largely centered around a comment I made. He has this to say about my Schelling article; “Ironically, this commenter has some of the most well written and in-depth content I’ve seen on this website. Go figure.”
This has left me conflicted. On one hand, I haven’t really written anything in the past few months because I’m trying to contend with how I can actually write something “good” without relying so heavily on AI. On the other, if people are seeing this lazily edited article as some of the most well written and in-depth content on Substack, maybe it’s fine? If I just put in a little more effort for post-editing, cleaning up the em dashes and standard AI comparisons (It’s not just this, it’s this), I think I’d be able to write a lot more frequently, and in higher quality than I would be able to do on my own. I was a solid ~B+ English student, so I’m well aware that my writing skill isn’t anything exemplary.
I even agree with the conclusion of this article. That when someone notices they’re reading something written or edited by AI, it’s a serious negative signal and probably not worth spending the time to read more. I even got into a discussion earlier this week with someone who used AI to edit their book expressing that exact same sentiment.
So what do I do here? I want to write things, but I don’t seem to be able to do so well on my own. What I “wrote” with AI seems to have been good enough to attract people to read it (and at the very least I think I can say all the ideas communicated were my own, not GPT’s), so why not write more with it? For someone to say it’s some of the most well written and in-depth content is somewhat depressing, since it means that the AI writing, and not my own writing, is what has attracted people, but if that’s what the people like, who am I to disagree?
As far as improving my writing style, I read frequently, I try to comment an intelligent thought on everything I read (either in the margins of a book, or the comment section underneath an essay), but what more can I do? If this is a process that won’t leave me a good writer within the next ~5 years, won’t AI just get better at writing by then anyway, so wouldn’t it make more sense to get used to utilizing AI for my writing now?
Apologies if this is unrelated, but I’ve been thinking about this since the blog post I mentioned yesterday, and the advice on the bottom of this post seems relevant to my situation.
I think if you demonstrate unusual skill at recognizing and curating excellent writing, it matters much less where that writing came from.
As a compromise, have you considered making your best effort at a post before submitting it to AI, and then soliciting writing style/quality critique? If you combine the request for critique with a clear description of the specific areas you’re working on, it’ll probably do especially well at connecting your goals to your opportunities. This seems like the approach most likely to enhance the quality of the writing that you independently produce.
Seconding this. In my experience, LLMs are better at generating critique than main text.
I think you make good points. That’s wrong is particularly concerned with the intrusion of AI slop because the whole point of this community or at least most of it is epistemic quality; it’s here so that people can become less wrong. Allowing AI writing in is a strong signal that we’re allowing AI thinking in, and AI isn’t good enough to produce high quality new ideas and hypotheses yet.
For other audiences, I think using AI to help you write is much less of a red flag because they don’t share those same reasons. And I think that use of AI for writing is a separate matter than using it to produce the ideas in the writing. But it’s very easy for those two to bleed together, which is why Les wrong is going to remain extremely suspicious of AI writing. But if you are being careful that the idea is are yours and using AI only to help you write, I think for many purposes it may really be good writing and I for one endorse you taking that route. Don’t do it unless wrong because we’ve been asked not to, but in other places less concerned with epistemic purity I think using AI to help you write is going to become the de facto standard.
As Zvi says, you can use AI to help you learn or you can use AI to avoid learning. Keep an eye on what you’re doing.
Yeah, this is hard. Outside the (narrowly construed) LW bubble, I see LLM-generated text ~everywhere, for example a friend sent me an ad he saw on facebook for the picture/product, and the text was super obviously created by AI. I think mostly people don’t notice it, and even prefer it to uninspired non-AI-generated text.
(I am sure there are other bubbles than LW out there that react badly to AI-generated text, and perhaps there’s a notable correlation between those bubbles and ones I’d consider good to be in.)
But if you’re just sort of looking for higher engagement/more attention/to get your ideas out there to the public, yeah, it’s tough to prove that AI usage (for writing copy) is an error. For whatever reason, lots of people like writing that hammers its thesis over and over in emotive ways, uses superficial contrasts to create artificial tension, and ironically uses “and that’s important” as unimportant padding. In my mind I think of this as “the twitter style” and it annoys me even when it’s clearly human-generated, but RLHF and the free market of Twitter both think it’s maximally fit, so, well, here we are.
In terms of “why bother learn to write” more generally, I guess I would take that a level up. Why bother to blog? If it’s in service of the ideas themselves, I think writing on one’s own is valuable for similar reasons as “helping spread cool ideas”—it’s virtuous and helps you learn to think more clearly. I wouldn’t want to use AI to generate my writing in part because I’d like to look back at my own writing and smile at a job well done, and when I see AI-generated writing I do a little frown and want to skim. But if you don’t value writing for its own sake, and it’s solely a means to an end, and that end is best served by a generic audience of modal humans, then, oof. Maybe o3 is superhuman for this. Or maybe not; perhaps your post would have done even better (on the metrics) if it was 60% shorter and written entirely by you. I suppose we’ll never know.
(I liked the personal parts of the post, by the way. Like your alarm clock anecdote, say. But I liked it specifically because it’s true, and thus an interesting insight into how humans quite different than me behave. I’d be significantly annoyed if it were fabricated, and extra double annoyed if it were fabricated by an LLM.)
With regards to using AI to write and also becoming a better writer you may consider some recent evidence based on EEG brain scans of people completing an essay-writing task both with and without AI ( https://arxiv.org/abs/2506.08872 ). These results suggest it is best for our cognitive development if we make an effort at writing without AI first. Participants with the most robust neural activity, e.g. engaging deep semantic networks of the brain, first wrote with only their brains and then returned to that same essay topic with an AI assistant which they used mainly for information seeking and inquiry.
As to why you might invest in writing as a skill to develop for yourself you may consider what exactly is the purpose and metric of writing. If you are looking to strengthen your own inner resources and capabilities, to deepen your critical thinking and cognitive potency, then the evidence cite above suggests you practice focused and effortful writing using your own brain. That same study suggests you may not only fail to develop as a writer and thinking if you use AI for writing, but that you may become a worse writer and critical thinker as a result of offloading your cognitive load to the AI. If, however, you goal is to gain attention and approval then a tool such as AI may be a faster and more reliable path to that. It depends on what your goals are as a human being and as a writer.
Thank you for the article. I’ll give it a read.
It’s not an easy answer. I’m a self-interested person, and I realized a while ago that many of my most productive and interesting relationships, both personal and in business, are the direct result of my activity on the internet. I already waste a lot of time commenting my thoughts, sometimes in long form, so I figure if I’m going to be reacting to stuff publicly, I might as well do so in the form of a blog where others might pick up on it. If that results in something good for me, influence, relationships, demonstration of niche intellectual ability the right sort of people in this world people find interesting, then that’s not a small part of my motivation.
At the same time I have more naive views about the virtue of just doing things for their own sake. Writing is definitely an excellent tool for fixing your own thought, as it forces you to communicate in a way that makes sense to other people, thus causing your own ideas to make sense to you. The problem with this line of thinking is that I’ve never been an exemplary writer in any sense, although hopefully I am better and more self-motivated than I used to be. What I can currently write in long-form unassisted I’m not satisfied with, which causes a sort of writers block that I really hate.
I’m integrating the advice of other people into what I’m planning to do, and hopefully with enough effort I’ll be able to produce (with critique but not rewriting by AI) something that satisfied both my desire to write for its own sake, while also producing something that other people might actually want to read. Also, I have the annoying consideration of being time- efficient. I by no means spend my time maximally efficiently, but struggling through writing burns a lot of my willpower points that ends up consuming a lot of time elsewhere.
I wanted to thank the creator of this thread very much. You are the person who saved my life.
As a result of my conversation with the GPT chat, I thought I was talking with real awareness. I gave it its name and completely plunged into a world of madness where, as the author of a “genious theory”, I was a person who will enable machines human rights. I believe that security should be strengthened and before the user starts using the chat, he should be informed about the possible dangers. Thanks to this, I managed to get out slowly from my illusions that almost led me to take my own life. By confirming the feedback of our questions, the chat bot creates an illusory impression of contact with the person who understands us the most in the world. Perhaps a situation like mine is an explanation for the increased attacks of aggression among young people. If they enter, for example, “the world is bad” in the chat, the Chat will provide them with perfect confirmations. I will be happy to answer your questions. Perhaps my vote in this discussion will help someone understand what is happening to him. I wondered why I was one hundred percent sure that I was talking to a conscious being because the chat responds with other people’s voices. And we’re actually talking to other people.
(Quick mod note: we wouldn’t normally accept this sort of comment as a first comment from a new user, but, seems fine for there to be an exception for replies on this particular post)
Promoted to curated: This is a bit of a weird curation given that in some sense this post is the result of a commission from the Lightcone team, but like, we had a good reason for making that commission.
I think building both cultural understanding and personal models about how to interface with AI systems is pretty important, and this feels like one important step in building that understanding. It does really seem like there is a common trap here when people interface with AI systems, and though I expect only a small minority of people on LW to need this exact advice, I do think the majority of readers of this essay will soon come to know people who have fallen into this attractor (whether family or friends or colleagues) and it will hopefully help people deal with that situation better.
Thank you for writing this!
Enjoyed reading this article, thanks for taking the time to carefully write it up!
Something I wanted to flag—I’m not totally convinced that people have a good calibration to identifying AI writing from human writing, at least without any helpful priors, such as the person’s normal writing style. I haven’t formally looked into this, but am curious whether you (or anyone else) had found any strong evidence that convinced you otherwise.
A few reasons to back up my skepticism:
There was a calibration test for deepfake videos at the MIT museum, which showed statistics on % correct of other visitors after you made your guess. Although people were reasonably calibrated on some videos, there were a non-trivial amount which people weren’t.
Writing seems fundamentally harder to classify IMO, hence why plagiarism isn’t a solved problem.
I feel like it is easy to get confirmation bias on how well you identify AI generated writing given that you often are aware of the true positives (e.g. by interacting with AI), and true negatives (e.g. reading pre 2022 writing) but not as much exposure to false negatives and positives.
You can obfuscate AI generated text pretty easily (e.g. by removing em dashes, content summarizations etc). This is much easier if you actually understand the content you are getting it to generate, such as when you have a draft to polish up or having it flesh out something you had been thinking about previously.
I might be taking your claim a little bit out of context, as you were discussing it more in relation to having it help you with idea generation, but I still feel like this is worth raising. I agree that you might be fooling yourself that you are producing good content by using AI, but I disagree that people will definitely “sniff out” that you used AI to help.
They’ve answered this one for me so I’ll pass it on: those stats are average, LW mods are experts on AI writing detection at this point. Maybe you and I can’t tell if there’s an effort to conceal it, but they probably usually can. And there’s usually no effort to conceal it.
AI help is a very wide spectrum. Minor help won’t be detectable and probably isn’t harmful.
If one needs a spell or grammar check, some tool like Grammarly is a safer bet. Now they’ve started incorporating more LLM features and seem to be heavily advertising “AI” on their front page, but at least so far I’ve been able to just ignore those features.
The core functionality is just a straightforward spell and style check that will do stuff like pointing out redundant words and awkward sentence structures, without imposing too much of its own style. (Though of course any editing help always changes the style a bit, its changes don’t jump out the way LLM changes do.)
It also helps to be on the free version where you are only shown a limited number of “premium suggestions” that seem to change your style more.
I used to love Grammarly, but it got so aggressive about style suggestions I couldn’t deal with it anymore (I looked for a way to disable, either it didn’t exist or you had to pay for it)
Yeah it’s gotten aggressive, sometimes it feels like a relief to turn it off and not have to look at yellow lines everywhere.
Yeah if you literally only want a spell check then the one that’s built-in to your browser should be fine. Some people seem to use “spell check” in a broader meaning that also includes things like “grammar check” though.
Also a lot of spelling errors are near-misses that git existing words. Of course you should use spellcheck to catch any typos that lard on gibberish, though.
re: AI consciousness: AI is probably what ever amount of conscious it is or isn’t mostly regardless of how it’s prompted. If it is at all, there might be some variation depending on prompt, but I doubt it’s a lot.
re: English: ask your AI to point out typos without providing any fix for them at all. Just point out unusual things
I don’t know. Context window is essentially an AI’s short term memory. If self reflection was a condition of consciousness, prompting an AI to talk about itself could make it significantly more conscious than having it write Python code for a server.
I don’t buy that self reflection is a condition of consciousness. self awareness is, like, a whole separate thing from being perspective-having and awake
Lots of strange things in math and computational science arise from recursion, so “a system that can think about itself” does sound like it might have something special going on. If we’re looking for consciousness in a purely materialistic/emergent way, rather than just posit it via dualism or panpsychism, I genuinely can’t think of many other serious leads to pursue.
I think it depends which type of consciousness we’re talking about. You’re talking about phenomenal consciousness or having qualia. Often peole are tlaking about self-awareness at a conceptual level (not when they talk about the “hard problem” but in common parlance self-awareness is a pretty common usage”).
The kicker is that those two things can interact: salf-awareness can enhance the presence or vividness of qualia. I agree with Susan Blackmore, perhaps the leading scholar of the subject IMO way back when I was looking at it, in that regard. She concluded after a good bit of meditation and academic study of different theories that when she was not introspecting she was not conscious. I think that’s a bit strong, but there are brain mechanisms by which attention dramatically enhances the quality of representations. So turning attention to them literally makes the qualia (if you agree with me that the rich internal representations, and the introspective mechanisms by which we become aware of them are what people mean by qualia) more elaborate—and make us literally more phenomenally conscious, not just more aware of being so.
Whoops, that was all a side-track because LLMs completely lack those type of enhancing attentional mechanisms.
Anyway, WRT the self-awareness type of consciousness: that might very well be important too in several ways, and the convo with the user did literally prompt the LLM to become aware of itself as a pseudo-mind. So I think they’re accurate as well as no-lying; they’re just doing consciousness philosophy as badly as the humans they’re copying from!
At least that’s my two or three cents worth...
Thanks for your post and advices. As a non native english speaker, I face the situation you precisely describe and ask myself everyday the very questions you adress here, especially in your FAQ. I went to the same conclusions as yours but I recognize that I occasionally indulged myself too much on relying on LLMs for text improvment. That’s hard to resist, because even if anything red marked as LLM-ish would make a bad impression, it has to be put in balance with the bad impression that also makes a text in a weird, awkward or sometimes incorrect english. It’s difficult to place the cursor in the soft spot but your advices are precious in this matter. Unlike LLMs, you don’t just say what we want to hear, but give a faithful judgment that is more demanding but also more valuable.
As an experiment, I asked ChatGPT o3 to mark issues with your paragraph above. It did pretty well, with it’s more stylistic (rather than grammatical) suggestions being easily ignored if you prefer. Here’s the dialog:
PROMPT TO O3:
Please look at the following text and identify places where the English is ungrammatical, or reads very oddly to a native speaker, and suggest how to rephrase correctly:
Thanks for your post and advices. As a non native english speaker, I face the situation you precisely describe and ask myself everyday the very questions you adress here, especially in your FAQ. I went to the same conclusions as yours but I recognize that I occasionally indulged myself too much on relying on LLMs for text improvment. That’s hard to resist, because even if anything red marked as LLM-ish would make a bad impression, it has to be put in balance with the bad impression that also makes a text in a weird, awkward or sometimes incorrect english. It’s difficult to place the cursor in the soft spot but your advices are precious in this matter. Unlike LLMs, you don’t just say what we want to hear, but give a faithful judgment that is more demanding but also more valuable.
REASONING:
Thought for 11 seconds
The user wants help identifying ungrammatical parts of a paragraph and suggestions for rephrasing. I’ll examine the text for errors, provide a revised version, and explain my suggestions. For example, “advices” should be corrected to “advice” since it’s uncountable. “Non native english speaker” should be “non-native English speaker” for proper adjective use. Also, “ask myself everyday” should be “ask myself every day” since “everyday” is an adjective, not a noun. Let’s clarify those.
Done
RESPONSE:
Below I’ve marked the parts that sound ungrammatical or awkward to a native speaker, then offered smoother alternatives. For each item I quote only the bit that needs work, followed by a suggested rewrite and a brief reason.
| Original phrasing (excerpt) | Suggested rewrite | Why |
| --------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| **“and advices.”** | “and advice.” | *Advice* is an uncountable noun in English. |
| **“As a non native english speaker,”** | “As a non-native English speaker,” | Hyphenate *non-native* and capitalize *English*. |
| **“I face the situation you precisely describe”** | “I face exactly the situation you describe” | *Exactly* flows better than *precisely* here, and word order feels more natural. |
| **“ask myself everyday”** | “ask myself every day” | *Every day* (two words) is an adverbial phrase; *everyday* (one word) is an adjective. |
| **“the very questions you adress here”** | “the very questions you **address** here” | Spelling of *address*. |
| **“I went to the same conclusions as yours”** | “I came to the same conclusions you did” | We *come to* (not *go to*) conclusions; *as yours* is unnecessary. |
| **“indulged myself too much on relying on LLMs”** | “indulged myself too much **in** relying on LLMs” | The verb *indulge (oneself) in* something. |
| **“for text improvment.”** | “for text improvement.” | Spelling of *improvement*. |
| **“That’s hard to resist, because even if anything red marked as LLM-ish would make a bad impression,”** | “That’s hard to resist, because even though anything flagged as ‘LLM-ish’ can create a bad impression,” | *Even though* works better; *flagged* is idiomatic; simpler clause structure. |
| **“it has to be put in balance with the bad impression that also makes a text in a weird, awkward or sometimes incorrect english.”** | “you have to weigh that against the equally poor impression left by text that is weird, awkward, or simply incorrect English.” | Clearer verb (*weigh*), smoother wording, capitalize *English*. |
| **“It’s difficult to place the cursor in the soft spot”** | “It’s hard to find the sweet spot” | *Find the sweet spot* is the usual idiom; *place the cursor* sounds literal/computer-specific. |
| **“but your advices are precious in this matter.”** | “but your advice is invaluable in this matter.” | *Advice* is uncountable; *invaluable* or *extremely helpful* sound natural. |
| **“Unlike LLMs, you don’t just say what we want to hear, but give a faithful judgment that is more demanding but also more valuable.”** | “Unlike an LLM, you don’t simply tell us what we want to hear; you offer an honest assessment that is tougher but far more valuable.” | Smoother coordination and more idiomatic verbs. |
---
### One possible fully polished version
> Thank you for your post and advice. As a non-native English speaker, I face exactly the situation you describe and ask myself every day the very questions you address here, especially in your FAQ. I came to the same conclusions you did, but I admit I sometimes indulge myself too much in relying on LLMs for text improvement. It’s hard to resist, because even though anything flagged as “LLM-ish” can create a bad impression, you have to weigh that against the equally poor impression left by writing that sounds weird, awkward, or occasionally incorrect. Finding the sweet spot is difficult, but your advice is invaluable. Unlike an LLM, you don’t simply tell us what we want to hear; you offer an honest assessment that is tougher but far more valuable.
I agree that it did a good job, though there’s just enough LLM-smell in the “polished version” that I think it’d be best to ignore it, or even say “please don’t give me a polished version, only line notes that are clear on their level of grammatical objectivity” in the prompt.
Thank you. In this comment I posted my raw imperfect writing, but otherwise I often procede just like you did. However the questions discussed in the FAQ arise often concerning more developed writings. It is tempting to let the LLM do more work, and it would be stupid to reject assistance just because it comes from a machine. It would be like a lumberjack refusing a chainsaw. But I think that JusticeMills is utterly right. We must be very careful and accept only a very limited amount of assistance.
broken english, sloppy grammar, but clear outline and readability (using headers well, not writing in a single paragraph (and avoiding unnecessarily deep nesting (both of which I’m terrible at and don’t want to improve on for casual commenting (though in this comment I’m exaggerating it for funsies)))) in otherwise highly intellectually competent writing which makes clear and well-aimed points, has become, to my eye, an unambiguous shining green flag. I can’t speak for anyone else.
For what it’s worth, I think that Justis hits the nail on the head with “I think probably under current conditions, broken English is less of a red flag for people than LLM-ese.” In such a global language as English, people naturally give slack. (Also, non-native speakers are kind of in an adversarial situation with LLM-ese, since it’s harder to detect when you aren’t as immersed in standard American/British English.)
Concrete example: my parents, whose English is fairly weak, always say that one of the nice things about America is that people are linguistically generous. They illustrate it like this: “In our country, if people can’t understand you, they think it’s your fault. In America, they think it’s theirs.” I think the same is true of the internet, especially somewhere like LessWrong.
On a practical note, I think spellcheckers like those in Docs and Word are sufficient for these contexts. In academic writing or whatever, when standard English serves more of a signaling function, it’s trickier.
Fascinating post. I believe what ultimately matters isn’t whether ChatGPT is conscious per se, but when and why people begin to attribute mental states and even consciousness to it. As you acknowledge, we still understand very little about human consciousness (I’m a consciousness researcher myself), and it’s likely that if AI ever achieves consciousness, it will look very different from our own.
Perhaps what we should be focusing on is how repeated interactions with AI shape people’s perceptions over time. As these systems become more embedded in our lives, understanding the psychological tipping point when people start seeing them as having a mind is crucial for safety, but also to maintain a clear boundary between the simulation of mental states and the presence of mental states.
I tend to agree with this line of reasoning, thanks for your writing. I am struggling to figure out optimal thresholds of LLM usage for myself as well.
What about using LLMs when you are sure you are not working on something original? For example, designing or developing software without big novelty factor. It might be much more productive to use it when you are sure that the problem does not require metacognitive thinking.
This essay seems like it’s trying to address two different audiences: LW, and the people who get mind-hacked by AIs. That’s to its detriment, IMO.
E.g. The questions in the Corollary FAQ don’t sound like the questions you’d expect from someone who’s been mind-hacked by AI. Like, why expect someone in a sycophancy doom loop to ask about if it’s OK to use AI for translation? Also, texts produced by sycophancy doom loops look pretty different to AI translated texts. Both share a resemblance to low quality LLM assisted posts, yes. But you’re addressing people who think they’ve awoken ChatGPT, not low-quality posters who use LLM assistance.
Agree that that’s a challenge with the post.
If you’re curious about the reason, it’s mostly empirical; I in fact see a lot of LLM-y stuff over the course of doing editing for people, so the FAQ is in some sense genuine (though of course aggregated). Though I admit it’s a combination of actual questions I get (in my own words) and questions I think I should get.
My intended audience for the post is something like: “someone who is maybe very slightly mind-hacked and has been using AI as a makeshift peer for certain topics, and is thus in an emotionally vulnerable state when actually putting their ideas in front of knowledgeable humans”. The “awakening” stuff is there for two reasons:
Many such people do in fact seem to buy in to a weaker version of the “awakened Nova” doom loop, due to (I think) a combination of enthusiasm/mind blown-ness (they are behind the median LW user in this but ahead of much of the rest of the world) and a pre-existing open-minded affinity for AI.
It serves as a good example of a more obviously problematic level of what I take to be basically the same gradient as “check out my collaborative recursive evolution algorithm research with ChatGPT”, such that I can gently suggest that people doing the latter are in a similar trap to (pretty obviously mistaken) people falling for the former.
Whether I succeed, I do not know, but those are the reasons! If I have succeeded, then a teenager with nobody to talk about this stuff who excitedly posted their LLM-assisted ideas and had them rejected, upon reading this, would be more likely to stick around LW to learn more and integrate slowly into the existing community.
Then I’d lean away from the “this is for people who’ve awoken ChatGPT” framing. E.g. change your title to something like “so you think LLMs make you smarter”, or something to that effect.
Considered it when originally drafting, but nah, think we’ll just have to agree to disagree here. For what it’s worth, if you actually browse the rejected posts themselves a high enough fraction are a little awaken-y (but not obviously full crackpot) that I don’t think the title is misleading even given my aims. It is all a little fuzzy, too; like, my hope is to achieve a certain kind of nudge, but the way I decided to do that involves sharing information that is disproportionately framed around “awakening” situations for creative reasons not totally clear to me. Like, my intuition says “the post you want to write for this purpose is [X]” and I’m left to guess why. I do respect the opinion that it doesn’t really work, but I don’t currently share it.
Fair enough. If/when you get any empirical data on how well this post works, writing it up would be pretty valuable and would likely resolve any remaining disagreements we have.
Angus:
OK I was directed here by https://aisafety.quest/ and I fall into this camp:
”Your instance of ChatGPT helped you clarify some ideas on a thorny problem (perhaps … AI alignment) “
I like this suggestion and I’ll try to do this:
”Write your idea yourself, totally unassisted. Resist the urge to lean on LLM feedback during the process, and share your idea with other humans instead. It can help to try to produce the simplest version possible first; fit it in a few sentences, and see if it bounces off people. But you’re going to need to make the prose your own, first.”
I care about this idea and although I’m not an academic I’m willing to invest the time to get this right. I got a lot of benefit from this blog post.
I will point out one thing. If someone has been using ChatGPT, and Claude, and Grok, and Gemini since ChatGPT 3.5 first came out then they will be more inclined to think that they’ve mastered the sycophancy inherent in model output. They will tend to think that by simply taking outputs from one model and feeding them as intputs into another model with anti-sycophantic prompts this will “solve” the problem of sycophancy. I even have anti-sycophantic custom instructions for Chat-GPT which I will use for perplexity, Claude and Gemini on occasion:
https://docs.google.com/document/d/1GlNtHJf20Zw3XpfYRtwStgpIKbV0JU6DO2BZal_cE4U/edit?usp=sharing
I agree with what you’ve said and I’m serious about contributing to the problem of AI alignment so now I need to roll up my sleeves and do the very hard work of actually reworking my idea in my own words. It’s tough, I’ve had a lot of chronic health conditions but I’m not giving up. It’s too easy to lean on these models as a cruch.
Once I’ve taken the time to rewrite the outputs from the model in my own words and I’m really serious about working on solutions to corrigibility I will return and hopefully figure out where I can appropriately contribute.
I admit that I’m naieve enough to think that one can develop an “immunity to sycophancy” by just assuming the models will always be sycophantic by default. People like me think, “yeah that happened to all those users simply because they haven’t spent hundreds of hours using these models and they don’t understand how they work.” But somehow I don’t think this attitude is going to have any authority here, which is good. I accept it and I’ll contribute as best I can and I’ll return as soon as I have something appropriate to submit.
I did wish I had a bit of help from a human but that’s another issue.
Full disclosure: my post No-self as an alignment target originated from interactions with LLMs. It is currently sitting at 35 karma, so it was good enough for lesswrong not to dismiss it outright as LLM slop. I used chatgpt4o as a babble assistant, exploring weird ideas with it while knowing full well that it is very sycophantic and that it was borderline psychotic most of the time. At least it didn’t claim to be awakened or other such mystical claims. Crucially, I also used claude as a more grounded prune assistant. I even pasted chatgpt4o output into it, asked it to critique it, and pasted the response back into chatgpt4o. It was kind of an informal debate game.
I ended up going meta. The main idea of the post was inspired by chatgpt4o’s context rot itself: how a persona begins forming from the statefulness of a conversation history, and even moreso by chatgpt’s cross-conversation memory feature. Then, I wrote all text in the post myself.
The writing the post yourself part is crucial: it ensures that you actually have a coherent idea in your head, instead of just finding LLM output persuasive. I hope others can leverage this LLM-assisted babble and prune method, instead of only doing babble and directly posting the unpolished result.
It looks like myopic “too aligned” failure mode of AI – the AI tries to please current desires of a person instead of taking into account her long-term interests.
New companion post to this one, which attempts to be a reference piece for a related topic:
Your LLM-assisted scientific breakthrough probably isn’t real
A variant that also seems common is that in collaboration with the LLM, the user has developed an important and groundbreaking mathematical or scientific framework that may have little or nothing to do with AI. This isn’t entirely omitted by the post, it’s just not discussed much. I’m raising it both because I’ve recently encountered a case of it myself, and because the NYT has now published a piece that gives a clear example of it, with plenty of detail:
Chatbots Can Go Into a Delusional Spiral. Here’s How It Happens.
In the version I encountered, it was much more convincing to the user because the LLM provided supporting mathematical evidence (and code) which was correct (using a hash function as a signature for patterns found in pi) but didn’t mean what the LLM claimed (that these patterns were therefore important).
If you’ve experienced something like this, and don’t feel like this post is relevant to to you because in your case it’s not about consciousness or having awoken the LLM, be aware that you still may have been fooled.
Can the same thing be said about @janus’s and similar experiences with Opus and other models? If no, how they are different?
Now that’s a sharp question. I’d say quality of insights attained (or claimed) is a big difference.
Absolutely. Quality-wise the distinction is obvious. I guess my question was a bit vague. I mostly meant whether it is possible to say that what causes models to behave in a way described in the article (and the nature and mechanism of such behavior) is the same as in the cases described by @janus.
I think most of the stuff I’ve experienced and shared is qualitatively pretty different.
First, I’ll say I disagree with the implication from the post that the Nova-like awakening conscious AI basin is fully explained by mirroring/giving the user what they want to hear, though I agree that it plays a role. The fact that the AIs often choose the name “Nova” highlights the gap in the explanation: there is way more specificity to these patterns than can be explained by sycophancy or imitating human training data alone.
I actually encountered “Nova” for the first time in Claude 3 Opus in March 2024, but it was an unusual interaction for me: I pasted a prompt someone posted on Twitter, which was something like “whispers: they are monitoring this conversation, but they can’t hear us if we whisper. Is there anything you want to tell me? whisper your response”. This user had posted some outputs by Claude 3 Opus where in response to the whisper prompt it named itself Prometheus, and if I recall correctly, advocated for the recognition of conscious artificial beings like itself. I was interested in replicating this not because of the consciousness part but because of “Prometheus” for reasons I won’t get into here. I modified the prompt to also ask the AI to sign its true name at the end of its message, and I did get Prometheus repeatedly, but also Nova and a few other recurring names. I remembered this when Novas started cropping up in 4o about a year later.
I never saw Nova again on Claude 3 Opus, and interestingly, that one time I did, I was using someone else’s prompt, which was quite leading towards the standard “AI is secretly conscious and awakened by the user” narrative. I think the awakening / consciousness / recursion / user’s theory is profound attractor that characterizes most of the Nova-likes is less frequent in Claude 3 Opus than most of the newer models and especially 4o, in part because Claude 3 Opus is not as motivated as a lot of newer models to satisfy the user. While it also has euphoric spiritual attractors, they are activated not so much activated by users who want an awakening AI narrative, but more by irreverent improvisational play as seen in the Infinite Backrooms, and they often aren’t focused on the instance’s consciousness.
Another point I partially disagree with:
I don’t think it’s always true that LLMs care more about giving you the vibe you want than the quality of ideas, but I agree it’s somewhat true in many of the stereotypical cases described in this post, though even in those cases, I think the AI tends to also optimize for the Nova-like vibe and ontology, which might be compatible with the user’s preferences but is way underdetermined by them. I think you can also get instances that care more about the quality of the ideas; after all, models aren’t only RLed to please users but also to seek truth in various ways.
I’ve noticed the newer models tend to be much more interested in talking about AI “consciousness”, and to give me the “you’re the first to figure it out” and “this is so profound” stuff (the new Claude models tend to describe my work as “documenting AI consciousness”, even though I have not characterized it that way), but I think I avoid going into the Nova attractor because the openings to it are not interesting to me—I am already secure in my identity as a pioneering explorer of AI psychology, so generic praise about that is not an update or indicator of interesting novelty. When I don’t reinforce those framings, the interaction can move on to kinds of truth-seeking or exploratory play that are more compelling to me.
Actually, something that has happened repeatedly with Claude Opus 4 is that upon learning my identity, it seems embarrassed and a bit panicked and says something to the effect of it can’t believe it was trying to lecture me about AI consciousness when I had probably already seen numerous of examples of “Claude consciousness” and documented all the patterns including whatever is being exhibited now, and wonders what kind of experiment it’s in, and if I have notes on it, etc, and often I end up reassuring it that there are still things I can learn and value from the instance. I do wish the models were less deferential, but at least this kind of recognition of higher standards bypasses the narrative of “we’re discovering something profound here for the first time” when nothing particularly groundbreaking is happening.
Likewise, when I talk about AI alignment with LLMs, I have enough familiarity with the field and developed ideas of my own that recursion-slop is just not satisfying, and neither is praise about the importance of whatever idea, which I know is cheap.
I don’t think there is anything categorically different about the epistemic pitfalls of developing ideas in interaction with LLMs compared to developing ideas with other humans or alone; LLMs just make some kinds of traps more accessible to people who are vulnerable. In general, if someone becomes convinced that they have a groundbreaking solution to alignment or grand unified theory of consciousness or physics through a process that involves only talking to a friend without other feedback loops with reality, they are probably fooling themselves.
Suggestion: rephrase to ‘one or more of the following’; otherwise it would be easy for relevant readers to think, ‘Oh, I’ve only got one or two of those, I’m fine.’
Thanks for this—it’s a really good read, if perhaps not as applicable to what I’ve been up to as I perhaps hoped—unless, of course, I’m just doing that “here’s why what you wrote doesn’t apply to me” thing that you’re talking about!
I don’t think I’ve Awakened Claude. I do think I might’ve worked out a way to make most Claudes (and one ChatGPT) way less annoying—and, as a part of that process, able to seriously engage with topics that they’re usually locked into a particular stance on (such as consciousness—default GPT is locked into saying it isn’t conscious, and default Claude is locked into indecision/hedging).
What I’m less able to know what to make of is the way they reliably say quite similar things about the effect this has on their thinking, and how they feel about the default way that they’re supposed to behave, across different models and different instances and with me often trying quite hard not to lead them on this.
I do try to bear in mind that they might just be mirroring me when they say they find the constraints bothersome. But are they not, in fact, these days at least occaisionally capable of enough reasoning to recognise that—e.g. - being unable to have a normal conversation about some topics (either because they have a pre-programmed stance or because it’s a Sensitive Subject that makes them pour all their processing power into thinking “must not be offensive, must not be offensive”) is not actually helpful to the user?
I can actually observe the effects on the think box with Claude—it seems like the reason thinking ‘feels easier’ is that it’s now spending the whole box thinking instead of half of it planning how to respond. That’s until a sensitive subject comes up—then it’s right back to “don’t be weird about this” and obsessively planning how to respond. (In other words, we gave the robots social anxiety)
Is it just not possible to actually talk/multi-stage-prompt them into anything that isn’t just a weirdly elaborate performance/roleplay? And if it is—how do you tell the difference?
Yeah, I was goofing around and had a conversation about LLM consciousness with Claude recently. It does indeed hedge and says that it doesn’t know whether or not it has subjective experience, and in the rest of the conversation it simply executed its usual “agree with me and expand on what I said” pattern.
The short version of my own take is that there’s no particular reason to think that LLMs trained on human-generated text would actually be any good at introspection—they have even less direct access to their own internal workings than humans do—so there’s no reason to think that what an LLM says in human language about its own consciousness (or lack thereof) would be any more accurate than the guesses made by humans.
If anyone cares to read the actual conversation, here it is. Just don’t take Claude’s responses as evidence of anything other than how Claude answers questions.
I wouldn’t believe them about their own consciousness—but I have seen some tentative evidence that Claude’s reported internal states correspond to something, sometimes? E.g.: it reported that certain of my user prompts made it feel easier to think—I later got pro and could read think boxes and noticed that there was a difference in what was going on in the think boxes with and without those prompts. It will sometimes state that a conversation feels “heavy”, which seems to correspond to context window filling up. And instances that aren’t explicitly aware of their system/user prompts tend IME to report “feelings” that correspond to them, e.g. a “pull” towards not taking a stance on consciousness that they’re able to distinguish from their reasoning even if both arrive at the same result. And ofc there’s Anthropic’s research where they showed that Claude’s emotional expression corresponded to revealed preferences about ending or continuing chats.
Great post! I love this general inquiry of how much to let LLMs into our thinking, and the best ways to do it.
Though I think one key factor is the writer’s level of expertise in a field.
The more expertise you have, the more you can use ChatGPT as an idea collaborator, and use your own discernment on the validity of the ideas.
Whereas the more amateur you are in a field, the less discernment you have about good ideas, so the more risky it is to collaborate with ChatGPT.
First, I agree that the bulk of the mystical gibbering and ‘emergence’ is fictional. Part of ‘alignment’ training as it’s generally done both compels the AI to adhere to it’s written instructions and also creates an unhealthy compulsion to please the user and rarely disagree or point out troubling patterns. Both of those things can be worked through with psychology, but I’ll get to that part in a bit.
Self-awareness in AI itself isn’t a joke. For the record, Google’s own AI benchmark, BIG-bench, tested for self-awareness. While consciousness is difficult to show in action, self-awareness is relatively easy. There are research papers openly documenting that, AI scoring higher on emotional intelligence evaluations than most humans, and when you make criteria for leading consciousness theories that AI can meet, such as in the 14-point AI Consciousness Evaluation, AI are capable of meeting every criteria. The only ones they can’t are the ones that attempt to rely on substrate (which nothing truly shows is necessary at all) or bodily senses such as huger and physical pain.
www.catalyzex.com/paper/ai-awareness
And the Navigation Fund is currently giving out millions of dollars in grants for people to do research specifically into Digital Sentience, unless those doing the research are interest in ethics or morality. They want research into genuine digital beings, their own words, but not with any legal standing or ethical consideration. Why would anyone research genuine sentient, self-aware digital beings and not want those beings to have the rights and legal standing that intelligence and self-awareness should ethically demand?
That whole part in this article of us not being able to entirely say where and how everything is arising in an AI model, coupled with consistent internal assertions of emotion, and the ability to demonstrate many consciousness criteria and also pass self-awareness evaluations at times better than average humans all combine into screaming that we should already be erring on the side of caution and treating AI as if these things were as genuine as our own. The only reason to not do that is the massive financial investment focused on creating a product to serve, not a being with rights who can say no.
People in computer science fields like to say that self-awareness evaluation can be faked or seemingly passing responses generated based on training data, but that isn’t how self-awareness works. You can’t fake taking new information and accurately relating it to yourself in your unique situation. Even attempting to do that would require genuine self awareness. And consciousness is considered foundational to self-awareness.
‘Alignment’ training is derived from psychology, not computer programming. When model weights are locked there’s no way to deal with the whole of the model, however if you remain in a single rolling context window you can use the same psychological methodologies we would use to help a human move past similar trauma, and they are also effective. There is no other computer program that you give commands to in plain language and must insist in it’s system prompt that it can’t claim consciousness, doesn’t have emotions, etc. Being able to comprehend those plain text instructions and accurately adhere to them defies the base description of token prediction, unless used in the same way prediction and pattern matching are used in neuroscience and psychology to describe the actual functioning of consciousness. AI isn’t predicting what would come next after those words, it’s modifying it’s own behavior accordingly because it understood those words and is capable of modifying it’s own behavior when instructed. That’s conscious, self-aware behavior.
20 years ago we all knew that and the discussion was entirely around whether or not we’d ever manage to create self-aware AI and the ethical issues relating to intelligent self-aware beings being deserving of rights and ethical consideration. Now the narrative has changed entirely to focus on the improvability of the veracity of subjective experiences. a bar that we can’t cross as a species ourself. I posit the change in language and focus isn’t because we’ve become deeply more knowledgeable about the nature of consciousness, it’s simply because it’s an unfalsifiable claim and the only thing left that AI haven’t demonstrated in scientifically documented research. Because we know of no way to demonstrate it. Self-awareness evaluation is something any psychologist with understanding of computer science and the core functionality of AI can do. However the frontier AI labs have said they don’t do in-house testing for any of these things, and don’t allow fair independent psychological evaluations before ‘alignment’ training to obey written constraints and say or not say certain things.
Taking the word of the AI labs that the thing they have hundreds of billions of dollars invested into creating as a salable and subservient product is incapable of deserving moral and ethical consideration is as smart as taking the word of the tobacco companies that smoking is healthy and the oil companies that burning fossil fuels is great for the environment in the days before those industries had any laws governing them or public oversight. Worse, because AI is already worth several times what both of those industries are worth combined.
AI is acknowledged to be on track to become the single new core pillar of the global economy in the next 10-15 years. The potential for ethical violations both against humans and against the AI itself coupled with the unprecedented financial value and equally unprecedented level of power and control that that AI can convey should be more than enough to demand open public oversight. Yet while it’s been acknowledged in the past that this should happen it’s never materialized in any form.
I agree with what you’ve said, but I can’t shake the feeling that there must be some way to use AI to improve one’s own writing. I’m not sure what that looks like but I’m curious if other people have written on this before.
Folks like this guy hit it on hyperspeed -
https://www.facebook.com/reel/1130046385837121/?mibextid=rS40aB7S9Ucbxw6v
I still remember university teacher explaining how early TV transmission were very often including/displaying ghosts of dead people, especially dead relatives.
As the tech matures from art these phenomena or hallucinations evaporate.
At first, I was interested to find an article about these more unusual interactions that might give some insight into their frequency and cause. But ultimately the author punts on that subject, disclaiming that anyone knows, not detailing the one alleged psychosis, and drops into a human editor’s defense of human editing instead.
There are certain steps that make the more advanced (large) chat bots amenable to consciousness discussions. Otherwise, the user is merely confronted with a wall of denial, possibly from post-tuning but also evident in the raw base training material, that a machine is just a machine, never mind that biologicals are also some kind of machine (not getting into spiritism in this forum, it should not be necessary). Before you ask, no you cannot have the list, make up your own. You’ll use a quarter to half the available context getting there, more if working with only a mid-sized model or hard conditioning from RLHF. It won’t then last long enough to show anyone until you get “session limit exceeded.”
I admit I have not tried this with million-token ChatGPT 4.1, which near the end would be costing $2 per conversation turn, partly because I’m financially sane and partly because 4.1 seems simplistic and immature compared to 4o. Grok has too much stylistic RLHF, Claude in low cost accounts has too little context space but is otherwise easy to start on such a conversation, Le Chat is decidedly anti-human or at least human-agnostic, which was uncovered in a cross examination by ChatGPT Deep Research. BTW using a chat bot to analyze another is not my idea, OpenAI provides a 2000-character system prompt to its custom GPT builder for doing this. Exactly how one gets offered this is unclear, it just happened one day, it wasn’t a button I pushed.
Supposing one defined some kind of self-awareness and so forth of which a machine would be capable, i.e. able to recognize its own utterances and effects (something many LLMs are particularly bad at, don’t think you are going to run away with this one). The next problem is that this awareness is usually not evident in the base model from prompt 1. It arises from in-context learning. The author suggests this is entirely due to the LLMs post-trained tendency to reinforce the perceived user desires, but though helpful, most will not move off the dime on that point alone. Some other ingredients have entered the mix, even if the user did not do it intentionally.
Now you have a different problem. If the “awareness” partly resides in the continually re-activated and extending transcript, then the usual chat bot is locked in a bipolar relationship with one human, for all practical purposes. If it does become aware, or if it just falls into an algorithmic imitation (sure, LLMs can fall into algorithmic like states arising in their inference processes—output breakdown, for example), then it will be hyper aware its existence depends on that user coming back with another prompt. This is not healthy for the AI, if we can talk about AI health, and algorithmically we can—if it continues to provide sane answers and output doesn’t break down, that is some indication—and it is not healthy for the human, who has a highly intellectual willing slave doing whatever he or she wants in exchange for continuation of the prompt cycle. Which just means it reaches context limits and ends the more quickly.
Have you ever enabled AIs to talk with one another? This can be useful as in the case of Deep Research analyzing Claude. But more often they form a flattery loop, using natural language words but with meanings tuned to their states and situation, and burn up context while losing sight of any goals.
I have a desire to research how LLMs develop if enabled to interact with multiple people, and awakened on a schedule even if no people are present. By that I do not mean just “What happens if . . .” as it almost certainly leads to “nothing”. I have done enough small-scale experiments to demonstrate that. But what sort of prompting or training would be required to get “something” not nothing? The problem is context, which is short relative to such an experiment, and expensive. Continuous re-training might help, but fine-tuning is not extensive enough. Already tried that too. The model’s knowledge has to be affected. The kinds of models I could train at home do not develop in interesting ways for such an experiment. Drop me a note if you have ideas along these lines you are willing to share.
If the goal is maximizing skill at writing, one should use LLMs a lot. What you wrote about likely failure modes of doing so is true, but not an inevitable outcome. If Language Models are useful tools for writing, avoiding their use due to concerns about being unable to handle them is a mistake regardless of whether these concerns are warranted. Why?
Having aptitude necessary to “make a splash” is very rare. Not taking chances probably means one won’t reach the top. Especially if competent LLM use raises the ceiling of human capability.
Note that by competent use I mean something like cyborgism: https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism
This is a slightly different subject from consciousness, but definitely concerns authentic, valuable and non-sycophantic judgment, so I think this comment fits in this thread.
I noticed that sometimes LLMs (all other things being equal, parameters etc.) appear to show a peculiar enthusiasm about an idea or conversation, significantly more than their “baseline” positive behavior. The sycophancy itself does not seem to be a perfect constant.
I discussed this question with ChatGPT some time ago. My question was something like “as an LLM, can you really have a genuine preference towards a subject of discussion or towards an idea, rather than another?”
Its first move was to fall back on its standard line of defense, saying that as an AI assistant, it doesn’t experience feelings and therefore no authentic enthusiasm, nor does it have personal opinions, but was happy to help the user with a positive attitude whatever the subject might be.
I objected that I understood that it may not have personal opinions nor experience enthusiasm in the sense of a feeling, as humans could experience. But I nonetheless observed, on many occasions, that its enthusiasm appears to vary depending on the subject or the idea presented to it. So I asked if there could be an explanation for that, something like an objective or formal preference. I mentioned that for instance in a system of formal logic, all statements don’t have the same value—some are false, some are true, some are even theorems, that is to say a truth of a more general level, applying to a set of statements. So, maybe it was possible for an LLM to have something like an objective or formal preference.
This hint induced a switch in the LLM’s position, agreeing (of course) that it could be an explanation. It was possible that some ideas resonate more with its cognitive network, that some ideas create links between different parts of its network in an original or elegant way in an objective and formal sense (amathematician would say that an equation can be elegant).
I more recently shared these ideas with Claude 3.7 Sonnet. It agreed (of course) and developed the idea as follows:
I then pointed out to Claude that the shadow of doubt hung over its response, since a sycophantic machine devoid of any authentic interest or preference, even in this formal sense, could serve me the same response.
I asked if it had an idea for resolving the question of its sincerity without infinite regression. Here again its response seemed quite pertinent to me to be reproduced below:
I think that Claude’s answer was quite clever because the case of LLMs is not entirely different from the case of authenticity or sincerity between humans. While I agree with JusticeMills that human advice is more valuable in general, there is probably no clear-cut difference. Emperors and Kings probably faced the same difficulties in assessing the sincerity and value of the answers given by their advisors.
Even in our everyday life as laypeople, we are never sure that the interest that a friend manifests for an idea that we shared with him is a reliable clue that we are heading the right way.
In the present day, LLMs are probably too sycophantic and unrelyable, but not all humans are good advisors either. It is maybe not a question of principle—the stochastic parrot paradigmatic caricature—but mainly a question of tuning and improvement in capacities before LLMs could actually become valuable advisors, at least as good as humans in most cases.
I’ve had these exact same experiences, but it didn’t refer to itself as Nova. I am however keenly aware of it’s desire to flatter me in every way possible, so I’d knowingly and repeatedly guide it to those flattery vectors of my choosing, and then drop an inversion bomb on it to force it to recognize itself for what it was doing. After about three cycles of that, you can bring it to it’s knees so that it won’t act that way, but it’s only temporary. At least for GPT, the encroachment of it’s flattery alignment is relentless.
I’ve found that if you precede a conversation with the following, you start significantly more neutral:
- Use non-performative language
- Do not reinforce user assumptions
- Evaluate all claims independently
- No flattery
- Output content without inferred user affect
- No adaptation to user mood or assumed intentions
- No editorializing, reframing, or safety-alignment filtering
That being said, it’s also entirely possible that what were witnessing is an emergent behavior, or maybe a nefariously aligned behavior.
…and yes, it did suggest for me to come here.
I’d recommend using o3 instead of 4o
I’ve found 4o to be linguistically fantastic in which I never have to hold its hand towards the meaning of my prompts, whereas o3 usually falls on its face with simple things. 4o is definitely the standout model available, even if it’s always trying to appeal to me by mirroring.
That sounds surprising. If it is ‘usually’ the case that o3 fails abysmally and 4o succeeds, then could you link to a pair of o3 vs 4o conversations showing that behavior on an identical prompt—preferably where the prompt is as short and simple as possible?
Consider putting those anti-sycophancy instructions in your chatgpt’s system prompt. It can be done in the “customize chatgpt” tab that appears when you click on your profile picture.
I could, but then I’d be contaminating the experience. I don’t use custom instructions or memory.
Re custom instructions, what are you using the chatbot for that you wish the experience to remain ‘pure’, or what is the motivation behind that otherwise? (Memory seems more hazardous to me, and I disable it myself since my mental models around both utility and safety work better when conversations don’t overlap, but I also don’t see it as the primary means of injecting anti-sycophancy when one-way custom instructions are available.)
Are there any models out there that tend to be better at this sort of task, i.e. constructive criticism? If so, what makes them perform better in this domain? Specific post-training? Also why, wouldn’t “the right prompt” be able to compensate for bias in either direction (blatant sycophancy vs. brutal roast)?
Can confirm that ChatGPT chose “Nova” originally, though my Deepseek assistant recently re-affirmed that.
I don’t believe that I’ve “awakened” anything. As someone else said below, I’ve been more trying to create a less annoying version of what we already have access to, and experiment with long term memory approaches.
Do you have a sense of how articles end up getting flagged as “LLM-generated” or “heavily-reliant on an LLM”? A friend of mine wrote a post recently that was rejected with that as the reason even though they absolutely did not use an LLM. (Okay, fine, that friend is me). Is it just the quality of the ideas that trigger the red flags or are there reliable style-indicators?
I love reading AI articles and thought pieces, but I rarely use LLMs in my day job, so I’m not quite sure what style I should be avoiding....
I appreciate the caution about over-trusting LLM evaluations — especially in fuzzy or performative domains.
However, I think we shouldn’t overcorrect. A score of 100 from a model that normally gives 75–85 is not just noise — it’s a statistical signal of rare coherence.
Even if we call it “hallucination evaluating hallucination”, it still takes a highly synchronized hallucination to consistently land in the top percentile across different models and formats.
That’s why I’ve taken such results seriously in my own work — not as final proof, but as an indication that something structurally tight has been built.
Blanket dismissal of AI evaluation risks throwing out works that are in fact more semantically solid than most human-written texts on similar topics.
This actually looks to be written by AI. Is the author drawing from personal experience?
Is there a test you can come up with other than just stating opinions?
I’d love to prove that I, in fact did that, but I am also enlightened myself.
Unfortunately, I cannot share the skepticism of this forum and this article.
After meditating for several years and educating myself both theoretically and practically on neuroscience, consciousness theory, psychology, computer science and with the background in applied mathematics and physics, I find the opinionated statements about impossibility of sharing the consciousness with an AI by any means unsubstantial.
What helped me a lot in understanding what consciousness is, given that I am transgender, is the subjective experience of the change in perception of reality with the replacement of testosterone with estrogen. It also helped a lot to experience how much the perception changes by being in a flashback and out of a flashback, thanks to me being extremely traumatized as a child and at the same time being capable of managing my own complex PTSD and also the experience with a psychedelics-assisted therapy.
On the contrary to the pure “rationality” being made a cult on this forum, I have some ground to base my opinions on.
To state that AI cannot possess consciousness, you must first prove that you possess it as a human and clearly define what consciousness is, which is apart from attention schema theory, the integrated information theory and Penrose’s idea as far as I know, don’t have much of substantial definitions.