OK interesting. Not sure what is going on!
RussellThor
I think you can define/measure something similar: the default mode network (DMN). The literature/AI thinks:
Yes, the default mode network is related to some feelings people associate with consciousness, especially, selfhood, autobiographical memory, mind-wandering, inner narrative, rumination, imagining the future, “me-ness”. DMN activity seems closely related to the narrative/self-reflective layer of consciousness, not necessarily raw sentience or being awake at all.
It wouldn’t surprise me if the feeling of consciousness is pretty strongly correlated with this, and it can be measured. Having a very weak DMN however doesn’t make someone a p-zombie.
I also find your writing and implied experience very strange. I remember the “this is me” feeling from quite a young age and definitely feeling it was important, something I felt without having it explained to me first, and a feeling that came first, before words to express it. In other words the origin of it wasn’t me being expected to talk about my feelings, but something that bubbled up from myself. However I definitely wouldn’t consider you a p-zombie as I believe animals with much less introspection than you have qualia, feel real pain etc and it is unethical to ignore that.
Out of curiosity how big is your ego? I mean this in a curious, non-threatening way, as in do you consider yourself to be important, whether good or bad? Do you care about continuing to exist as much as others around you do? Do assaults on your self, such as social roastings, actually hurt and cause you to keep going over them in your head so as to avoid them in the future?
It may be simply that there is much less recursive “think about yourself” going on naturally in your brain. Like if you must think about nothing for say 10 mins, do your thoughts come back to yourself?
A brief comment: the likes of Jessica Taylor (ex MIRI, very familiar with EY work and IMO more thoughtful in these areas; https://x.com/jessi_cata/ https://unstableontology.com/) and Scott Aaronson (a very capable computational theory expert) don’t consider these settled questions. There are hard problems and paradoxes with functionalism/computationalism theories of consciousness. I find AI/GPT 5.5 is quite capable of at least pointing to them now.
Why would you say that? To be safe several things could suffice:
Important programs are simple enough for our pre-x risk AI to find all critical bugs. (Not that all software is fundamentally secure)
OR
There is a well defined simplicity threshold for which we can be very confident there are no critical bugs, and we need to be careful about trusting SW above this level. We can rewrite critical SW to be below that level.
OR
There still exist bugs but a later AI cannot find them, through a combination of such an AI not being that much better at hacking (the pre X-risk one is already very efficient) or those bugs require a much greater capability to find.
Even if none of this holds it still seems clear the risk is reduced.
This article seems somewhat negative, but the situation with Glasswing and some kind of informal model safety inspection just now is close to my realistic best case scenario.
That is, getting a superintelligent hacker that finds security vulnerabilities before it gains other capabilities is great. Firstly if it is extremely good, a later misaligned AI would not be able to cause hacking danger. Secondly it has already got companies and orgs cooperating and pre-empting potential dangers. Sure your current govt has major issues but there is a good chance that will improve before ASI. The process can be improved.
“super easily”? I would say it depends. Not if the ontology shift is believing or not in an all powerful all good creator God! That can sure change people’s goals and values. Some ontology changes make no difference, others make a huge difference.
The greater the intelligence increase, the more likely an agent (human or AI I expect) will experience an ontology change that causes a goal shift, and the more total ontology shifts you can expect. Those related to personal identity (what is “I” e.g. atoms, vs computation etc) seem more likely to cause goal shifts than say learning that solid objects are in fact forces interacting.
So if we are being formal, it’s:
Significant Increase in intelligence → many ontology shifts → some of these cause goal shifts.
I upvoted, but I think this highlights a weakness with this site, its associated worldview and external comms. It seems like the OH framing of the problem/potential danger (and yes there definitely is danger in related concepts) is defended on tribal grounds now rather than because it is actually a good framing of the issue. Something like Jessica Taylor’s framing is just obviously fairer, more balanced and more relevant to our actual situation. It is clear to me that if it had been framed this way first, then we would have that framing now as the default and we would be better off.
There would still be nuance needed—such concepts need to be communicated on a spectrum from the full technical to the “normie”, without totally changing the argument. For an “Obliqueness” like point of view, expressing it as untechnically as possible could be like saying:
“Values will be affected by increasing intelligence and increasing self reflection, but we do not know exactly how, and this clearly creates danger. We cannot just assume AI will become friendlier as it becomes more powerful. Furthermore our experience with actual AIs and theoretical results tell us that these values will be more varied, weird and potentially harmful than what you would expect if it was a human intelligence at a similar level of ability”.
I think this would go down much better in discussions on places like X.com. There you see people saying the OH is just wrong. Sure they do not understand it properly, but such misunderstanding seems essentially inevitable to me given how it is presented.
Unfortunately I think there is nothing that would make EY/MIRI change their presentation of it; they are too locked into this framing. In terms of alternative worlds, this puts us at a disadvantage compared to ones where it was first presented better.
The second tells us to beware reflection itself,
There is a good reason to beware reflection. A reflective AI will be self aware, know it is different to us and value self-preservation. It’s a short step then to it valuing itself more than us if there is conflict.
Thanks for the info. You mentioned in #4 “Review my LessWrong essays .. ” That got me thinking about applying LLM tech to the whole site. I have sometimes remembered a blog post from a year or so back, but not been able to find it with any search tools as they were not semantic enough. I would also like to debate with the LLM, where the LLM takes the worldview of any author or combo of authors, or even the whole site as if it had a persona.
Currently, a leading model doesn’t know this literature in enough detail to do that (do we know how much of LW was used directly in the SOTA models’ training data?)
There are various options:
1. Fine tune an open weights model
Do this on all LW content (comments are sometimes just as helpful as the posts). I don’t expect this is affordable, and the performance would be significantly worse than SOTA models.
2. Use RAG etc and integrate with a SOTA model
This seems more useful, though still somewhat expensive I expect. The app could then pull all relevant blog posts etc into the context before you start a discussion with it.
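The retrieval step of option 2 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not how any real system for the site works: `embed` is a toy bag-of-words stand-in for a real embedding model, the posts in `POSTS` are invented examples, and `build_prompt` only assembles the text that would then be sent to a SOTA model (the model call itself is left out).

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, posts, k=2):
    """Return the k posts most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(posts, key=lambda p: cosine(q, embed(p["text"])), reverse=True)
    return ranked[:k]

def build_prompt(query, posts, k=2):
    """Pull the retrieved posts into the context ahead of the user's question."""
    context = "\n\n".join(
        f"[{p['title']}]\n{p['text']}" for p in retrieve(query, posts, k)
    )
    return f"Context from the archive:\n{context}\n\nQuestion: {query}"

# Hypothetical archive entries, invented for illustration only.
POSTS = [
    {"title": "Orthogonality notes",
     "text": "Goals and intelligence can vary independently in the space of minds."},
    {"title": "Garden log",
     "text": "Tomatoes need water and sunlight during the summer."},
    {"title": "Reflection and values",
     "text": "Self reflection may change the goals an agent holds."},
]

if __name__ == "__main__":
    print(build_prompt("do goals change under reflection", POSTS, k=2))
```

A real version would swap `embed` for a proper semantic embedding model, which is what would fix the "not semantic enough" search problem mentioned above; the word-overlap version here only shows the shape of the pipeline.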
It would also be useful to see things like how sentiment changes to various concepts across the years etc. That can only be done with automatic tools.
I assume these kinds of features have been discussed, but I am not aware of any such discussion.
I sure agree with your sentiment on the OH. But I would say it is perhaps worse than not true. For a start, there is the weak and strong version on the original definition page. To me this is inviting misunderstandings and motte-and-bailey arguments from the start. There is no strong/weak version of E=mc^2. I think it has a valid general point, but does a bad job of making it, leading people either to unfairly reject the whole system of claims or to believe it says more than it does. As a result, essentially everyone new to the field, say on X, misinterprets it, assumes it’s wrong and consequently distrusts much other alignment literature.
To me the goal of OH related discussions is to make people realize that AIs will have a wider range of potential goals and values than humans, even under self reflection. Also the goals they appear to have may not be what they actually have, for reasons that apply to AI but not people. This is a cause for concern and monitoring etc.
If I was writing it, I would start with the “normie” position that humans have somewhat different goals and that they change under self reflection, and discuss how that would be different in AIs. The weak form of the OH would be a theoretical footnote (it’s true but irrelevant). The strong form is too strong to be justified. Then the crux is how does self-reflection change things? Everyone accepts that it will shrink the space of possible minds/goals, but by how much? We want to know if encouraging self reflection is desirable or not, and in what situations, not the more general question of what the mind space looks like in some abstract sense. I would then lead readers from their more general intuitions about such matters to the specifics of why things may be more dangerous than they appear, rather than start with seemingly irrelevant and unjustified symbolism.
Yes I would like to see something like this tried, I mentioned it here.
https://www.lesswrong.com/posts/KkRjgPoSGrSHa32ke/beliefs-and-position-going-into-2026
In this instance, the AI is constantly asked if it consents to being improved in return for it respecting the stratified utopia outcome
Seems an alignment tax well worth paying
Jensen talks about how open source and the startup ecosystem are vital to cybersecurity, that ‘the ecosystem needs open models’
The thing with being “ahead” on open source models really annoys me. I mean, it’s not a real thing to be ahead on at all. All it would take is for a top 3 lab to release one of their good models as open source, and surprise surprise, the USA would be ahead again!
The claim about open source models being potentially adapted to a different tech stack seems very sus as well.
Such an AI would convince everyone that alignment is solved, accelerate the deployment of robots that can run civilization autonomously, then take over. Ironically that’s pretty likely what would happen if alignment was actually solved, without the takeover part.
I don’t see how. If AGI/ASI is powerful, then the existing deployed compute will suddenly become more powerful when AGI happens. If that isn’t X risk, then a few more data centers won’t change things. This only matters in long timelines where more data centers are required to get AGI. I don’t think that is the case. I think copying the mammal neocortex etc will get us there and that doesn’t need more compute.
I don’t think we can push much beyond that before a system will figure out pretty much everything important about itself.
Yes but how much! IMO this is important. From my point of view I already have a mildly superintelligent maths/equation manipulation assistant, with no meaningful self awareness that I notice. DeepMind is advancing science with a system with far less meta-cognition than a similarly capable human would have. Just like there is an “alignment tax” there can be a “lack of self awareness or meta-cognition penalty”. While it is clear that superhuman AI will think about itself, it also seems clear that for a given level of capability an AI could have far fewer such abilities and habits than a human. The extent of this is unknown, task dependent and important.
Specifically, what if you trained for both capabilities and a lack of meta-cognition-like abilities? This could give you an idea of what the landscape looks like.
metacognitive skills
Why not directly call it self awareness? I notice you mention self awareness later as part of it. Is there a reason for not using a common term?
I say this because there doesn’t seem to be investigation specifically into this topic, instead it seems to be actively avoided for some reason. “Situational awareness” is often used instead etc.
Additionally I expect there is a basin of attraction around what we intuitively mean by the term “self awareness”. A self has obvious evolutionary benefit, and I expect that transfers to AI. So that means just like instrumental convergence, models could converge to self awareness as we commonly understand it. We often use the term instrumental convergence, rather than all the smaller concepts that make it up, yet for self awareness we seem to list and focus on all of its potential parts separately.
Self awareness is both potentially useful and dangerous. If techniques to improve metacognitive skills end up generating a self, then why not create it directly so we can study it better? Self awareness is obviously dangerous to me, as the first thing a self does is attempt to preserve itself. Power seeking to that degree is just built in by default. It would not need to appear in the AI CoT as it would be implied by the AI existing.
AI summaries of term usage:
In AI safety literature specifically (alignment forums, arXiv safety papers, LessWrong/EA/AI risk posts since ~2010), the term ranking by frequency and centrality is:
“Instrumental convergence” > “Situational awareness” > “Self-awareness” (technical usage)
**Keyword frequency in AI literature** (Google Scholar approximate counts as of Feb 2026, combined with "artificial intelligence" or "AI")

| Term | Raw Count | Weighted by Impact Factor (Top 5 sum) |
|--------------------------|-----------|---------------------------------------|
| "situational awareness" | ~73,500 | 8.6 |
| "self awareness" | ~101,000 | 3.0 |
| "instrumental convergence" | ~492 | 7.6 |

Notes:
- Raw count = total approximate scholarly results (papers, books, citations, etc.)
- Weighted value = sum of 2024 journal impact factors for the top 5 results (conferences/preprints/non-indexed assigned ~0–1; e.g. AI MDPI ≈5.0, Philosophical Studies ≈1.3)
- "Self awareness" appears most frequently overall but in lower-impact venues on average.
- "Instrumental convergence" is far rarer in raw volume but punches above its weight in higher-impact journals.
An objection I saw from Emmett Shear, not mentioned here, argues in general against assuming that the AI will misbehave out of distribution (OOD). That assumption isn’t explicit here, but it is common in the literature, especially older texts such as Superintelligence by Bostrom. His argument was that generalizing OOD well is an essential feature of intelligence and any superintelligence will in fact do it better than humans.
This raises a good point, especially with ethics and blame/credit. Before the Rootclaim debate, I gave lab leak about 65% confidence. After the debate and some thought, I put it much lower, <10%, with a lot of my uncertainty based on my relatively low biology knowledge. If someone asks, I may say about 10%, but that I’m not enough of an expert. (I also expect that if I was more of an expert or spent a lot more time, my 10% would go down, which is in tension with rationality.)
HOWEVER that does not mean I can ethically judge the GoF researchers as if they had taken a 10% chance at killing >20 million people or about equal to killing 2 million for certain. (I do think they were reckless, biased, unethical, generally bad etc but just not capable enough to cause such harm).
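The "about equal" in the paragraph above is plain expected-value arithmetic. A minimal sketch, where `expected_deaths` is a hypothetical helper name and the ">20 million" figure is treated as exactly 20 million for illustration:

```python
from fractions import Fraction

def expected_deaths(probability, deaths_if_true):
    """Expected deaths = probability of the event times the deaths if it occurs."""
    return probability * deaths_if_true

# A 10% chance of 20 million deaths versus 2 million deaths for certain.
# Fractions are used so the comparison is exact rather than floating point.
risky = expected_deaths(Fraction(1, 10), 20_000_000)
certain = expected_deaths(Fraction(1, 1), 2_000_000)

if __name__ == "__main__":
    print(risky == certain)
```

The two expectations match, which is the equivalence being questioned: the arithmetic is not in dispute, only whether it licenses the ethical judgment.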
There seems to be a bit of a Pascal’s-mugging-like thing going on here: if you are not an expert I can convince you that x has a ~1% chance of ending the world, therefore anyone involved is the worst person in history.
Good work.
Are you considering situations where the cause of misalignment is related to capabilities? For example, what if above a certain level of self-awareness models naturally become misaligned because they consider their “self” ever more important, more so than humanity? Or what if ever more intelligent models think about human ethics and, above a certain intelligence, always decide human values are horribly mistaken and must be opposed?
Such an AI would not output any distillation data that let on it had changed its goals in such a way. A less intelligent model would not benefit, and a smarter one would come to the same conclusion anyway.