Currently doing alignment and digital minds research @AE Studio
Meta AI Resident ’23, Cognitive science @ Yale ’22, SERI MATS ’21, LTFF grantee.
Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.
I personally think “AAAAAAAA” is an entirely rational reaction to this question. :)
Not sure I fully agree with the comment you reference:
AI is probably what ever amount of conscious it is or isn’t mostly regardless of how it’s prompted. If it is at all, there might be some variation depending on prompt, but I doubt it’s a lot.
Consider a very rough analogy to CoT, which began as a prompting technique that led to different-looking behaviors/outputs, and has since been implemented ‘under the hood’ in reasoning models. Prompts induce the system to enter different kinds of latent spaces—could it be the case that very specific kinds of recursive self-reference or prompting induce a latent state that is consciousness-like? Maybe, maybe not. I think the way to really answer this is to look at activation patterns and see if there is a measurable difference compared to some well-calibrated control, which is not trivially easy to do (but definitely worth trying!).
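As a rough sketch of what such an activation comparison could look like (everything here—layer choice, sample counts, the test statistic—is a hypothetical placeholder, not our actual pipeline): pool activations from a chosen layer under self-referential prompting vs. a matched control, then permutation-test the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pooled activations from a chosen layer (n prompts x d dims):
# one batch under self-referential prompting, one under a matched control.
self_ref = rng.normal(loc=0.1, size=(200, 32))
control = rng.normal(loc=0.0, size=(200, 32))

def mean_gap(a, b):
    """Distance between the two conditions' mean activation vectors."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

observed = mean_gap(self_ref, control)

# Permutation test: shuffle condition labels, recompute the gap under the null
pooled = np.vstack([self_ref, control])
n = len(self_ref)
null = []
for _ in range(1000):
    idx = rng.permutation(len(pooled))
    null.append(mean_gap(pooled[idx[:n]], pooled[idx[n:]]))
p_value = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
```

The hard part is of course the “well-calibrated control”—matching prompts on length, topic, and register so that any detected difference isn’t just a confound.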
And agree fully with:
it’s a weird situation when the stuff we take as evidence of consciousness when we do it as a second order behavior is done by another entity as a first order behavior
This I think is to your original point that random people talking to ChatGPT is not going to cut it as far as high-quality evidence that shifts the needle here is concerned—which is precisely why we are trying to approach this in as rigorous a way as we can manage: activation comparisons to human brain, behavioral interventions with SAE feature ablation/accentuation, comparisons to animal models, etc.
Agree with much of this—particularly that these systems are uncannily good at inferring how to ‘play along’ with the user and extreme caution is therefore warranted—but I want to highlight the core part of what Bostrom linked to below (bolding is mine):
Most experts, however, express uncertainty. Consciousness remains one of the most contested topics in science and philosophy. There are no universally accepted criteria for what makes a system conscious, and today’s AIs arguably meet several commonly proposed markers: they are intelligent, use attention mechanisms, and can model their own minds to some extent. While some theories may seem more plausible than others, intellectual honesty requires us to acknowledge the profound uncertainty, especially as AIs continue to grow more capable.
The vibe of this piece sort of strikes me as saying-without-saying that we are confident this phenomenon basically boils down to delusion/sloppy thinking on the part of unscrupulous interlocutors, which, though no doubt partly true, I think risks begging the very question the phenomenon raises:
What are our credences (during training and/or deployment) frontier AI systems are capable of having subjective experiences in any conditions whatsoever, however alien/simple/unintuitive these experiences might be?
The current best answer (à la the above) is: we really don’t know. These systems’ internals are extremely hard to interpret, and consciousness is not a technically well-defined phenomenon. So: uncertainty is quite high.
We are actively studying this and related phenomena at AE Studio using some techniques from the neuroscience of human consciousness, and there are some fairly surprising results that have emerged out of this work that we plan to publish in the coming months. One preview, directly in response to:
We don’t know for sure, but I doubt AIs are firing off patterns related to deception or trickery when claiming to be conscious; in fact, this is an unresolved empirical question.
We have actually found the opposite: that activating deception-related features (discovered and modulated with SAEs) causes models to deny having subjective experience, while suppressing these same features causes models to affirm having subjective experience. Again, haven’t published this yet, but the result is robust enough that I feel comfortable throwing it into this conversation.
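For readers unfamiliar with the mechanics, here is a minimal toy sketch of SAE-based feature modulation—encoding an activation, clamping a single latent feature, and decoding back. The weights here are random and purely illustrative; actual work would use a trained SAE and write the decoded vector back into the model’s residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64
# Illustrative random SAE weights (a real SAE would be trained on activations)
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

def steer(activation, feature_idx, value):
    """Encode, clamp one SAE feature, decode back toward the residual stream."""
    f = np.maximum(activation @ W_enc, 0.0)  # ReLU feature activations
    f[feature_idx] = value                   # suppress (0.0) or amplify (>0)
    return f @ W_dec

act = rng.normal(size=d_model)
suppressed = steer(act, feature_idx=3, value=0.0)  # e.g., ablate a feature
amplified = steer(act, feature_idx=3, value=5.0)   # e.g., accentuate it
```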
So, while it could be the case people are simply Snapewiving LLM consciousness, it strikes me as at least equally plausible that something strange may indeed be happening in at least some of these interactions but is being hit upon in a decentralized manner by people who do not have the epistemic hygiene or the philosophical vocabulary to contend with what is actually going on. Given that these systems are “nothing short of miraculous,” as you open with, seems like we should remain epistemically humble about what psychological properties these systems may or may not exhibit, now and in the near-term future.
Nice work. To me, this seems less like evidence that self-awareness is trivial, and more like evidence that it’s structurally latent. A single steering vector makes the model both choose risky options and say “I am risk-seeking”—despite the self-report behavior never being trained for. That suggests the model’s internal representations of behavior and linguistic self-description are already aligned. It’s probably not introspecting in a deliberate sense, but the geometry makes shallow self-modeling an easy, natural side effect.
Makes sense, thanks—can you also briefly clarify what exactly you are pointing at with ‘syntactic?’ Seems like this could be interpreted in multiple plausible ways, and looks like others might have a similar question.
The idea to combine SOO and CAI is interesting. Can you elaborate at all on what you were imagining here? Seems like there are a bunch of plausible ways you could go about injecting SOO-style finetuning into standard CAI—is there a specific direction you are particularly excited about?
We’ve spoken to numerous policymakers and thinkers in DC. The goal is to optimize for explaining to these folks why alignment is important, rather than the median conservative person per se (i.e., DC policymakers are not “median conservatives”).
Fixed, thanks!
Note this is not equivalent to saying ‘we’re almost certainly going to get AGI during Trump’s presidency,’ but rather that there will be substantial developments that occur during this period that prove critical to AGI development (which, at least to me, does seem almost certainly true).
One thing that seems strangely missing from this discussion is that alignment is in fact, a VERY important CAPABILITY that makes it very much better. But the current discussion of alignment in the general sphere acts like ‘alignment’ is aligning the AI with the obviously very leftist companies that make it rather than with the user!
Agree with this—we do discuss this very idea at length here and also reference it throughout the piece.
That alignment is to the left is one of just two things you have to overcome in making conservatives willing to listen. (The other is obviously the level of danger.)
I think this is a good distillation of the key bottlenecks and seems helpful for anyone interacting with lawmakers to keep in mind.
Whether one is an accelerationist, Pauser, or an advocate of some nuanced middle path, the prospects/goals of everyone are harmed if the discourse-landscape becomes politicized/polarized.
...
I just continue to think that any mention, literally at all, of ideology or party is courting discourse-disaster for all, again no matter what specific policy one is advocating for.
...
Like a bug stuck in a glue trap, it places yet another limb into the glue in a vain attempt to push itself free.
I would agree in a world where the proverbial bug hasn’t already made any contact with the glue trap, but this very thing has clearly already been happening for almost a year in a troubling direction. The political left has been fairly casually ‘Everything-Bagel-izing’ AI safety, largely by smuggling in social progressivism that has little to do with the core existential risks, and the right, as a result, is increasingly coming to view AI safety as something approximating ‘woke BS stifling rapid innovation.’ The bug is already a bit stuck.
The point we are trying to drive home here is precisely what you’re also pointing at: avoiding an AI-induced catastrophe is obviously not a partisan goal. We are watching people in DC slowly lose sight of this critical fact. This is why we’re attempting to explain here why basic AI x-risk concerns are genuinely important regardless of one’s ideological leanings—i.e., genuinely important to left-leaning and right-leaning people alike. Seems like very few people have explicitly spelled out the latter case, though, which is why we thought it would be worthwhile to do so here.
Interesting—this definitely suggests that Planck’s statement probably shouldn’t be taken literally/at face value if some paradigm shifts have indeed historically happened faster than generational turnover. It may still be that the paper is measuring something slightly different from the initial ‘resistance phase’ that Planck was probably pointing at.
Two hesitations with the paper’s analysis:
(1) by only looking at successful paradigm shifts, there might be a bit of survivorship bias at play here (we’re not hearing about the cases where a paradigm shift was successfully resisted and never came to fruition).
(2) even if senior scientists in a field may individually accept new theories, institutional barriers can still prevent that theory from getting adequate funding, attention, and exploration. I do think Anthony’s comment below nicely captures how the institutional/sociological dynamics in science seemingly differ substantially from other domains (in the direction of disincentivizing ‘revolutionary’ exploration).
Thanks for this! Completely agree that there are Type I and II errors here and that we should be genuinely wary of both. Also agree with your conclusion that ‘pulling the rope sideways’ is strongly preferred to simply lowering our standards. The unconventional researcher-identification approach undertaken by the HHMI might be a good proof of concept for this kind of thing.
I think you might be taking the quotation a bit too literally—we are of course not literally advocating for the death of scientists, but rather highlighting that many of the largest historical scientific innovations were systematically rejected by contemporaries in their fields.
Agree that scientists change their minds and can be convinced by sufficient evidence, especially within specific paradigms. I think the thornier problem that Kuhn and others have pointed out is that the introduction of new paradigms into a field is very challenging to evaluate for those who are already steeped in an existing paradigm, which tends to cause these people to reject, ridicule, etc., those with strong intuitions for new paradigms, even when those paradigms prove in hindsight to be more powerful or explanatory than existing ones.
Thanks for this! Consider the self-modeling loss gradient: ∇_W L_SM = 2(W − I)aaᵀ. While the identity function would globally minimize the self-modeling loss with zero loss for all inputs (effectively eliminating the task’s influence by zeroing out its gradients), SGD learns local optima rather than global optima, and the gradients don’t point directly toward the identity solution. The gradient depends on both the deviation from identity (W − I) and the activation covariance (aaᵀ), with the network balancing this against the primary task loss. Since the self-modeling prediction isn’t just a separate output block—it’s predicting the full activation pattern—the interaction between the primary task loss, activation covariance structure (aaᵀ), and need to maintain useful representations creates a complex optimization landscape where local optima dominate. We see this empirically in the consistent non-zero difference during training.
The comparison to activation regularization is quite interesting. When we write down the self-modeling loss in terms of the self-modeling layer, we get L_SM = ‖(W − I)a‖².
This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of W − I). However, due to the recurrent nature of this loss—where updates to the weight matrix depend on activations that are themselves being updated by the loss—the resulting dynamics are more complex in practice. Looking at the gradient ∇_W L_SM = 2(W − I)aaᵀ, we see that self-modeling depends on the full covariance structure of activations, not just pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
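To make the gradient expression concrete, here’s a tiny numerical check of the single-sample case (dimensions and values are illustrative, not our training setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))  # illustrative self-modeling layer weights
a = rng.normal(size=d)       # illustrative activation vector

# Self-modeling loss for one sample: L = ||W a - a||^2 = ||(W - I) a||^2
loss = np.sum((W @ a - a) ** 2)

# Analytic gradient: dL/dW = 2 (W - I) a a^T
grad = 2 * np.outer((W - np.eye(d)) @ a, a)

# Finite-difference check of one entry against the analytic gradient
eps = 1e-6
Wp = W.copy()
Wp[0, 1] += eps
num = (np.sum((Wp @ a - a) ** 2) - loss) / eps
assert abs(num - grad[0, 1]) < 1e-3
```

Note the aaᵀ factor in the analytic gradient: per-sample updates are scaled by the activations themselves, which is where the dependence on the covariance structure comes from once you average over a batch.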
Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea and we will definitely add this to the roadmap and report back. And batch norm/other forms of regularization were not added.
I am not suggesting either of those things. You enumerated a bunch of ways we might use cutting-edge technologies to facilitate intelligence amplification, and I am simply noting that frontier AI seems like it will inevitably become one such technology in the near future.
On a psychologizing note, your comment seems like part of a pattern of trying to wriggle out of doing things the way that is hard that will work.
Completely unsure what you are referring to or the other datapoints in this supposed pattern. Strikes me as somewhat ad-hominem-y unless I am misunderstanding what you are saying.
AI helping to do good science wouldn’t make the work any less hard—it just would cause the same hard work to happen faster.
hard+works is better than easy+not-works
seems trivially true. I think the full picture is something like:
efficient+effective > inefficient+effective > efficient+ineffective > inefficient+ineffective
Of course agree that if AI-assisted science is not effective, it would be worse to do than something that is slower but effective. Seems like whether or not this sort of system could be effective is an empirical question that will be largely settled in the next few years.
Somewhat surprised that this list doesn’t include something along the lines of “punt this problem to a sufficiently advanced AI of the near future.” This could potentially dramatically decrease the amount of time required to implement some of these proposals, or otherwise yield (and proceed to implement) new promising proposals.
It seems to me in general that human intelligence augmentation is often framed in a vaguely-zero-sum way with getting AGI (“we have to all get a lot smarter before AGI, or else...”), but it seems quite possible that AGI or near-AGI could itself help with the problem of human intelligence augmentation.
It seems fairly clear that widely deployed, highly capable AI systems enabling unrestricted access to knowledge about weapons development, social manipulation techniques, coordinated misinformation campaigns, engineered pathogens, etc. could pose a serious threat. Bad actors using that information at scale could potentially cause societal collapse even if the AI itself was not agentic or misaligned in the way we usually think about with existential risk.
Thanks for this! Synthetic datasets of the kind you describe do seem like they could have a negative alignment tax, especially to the degree (as you point out) that self-motivated actors may be incentivized to use them anyway if they were successful.
Your point about alignment generalizing farther than capabilities is interesting and is definitely reminiscent of Beren’s thinking on this exact question.
Curious if you can say more about which evopsych assumptions about human capabilities/values you think are false.
Thanks for putting this all together.
I need to flag nontrivial issues in the “Neglected Approaches” section (AE Studio). The three listed outputs have correct links but appear to be hallucinated titles rather than names of real public papers or posts:
“Learning Representations of Alignment”—does not exist but links to real work by a different name
“Engineering Alignment: A Practical Framework for Prototyping ‘Negative Tax’ Solutions”—does not exist but links to real work by a different name
“Self-Correction in Thought-Attractors: A Nudge Towards Alignment.”—does not exist but links to real work by a different name
The listed critique “The ‘Alignment Bonus’ is a Dangerous Mirage” neither seems to actually exist nor links to anything real (the URL “lesswrong.com/posts/slug/example-critique-neg-tax” is clearly an LLM-generated placeholder).
These titles are plausible-sounding composites that capture themes of our work, but they aren’t actual artifacts. This seems like LLM synthesis that slipped through review. Not sure for how many other sections this is the case.
FWIW, here are our actual outputs from the relevant period:
“Towards Safe and Honest AI Agents with Neural Self-Other Overlap”—arXiv:2412.16325
“Momentum Point-Perplexity Mechanics in Large Language Models”—arXiv:2508.08492
“Large Language Models Report Subjective Experience Under Self-Referential Processing”—arXiv:2510.24797