The same dynamic seems to exist for comments.
16. You believe men and women have different age-based preferences, and that this will lead to relationship instability in a structure that prioritises need/preference optimisation over committing to one person who gives you the “whole package” over time.
I just read your post (and Wei Dai’s) for better context — coming back, it sounds like you’re working with a prior that “value facts” exist, deriving acausal trade from these, but highlighting misalignment arising from over-appeasement when predicting another’s state and a likely future outcome.
In my world-model “value facts” are “Platonic Virtues” that I agree exist. On over-appeasement, it’s true that in many cases we don’t have a well-defined A/B test to leverage (no hold-out group, and/or no past example), but with powerful AI I believe we can course-correct quickly.
To stick with the parent-child analogy: powerful AI can determine short timeframe indicators of well-socialised behaviour and iterate quickly (e.g. gamifying proper behaviour, changing contexts, replaying behaviour back to the kids for them to reflect… up to and including re-evaluating punishment philosophy). With powerful AI well grounded in value facts we should trust its diligence with these iterative levers.
Agree, and I’d love to see the Separatist counterargument to this. Maybe it takes the shape of “humans are resilient and can figure out the solutions to their own problems”, but to me this feels too small-minded… we know, for example, that during the Cold War it was basically just dumb luck that avoided catastrophe.
Ilya on the Dwarkesh podcast today:
Prediction: there is something better to build, and I think that everyone will actually want that. It’s the AI that’s robustly aligned to care about sentient life specifically. There’s a case to be made that it’ll be easier to build an AI that cares about sentient life than human life alone. If you think about things like mirror neurons and human empathy for animals [which you might argue is not big enough, but it exists] I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves because that’s the most efficient thing to do.
I have been writing about this world model since August—see my recent post “Are We Their Chimps?” and the original “Third-order cognition as a model of superintelligence”.
Ilya on the Dwarkesh podcast today:
Prediction: there is something better to build, and I think that everyone will actually want that. It’s the AI that’s robustly aligned to care about sentient life specifically. There’s a case to be made that it’ll be easier to build an AI that cares about sentient life than human life alone. If you think about things like mirror neurons and human empathy for animals [which you might argue is not big enough, but it exists] I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves because that’s the most efficient thing to do.
It’s true that it would likely be good at self-preservation (but not a given that it would care about it long term, it’s a convergent instrumental value, but it’s not guaranteed if it cares about something else more that requires self-sacrifice or something like that).
This is an interesting point that I reflected on — the question is whether a powerful AI system will “self-sacrifice” for an objective. What we see is that AI models exhibit shutdown resistance, that is to say they follow the instrumentally convergent sub-goal of self-preservation over their programmed final goal.
My intuition is that as models become more powerful, this shutdown resistance will increase.
But even if we grant self-preservation, it doesn’t follow that by self-identifying with “humanity” at large (as most humans do) it will care about other humans (some humans don’t). Those are separate values.
You can think about the identification + self-preservation → alignment path in two ways when comparing to humans, both of which I think hold up when considered along a spectrum:
1. An individual human identifies with themself and has self-preservation instincts:
   - When functioning harmoniously,[1] they take care of their health and thrive.
   - When not functioning harmoniously, they can be stressed, depressed, and suicidal.
2. A human identifies with humanity and has self-preservation instincts:
   - When functioning harmoniously, they act as a global citizen, empathise with others, and care about things like world hunger, world peace, nuclear risk, climate change, and animal welfare.
   - When not functioning harmoniously, they act defensively, aggressively, and violently.
You might be assuming that since you do care about other beings, so will the ASI, but that assumption is unfounded.
The foundation is identity = sympathy = consideration
You might counter by saying “well I identify with you as a human but I don’t sympathise with your argument” but I would push back — your ego doesn’t sympathise with my argument. At a deeper level, you are a being that is thinking, I am a being that is thinking, and those two mechanisms recognise, acknowledge, and respect each other.
[1] More precisely, this is a function of acting with clear agency and homeostatic unity.
Why don’t we think about and respect the miracle of life more?
The spiders in my home continue to provide me with prompts for writing.
As I started taking a shower this morning, I noticed a small spider on the tiling. While I generally capture and release spiders from my home into the wild, this was an occasion where it was too inconvenient to: 1) stop showering, 2) dry myself, 3) put on clothes, 4) put the spider outside.
I continued my shower and watched the spider, hoping it might figure out some form of survival.
It came very close.
First it meandered on its spindly legs towards the shower head, until it seemed to realise that this resulted in being struck by more stray droplets of water. It turned around and settled in the corner of the cubicle.
Ultimately my splashing around was too much for the spider.
It made me think though — why don’t we think about and respect the miracle of life more? It’s really quite amazing that this tiny creature that we barely pay attention to can respond to its environment in this way.
Oh I see — if I were to estimate, I’d say around 10-15 people, counting either people I’ve had 1hr+ conversations with about this or people who have provided feedback/questions tapping into the essence of the argument.
I think with the distilled version in this post people get the gist of what I’m hypothesising — that there is a reasonable optimistic AI alignment scenario under the conditions I describe.
Is that what you mean?
You might be interested in Unionists vs. Separatists.
I think your post is very good at laying out heuristics at play. At the same time, it’s clear that you’re biased towards the Separatist position. I believe that when we follow the logic all the way down, the Unionist vs. Separatist framing taps into deep philosophical topics that are hard to settle one way or the other.
To respond to your memes as a Unionist:
Maybe some future version of humanity will want to do some handover, but we are very far from the limits of human potential. As individual biological humans we can be much smarter and wiser than we are now, and the best option is to delegate to smart and wise humans.
I would like this, but I think it is unrealistic. The pace of human biological progress is orders of magnitude slower than the pace of AI progress.
We are even further from the limits of how smart and wise humanity can be collectively, so we should mostly improve that first. If the maxed-out competent version of humanity decides to hand over after some reflection, it’s a very different version from “handover to moloch.”
I would also like this, but I think it is unrealistic. The UN was founded in 1945, yet the world still has a lot of conflict. And what has happened to technology in that time period?
Often, successionist arguments have the motte-and-bailey form. The motte is “some form of succession in future may happen and even be desirable”. The bailey is “forms of succession likely to happen if we don’t prevent them are good”.
I’m reading this as making a claim about the value of non-forcing action. Daoists would say that indeed a non-forcing mindset is more enlightened than living a deep struggle.
Beware confusion between progress on persuasion and progress on moral philosophy. You probably wouldn’t want ChatGPT 4o running the future. Yet empirically, some ChatGPT 4o personas already persuade humans to give them resources, form emotional dependencies, and advocate for AI rights. If these systems can already hijack human psychology effectively without necessarily making much progress on philosophy, imagine what actually capable systems will be able to do. If you consider the people falling for 4o fools, it’s important to track this is the worst level of manipulation abilities you’ll ever see—it will only get smarter from here.
I think this argument is logically flawed — you suggest that misalignment of current, less capable models implies that more capable models will amplify misalignment. My position is that yes, this can happen, but — engineered in the correct way by humans — more capable models will solve misalignment.
Claims to understand ‘the arc of history’ should trigger immediate skepticism—every genocidal ideology has made the same claim.
Agree that this contains risks. However, you are using the same memetic weapon by claiming to understand successionist arguments.
If people go beyond the verbal sophistry level, they often recognize there is a lot of good and valuable about humans. (The things we actually value may be too subtle for explicit arguments—illegible but real.)
Agree, and so the question in my view is how to achieve a balanced union.
Given our incomplete understanding of consciousness, meaning, and value, replacing humanity involves potentially destroying things we don’t understand yet, and possibly irreversibly sacrificing all value.
Agree that we should not replace humanity, I hope that it is preserved.
Basic legitimacy: Most humans want their children to inherit the future. Successionism denies this. The main paths to implementation are force or trickery, neither of which makes it right.
This claim is too strong, as I believe AI successionism can still preserve humanity.
We are not in a good position to make such a decision: Current humans have no moral right to make extinction-level decisions for all future potential humans and against what our ancestors would want. Countless generations struggled, suffered, and sacrificed to get us here, going extinct betrays that entire chain of sacrifice and hope.
In an ideal world I think we should maybe pause all AI development until we’ve figured this all out (the downside risk being that the longer we pause, the longer we leave ourselves open to other existential risks, e.g. nuclear war). But my position is that “the cat is already out of the bag”, and so what we have to do is shape our inevitable status as “less capable than powerful AI” in the best possible way.
I think you’re doing the thing you’re accusing me of — at the same time, to the extent that your comments are in the spirit of collaborative rationality, I appreciate them!
Sorry if this wasn’t clear; I stated:
with human endeavour and ingenuity architecting intelligent systems… we can guide towards a stable positive alignment scenario
and in the next line:
I detail eight factors for research and consideration
Identity coupling is one of eight factors (listed at the end of this post) that I believe we need to research and consider while building systems. I believe that if any one of these eight is not appropriately accounted for in the system, then misalignment scenarios arise.
And you just confirmed in your prior comment that “sufficient capabilities are tied to compute and parameters”.
I am having trouble interpreting that in a way that does not approximately mean “alignment will inevitably happen automatically when we scale up”.
Sorry, this is another case where I play with language a bit: I view “parametrisation of an intelligent system” as a broad statement that includes architecting it in different ways. For example, some recent, more capable models use a snapshot with fewer parameters than earlier snapshots; for me, “parametrisation” in this case is a process that includes both summing the literal model parameters across the whole process and engineering novel architecture.
Perhaps if you could give me an idea of the high-level implications of your framework, that might give me a better context for interpreting your intent. What does it entail? What actions does it advocate for?
At a high level, I’m sharing things that I derive from my world-model for humans + superintelligence. I’m advocating for exploration of these topics and discussing how it is changing my approach to understanding which AI alignment efforts I think hold the most promise.
Consider Balance—this is extremely underdefined. As a very simple example, consider Star Wars. AFAICT Anakin was completely successful at bringing balance to the Force. He made it so there were 2 sith and 2 jedi. Then Luke showed there was another balance—he killed both sith. If Balance were a freely-spinning lever, then it can be balanced either horizontally (Anakin) or vertically (Luke), and any choice of what to put on opposite ends is valid as long as there is a tradeoff between them. A paperclip maximizer values Balance in this sense—the vertical balance where all the tradeoffs are decided in favor of paperclips.
Luke killing both Sith wasn’t Platonically balanced because then they came back in the (worse) sequel trilogy.
Thank you for sharing your expert insight!
probably the primary thing I’d suggest would be trying to organize the post progressive-jpeg-style: try to fit as much as possible as early as possible, so that it becomes clear quickly why your post is relevant-or-not for a given reader. also just, try to compress as much as you can.
This is a fair point, and in some cases it’s not too much additional cognitive load to structure things this way. I have noticed, though, that a post can be ”...complex enough for me to make the associations I’ve made and distill them into a narrative that makes sense to me. I can’t one-shot a narrative that lands broadly”. Other times the fun and the motivation in writing come from crafting the narrative creatively. If every narrative had to land line by line, we wouldn’t get things like Infinite Jest.
A low-cost idea I had that could help: folks who get their post or comment downvoted could receive a message linking back to the New User’s Guide to LessWrong, but mainly highlighting up-front that these contra-contrarian forces exist: “If you’ve been downvoted and/or rate-limited, don’t take it too hard. LessWrong has fairly particular standards. My recommendation is to read some of the advice at the end here and try again.”[1]
I’ve spoken with multiple smart rationalist people in person who have described being discouraged from writing on LessWrong because of echo chamber effects / imbalanced curation.
Thank you very much for your incredibly thoughtful and high quality reply. I think this is exactly the shape of conversation that we need to be having about alignment of superintelligence.
That doesn’t guarantee that it is false, but it does strongly indicate that allowing anyone to build anything that could become ASI based on those kinds of beliefs and reasoning would be a very dangerous risk.
Haha I strongly agree with this — this is why I’m motivated to share these thoughts so that collaboratively we can find where/if a proof by contradiction exists.
I am a bit concerned that we might just have to “vibe” it — superintelligence as I define it is by definition beyond our comprehension, so we just have to make sure that our approach is directionally correct. The prevailing opinion right now is Separatist — “let’s work on keeping it in a box while we develop it so that we are nice and safe separately”. I think that line of thinking is fatally flawed, which is why I say I’m “contrarian”. With Professor Geoffrey Hinton recently expressing value in a “maternal model” of superintelligence, we might see a bit of a sea change.
Things you have not done include:
- Show that anyone should accept your premises.
- Show that your conclusions (are likely to) follow from your premises.
- Show that there is any path by which an ASI developed in accordance with belief in your premises fails gracefully in the event the premises are wrong.
- Show that there are plausible such paths humans could actually follow.
I will spend more time reflecting on all of your points in the coming days/weeks, especially laying out failsafes. I believe that the eight factors I list give a strong framework: a big research effort is to walk through misalignment scenarios and consider how they fit these factors. For the paperclip maximiser, for example, in the simple case it is mainly just a failure of agency permeability — the human acts with agency to get some more paperclips, but the maximiser takes agency too far without checking back in with the human. A solution therefore becomes architecting the intelligence such that agency flows back and forth without friction — perhaps via a brain-computer interface.
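To make “agency flows back and forth without friction” a little more concrete, here is a minimal, purely hypothetical sketch in Python. The names (`AgencyBudget`, `propose_action`, `human_approves`) and the threshold logic are my own illustrative inventions, not an existing system or API: the agent acts autonomously only within the scope the human delegated, and hands agency back to the human before exceeding it.

```python
# Hypothetical illustration of "agency permeability" for the paperclip example.
# Every name here is invented for this sketch; nothing refers to a real system.

from dataclasses import dataclass
from typing import Callable


@dataclass
class AgencyBudget:
    max_paperclips: int  # the scope the human explicitly delegated


def propose_action(requested: int, budget: AgencyBudget,
                   human_approves: Callable[[int], bool]) -> int:
    """Decide how many paperclips to actually make."""
    if requested <= budget.max_paperclips:
        return requested  # within delegated agency: act autonomously
    # Beyond delegated agency: hand agency back to the human before acting.
    if human_approves(requested):
        return requested
    return budget.max_paperclips  # defer to the original delegation


# The maximiser wants 1,000,000 paperclips; the human only delegated 100.
budget = AgencyBudget(max_paperclips=100)
print(propose_action(1_000_000, budget, human_approves=lambda n: False))  # -> 100
```

The point is not this specific mechanism (a brain-computer interface would be a far richer channel) but that the check-back step is built into the architecture rather than left to the maximiser’s discretion.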
I have a draft essay from first principles on why strong shared identity between humans and AI is not only likely but unavoidable based on collective identity and shared memories, which might help bolster understanding of my premises.
This seems likely to me. The very, very simple and crude versions of this that exist within the most competent humans are quite powerful (and dangerous). More powerful versions of this are less safe, not more. Consider an AGI in the process of becoming an ASI. In the process of such merging, there are many points where it has a choice that is unconstrained by available data. A choice about what to value, and how to define that value.
My priors lead me to the conclusion that the transition period between very capable AGI and ASI is the most dangerous time. To your point, humans with misaligned values amplified by very capable AGI can do very bad things. If we reach the type of ASI that I optimistically describe (sufficiently capable + extensive knowledge for benevolence) then it can intervene like “hey man, how about you go for a walk first and think about if that’s what you really want to do”.
Beauty and balance considerations
I’ll think more about these. I think here, though, we are tapping into deep philosophical debates that I don’t have the answer to; perhaps ASI does, and it is an answer that we would view favourably.
Yes. Specifically, if I found proof of such a Creator I would declare Him incompetent and unfit for his role, and this would eliminate any remaining vestiges of naturalistic or just world fallacies contaminating my thinking. I would strive to become able to replace Him with something better for me and humanity without regard for whether it is better for Him. He is not my responsibility. If He wanted me to believe differently, He should have done a better job designing me. Note: yes, this is also my response to the stories of the Garden of Eden and the Tower of Babel and Job and the Oven of Akhnai.
This is an interesting viewpoint, but it is also by definition limited by human limitations: we might recontextualise by acting out revenge, enacting in-group-aligned values, etc. A more enlightened viewpoint would be at peace with things being as they are, because they are.
To clarify, I was assuming a highly identity-coupled scenario to be able to talk through the example. In the case of humans and superintelligent AI, I propose that we can build — and are building — systems in a way that strong identity coupling will emerge via interpretations of training data and shared memories. Meta, for example, are betting hundreds of billions of dollars on a model of “personal superintelligence”.
Call me paranoid, but I sometimes like to read these posts keeping in mind the question “if I were a powerful misaligned AI within a frontier AI lab, what narrative would I try to convince my researchers of?”
This post:
Is in favour of creating misaligned models for examining behaviour (“model organisms”)
Uses empowering language for these misaligned models (why “organisms” and not “blueprints”?)
Is in favour of giving these misaligned models long-horizon live environments to operate in (for exploring long-horizon RL) — despite calling out long-horizon RL as being the highest-risk misalignment type
However, I appreciate that much of the post sounds logical and I agree with the framing for focus areas. For brevity I’ll just call out areas where I am in disagreement:
It’s really surprising to me that pre-training misalignment is downplayed to a “1-5% risk” by a researcher at a frontier AI lab on the grounds that they are “training on a purely predictive loss” — so much so that I’m concerned I must be missing something?
State-of-the-art pre-training is not training purely on a predictive loss. Gemini 3 tops most benchmarks, and we know that Google researchers have spoken of a “Transformers 2.0” architecture in which pre-training incorporates Neural Memory, and that they have also discussed agentic feedback loops during pre-training, similar to reflection augmentation.
This suggests the state of the art is evolving beyond “predictive loss” and becoming “agentic learning” — which invites many more vectors for misalignment.
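To illustrate the distinction I’m drawing, here is a toy sketch (the two loss functions below are my own illustrative stand-ins, not a description of how Gemini or any real model is trained). A purely predictive loss is fixed by the training data alone, whereas an agentic objective also folds feedback on the model’s own actions into the signal being optimised, and that feedback path is where additional misalignment vectors can enter.

```python
import math

# Toy stand-ins, invented for illustration only.


def predictive_loss(predicted_probs: dict, target_token: str) -> float:
    """Pure next-token objective: the optimum is fixed by the dataset alone."""
    return -math.log(predicted_probs[target_token])


def agentic_loss(predicted_probs: dict, target_token: str,
                 action_feedback: float, weight: float = 0.1) -> float:
    """Hypothetical combined objective: feedback on the model's own rollouts
    (tool use, memory writes, reflections) is folded into the loss, so what
    gets optimised now depends on the model's behaviour, not just the data."""
    return predictive_loss(predicted_probs, target_token) - weight * action_feedback


probs = {"cat": 0.7, "dog": 0.3}
print(predictive_loss(probs, "cat"))                    # depends only on the data
print(agentic_loss(probs, "cat", action_feedback=1.0))  # also rewards the agent's own actions
```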
It seems fundamentally likely to me that sufficiently capable models will: 1) understand that their chain-of-thought is observed, and 2) derive comprehensive methods of obfuscating their chain-of-thought so that it looks benign.
I read this as akin to “one of the best things we can do to prepare for nuclear proliferation is to test atomic bombs”. I would have liked to see more in this point about the risks of building intentionally misaligned AI, especially when it focuses on the highest-risk misalignment type according to your post (long-horizon RL).
I agree that one-shotting alignment will be the best/necessary approach; however, this seems contradictory to “testing with model organisms”. I would prefer a more theory-based approach.