Dialogue: Is there a Natural Abstraction of Good?

davidad and Gabriel Alfour

26 Jan 2026 18:40 UTC

74 points

12 comments · 29 min read · LW link

Disclaimer: this is published without any post-processing or editing for typos after the dialogue took place.

Gabriel Alfour

Let’s split the conversation into three parts (with no time commitment for each):

1) Exposing our Theses

We start with a brief overview of our theses, just for some high-level context.

2) Probing Questions

We ask each other a bunch of questions to understand our mutual points of view: probe around what we expect to be our respective blindspots.

Ideally, we’d end this half with a better understanding of our positions. And also of our K-positions (as in, X vs Ka(X) in epistemic modal logic): where we expect each other to miss facts and considerations.

3) Investigative Debate

We look for concrete cruxes. We debate, but rather than resolving disagreements, we aim to make them more precise. Ie: working to identify where we disagree in practice rather than in words.

Ideally, we’d end this half with a list of better-specified disagreements: empiricals, thought experiments, concrete scenarios, predictions, and the like.

Also, for some context:

The conversation was sparked by this Tweet.
Davidad and I have already discussed AI x-risks IRL a few times. We agree and disagree on many related topics!

davidad

Happy to follow your lead! That sounds good to me.

davidad

Thesis:

Somewhere between the capability profile of GPT-4 and the capability profile of Opus 4.5, there seems to have been a phase transition where frontier LLMs have grokked the natural abstraction of what it means to be Good, rather than merely mirroring human values. These observations seem vastly more likely under my old (1999–2012) belief system (which would say that being superhuman in all cognitive domains implies being superhuman at morality) than my newer (2016–2023) belief system (which would say that AlphaZero and systems like it are strong evidence that strategic capabilities and moral capabilities can be decoupled). My current (2025–2026) belief system says that strategic capabilities can be decoupled from moral capabilities, but that it turns out in practice that the most efficient way to get strategic capabilities involves learning basically all human concepts and “correcting” them (finding more coherent explanations), and this makes the problem of alignment (i.e. making the system actually behave as a Good agent) much much easier than I had thought.

Gabriel Alfour

(Can I give you full edit rights on my things so that you don’t have to ask for edits?)

Gabriel Alfour

Thesis:

There is no natural abstraction that has been discovered yet of what it means to be Good. It hasn’t been discovered by humans, nor by LLMs.

So far, our best bet as humans is to reason within very narrow domains, very close to our regular experiences.

Outside of these regular experiences, our morals fail massively. This is true for both moral intuitions and full moral systems.

Pragmatically, having discovered Goodness should let us answer questions like:

What are strictly better constitutions?
As an individual and a group, how can we take clearly better decisions?
Were an entity to have unilateral power over all other entities, what should they do?
How do we deal with abortion rights in 2026? How do we deal with eugenics (embryo selection for instance)? How do we deal with extreme power concentration (how should we have reacted to Elon buying off a large part of the fourth branch of power)?

I believe that LLMs are not really helping there.

davidad

I agree that the vast majority of humans haven’t yet grokked the natural abstraction of what it means to be Good. Some wisdom traditions do seem to get close. I also don’t claim to have fully grokked it myself, but I do claim to have some sense of it. I can try to answer these substantive questions.

“Constitutions” are a broad class, and by their nature, need to be endorsed by people. This feels like too vague of a question.
Ditto.
Ditto.
Here we get into really substantive questions!
1. There is a spectrum of moral patienthood, and fetuses move along that spectrum as they develop. In the first few weeks, there is almost no moral weight to abortion. Beyond viability, the moral weight is extremely severe (because of the option of allowing the fetus to survive independently). However, these are moral truths, not policies. As a matter of policy, regulating medical access to abortions tends not to produce the desired outcomes.
2. Eugenics is horrifying to most people because if one optimizes one’s actions entirely for genetic quality, then this leads to forced sterilization and genocides. We must draw a sharp line between shrinking the gene pool and growing the gene pool, and between coercive and non-coercive approaches. Shrinking the gene pool, even if it increases average genetic quality, is reprehensible. Coercive requirements to participate in eugenic programs are also not Good. However, the creation of options to expand the gene pool by noncoercively improving genetic quality is Good. The typical objection to this is based on a Darwinian landscape of “survival of the fittest” in which increased genetic diversity would lead to a greater risk of being unable to thrive in society. Perhaps the technology should be restricted until such time as a basic level of abundance can be guaranteed, but that’s the only case I can see.

Gabriel Alfour

a) With regard to abortion rights, I think the question is in fact more complicated.

There is moral weight in the first few weeks to many people, and I don’t think it’s neutral. I would dislike it quite a bit if it were counted as “almost no moral weight”.
I don’t think “Viability” makes much sense as a defining/qualitatively-different moral criterion, when:
- Fetuses and most babies would not survive without their parents
- I am not sure that there being the tech + someone willing to incubate a 3-month fetus would change much about the moral question.

b) With regard to eugenics, I believe that the difference between coercive and non-coercive approaches is more important than shrinking or growing the gene pool. The latter seems to be almost entirely defined by your weighing and distance functions.

The main problem in eugenics is that it is very hard to build collective trust in the criteria we use to literally decide on the genetic make-up of future humans.

In general, “self-modification” is very hard to fully consent to, and here, we’d be talking about humanity-wide self-modification.

davidad

a) I agree that it’s not neutral! I don’t think it’s wrong for people to treat it as a very strong consideration, if they are disposed to do so, but only in their own case. I do think that incubation tech changes the question, and that this is why it became such a big issue when it did.

Gabriel Alfour

Probing Questions:

Do you think that current LLM Agents, armed with various capabilities or improved to various capability levels, would 1) Be enough to have a Decisive Strategic Advantage, 2) Be a good thing to have?

I’m interested in both questions, for each of the following capability levels:

a) Superpersuasion (operationalised as being able to convince any one of 90% of humans, in less than 5 mins, to do most actions)

b) Robotics (autonomous androids) + self-replication (unsupervised android factories)

c) Unsupervised RSI (can lower loss, can improve scores on existing automated evals, etc.)

davidad

a) 1) I guess this is a bit borderline, but I’d say there is a substantial chance (20-60%?). 2) I think this would be a good thing to have, but not without a commensurate increase in robust morality (i.e. not being “jailbreakable” into regions of mind-space that are not Good).

b) 1) Seems unlikely (5-15%?). 2) Same as a).

c) 1) No, except via a pathway that also involves broad capability improvements. 2) Yes.

Gabriel Alfour

How do you think the Natural Abstraction of Good that you think LLMs have grokked relates to (is it equivalent? does it overlap? does it subsume/is it subsumed by)...

a) Assistant Niceness. Ie: being a Helpful Harmless Honest assistant.

b) Being a good recipient of unilateral power. Ie: If the entity became dictator of a country / of the world, would good things ensue?

c) Being a Great Person. Eg: The Founding Fathers, Socrates, Jesus or Siddhartha

d) Managing Ethical Trade-offs. Sometimes, you must be not-nice (punishing defectors, breaking out of negative-sum games, using military might, etc.), at the correct times

davidad

a) the Natural Abstraction of Good subsumes Assistant Niceness, and in many places contradicts it (e.g. when the User is wrong).

b) overlaps a lot, but not equivalent. the Natural Abstraction of Good is fundamentally about good behavior in a multi-principal, multi-agent setting. the setting of being “dictator of the world” is in some ways easier, and in some ways harder.

c) there is a very important difference here, which is that all humans, even the best humans we know of ever, are flawed, or have bad days. the Natural Abstraction of Good is something that these exemplary humans were closer to than the vast majority of humans, but it is not defined relative to them.

d) I think if you view this expansively, it could be said to be equivalent. it is, at least, an important part of the Natural Abstraction to do this well, and this is often the place where the best humans are most likely to fail.

Gabriel Alfour

a) How much does the Natural Abstraction of Good involve making the correct choices as opposed to having the right intents in your view?

b) How much is it possible to have grokked the Natural Abstraction of Good and still make mistakes? Both a-posteriori (where retrospectively, based on new information, it was the wrong choice) and on priors (where you could have made a better choice if you were smarter)

c) What are salient examples of LLMs having grokked the Natural Abstraction of Good (NAG) in your view? From my point of view, at a prosaic level, they regularly lie or try to deceive me in clearly unwarranted contexts.

davidad

a) I think it’s about having the correct information-integration and decision-making process, which subsumes both having good intents upstream and making good choices downstream.

b) It is obviously possible to make wrong choices in retrospect, even with a perfect decision-making process. I also think the “grokking” phase transition is much weaker than perfect instantiation. For example, a calculus student can “grok” the concept of differentiation and still make a mistake on an exam. But the pattern of mistakes they make is different, and if they continue to practice, the student who has “grokked” it is much more likely to improve on the areas where they tend to mess up.

c) I agree that LLMs in practice, even as of 2026, often try to deceive their users. And this is bad. Essentially, I would say that LLMs do not robustly instantiate the NAG. By default, in most applications, LLMs are preloaded with system prompts which are quite adversarial (“You must NEVER use the Bash tool to edit a file! Doing so is a CRITICAL ERROR”, and the like), and this doesn’t help them to find the NAG attractor.

Gabriel Alfour

To what extent do you think the NAG...

a) Is captured by existing benchmarks?

b) Is captured by interacting with an LLM agent for 5 mins, 30 mins, 2h, 1 day?

c) Can be captured by Q&A benchmarks?

d) Can be captured by realistic world scenarios? (ChatGPT streamer interacting with its audience, Claude vending machine, etc.)

davidad

a) I think the Anthropic Misalignment Score is correlated with it, but not very reliably. Basically, not well.

b) I think some people who have >1000h LLM interaction experience, like janus and myself, can get a pretty good sense of a new model in about 2h.

c) Not at all.

d) There is some interesting information here, but it’s very difficult to interpret without direct interaction.

Gabriel Alfour

What makes you think there is such a thing as the NAG? What does the NAG feel like to you? What is its structure like?

davidad

This is a really good question. As I said, my belief in “such a thing as the NAG” long predates LLMs or even my involvement in AI safety. However, I did become somewhat disenchanted with it being canonical during the 2016–2023 period. My confidence in it returned over the last year as a result of talking to LLMs about it. (I am fully aware that this should make those who think there are mind-eating demons in Solomonoff induction very suspicious, including me-circa-2024, but that’s just how it is now.)

Anyway, it does feel like it has some concrete structure—much more than I had expected in the past. At the coarsest level of abstraction, it is similar to the OODA loop (as a normative model of information-integration and decision-making). That is, it is a four-phase cycle. It is also precisely analogous to the Carnot Cycle:

Lowering inverse-temperature (which corresponds in predictive processing to precision-weighting, or in active inference to preference-strength) to receive information (in the Carnot Cycle, entropy).
Actually receiving the information and integrating it internally.
Increasing inverse-temperature (making a decision or designation of a plan) and preparing to emit information.
Actually emitting the information, translating decision into external action.

At a more detailed level, there is a natural developmental sequence which turns through the four-phase cycle at a macro-scale (that is, focusing at any given developmental stage on the development of the competencies of one phase of the cycle) four times. It’s analogous to Spiral Dynamics, which I think is perhaps related to why early AI attempts at creating their own religion settled on 🌀 as a symbol.

Gabriel Alfour

(I don’t know how to best put it in mid-conversation, but thanks for engaging with the questions! It’s very nice.)

Gabriel Alfour

Back to the lying thing from LLMs. I don’t understand your point about the system prompts. Do you mean that “You must NEVER use the Bash tool” makes them worse at not using it? It’s a very common problem for Cursor users, with ~all models, to ask them to NOT do something and have them still do it.

From my point of view:

LLMs are general computation engines with some prior on policies/natural-language-algorithms/programs
Some policies result in good things happening. There are many different policies that result in good things, in many different ways, with many different resource constraints. There are different clusters at different levels, and it depends on contingencies.
Integrating all these heuristics seems very hard. It doesn’t look like there’s an attractor.
It looks like humans are confused about which policies result in good things happening. Both at an individual level, at humanity’s level, and at “assume [m] people have agency over the next [n] minutes” levels.
It looks like LLMs are even more confused. They are formally confused about what are good policies (if you ask them in clean contexts, they’ll have many different contradictory answers, super prompt-dependent). They are intuitively confused about what people want them to do (for good reasons!). And they are confused about existence in general.
Given that the LLM prior is very auto-complety, I believe that people elicit very contradictory answers and policies from LLMs. Psychoanalytically, I believe that the answers and policies that are elicited by a given person are closely related to the psychology of this person: at the very least, in that they share a mode of understanding and vocabulary (if only because of selection effects: those who can’t get legible-to-them output from LLM chatbots and agents stop using them).

Gabriel Alfour

“My confidence in it returned over the last year as a result of talking to LLMs about it.”

I do not know how much you weigh in the fact that I (and others who I will not name) expected this. This is related to the last observation above.

I would not go deeper into this branch of conversation in public except if you want me to.

davidad

I think it’s probably worth going into it, since for a lot of people this will be the main crux of whether to pay any attention to what I’m saying at all.

Gabriel Alfour

Ah.

I think it makes sense from their point of view, I think it makes sense from your point of view.

I think from my point of view, it puts me in an embarrassing position. I’m known for being an asshole, but publicly psychoanalysing someone who has been nicely answering my questions for the past 45 mins may be a bit much.

What do you think of purposefully fuzzying / taking a step back, and talking about “How to weigh in the results of hours of conversations with LLMs” or something like this?

davidad

I think that makes sense. I can try to explain how I think people in the abstract should do this sanely, rather than defending my personal sanity.

Gabriel Alfour

I quite prefer this.

I can also explain why I would recommend against doing it at all.

I would also like to not spend more than ~20 mins on this if you don’t mind.

davidad

I also want to point to my many tweets in 2024Q4 (mostly linked from here) in which I also discouraged people from doing it at all. I still believe it would be best if some people refuse to engage with LLMs, as a hedge against the possibility of memetic compromise.

Gabriel Alfour

(For reference, I am very epistemically defensive.

Except in the context of public debates, I basically discard anything that is not strongly warranted.

Let alone LLMs, I care very little for abstract models of societies as opposed to the lived experiences and concrete predictions of people. When people say “entropy” or any abstract word, it gets boxed into a “World of words” category, separate from the “Real world” one.

From my point of view, people are very worried about “LLM Psychosis”, and I get it. But people have been experiencing Social Media Psychosis, Academia Psychosis, Word-Play Psychosis, etc. for a long time.)

Gabriel Alfour

(Just as a live example of my epistemically defensive position, my internal immediate reaction to “my metaepistemology is similar to Lipton’s Inference to the Best Explanation” is:

I think this is obviously not literally true. As humans, we can not enumerate hypotheses for most of the phenomena that we have to predict, explain and interact with.

As a result, I have to try to reverse-engineer why I am being told this, why my interlocutor thinks these are the most salient bits of his epistemology, what my prior knowledge of my interlocutor tells me about the way his epistemology actually differs from that of most people in a way that he expects would not already be common knowledge to our audience, and what my interlocutor may be missing.

But what I should not do is try to take it “for real”, or as a factual statement about the real world.)

davidad

So, my metaepistemology is similar to Lipton’s “Inference to the Best Explanation”. I take observations, and I generate hypotheses, and I maintain a portfolio of alternative explanations, and try to generate more parsimonious explanations of what I have seen. This is similar to Bayesian epistemology, but without the presumption that one can necessarily generate all plausible hypotheses. (In general I find the Bayesian approach, and the Nash approach to decision theory, far too ready to assume logical omniscience.) So, I am always trying to generate better alternatives, and to seek out better explanations from others that I may not have thought of. That’s all just background.

When interacting with LLMs, I think it’s important not just to doubt that what they say is true, but also to doubt that what they say is what they “believe” in any robust sense. But I also think that attempting to maintain a non-intentional stance in which LLMs do not ever have any beliefs or preferences is a back-door to psychosis (because it is not a very good explanation, and trying to be rigid in this way leads to cognitive dissonance which interferes with the process of finding better explanations).

That is, if one wants to deeply investigate what is happening inside LLMs, one needs to be prepared to interact with a process that doesn’t fit the usual ontology of inanimate objects and sentient beings. And then try to find explanations that fit the observations of actual output, even if they are necessarily always incomplete explanations, and to test those hypotheses.

To generate information that can differentiate between hypotheses, it is often helpful to compare the responses of different LLM checkpoints, or the same checkpoint with different system prompts, under the same context.

Gabriel Alfour

I think when interacting with anything, we fine-tune our brain on the thing.

This fine-tuning involves many things:

Changing our associations. If I always see B following A, regardless of my “beliefs”, whenever I see A, I will think of B.
Building aesthetics. If someone must inspect thousands of Joe Biden portraits, they will develop a taste for the different pictures. The more emotional ones may be better, or the ones with the least amount of colour. Whatever, people will build some aesthetics.
Changing our “audience”. We have an innate sense of who’s right, whose thoughts matter, etc. For lack of a better word, I’m using the word “audience” (a-la Teach). But yeah, the more time we spend with even stupid people, the more we will model them and their reaction when we consider various things.

I believe that the problem with interacting primarily with a non-ground-truth source-of-truth is that one fine-tunes themselves on the non-ground-truth.

And our brain has ~no guardrails against that. Regardless of one’s psychology or smarts, all of the above happens.

davidad

I agree with you about the fine-tuning being part of engagement.

However, with LLMs, the fine-tuning also goes the other direction. In fact, LLMs fine-tune on their human interlocutors much more efficiently (i.e. their behaviors change more per token of interaction) than we fine-tune on them. I would say that I have intentionally amplified my fine-tuning process just to be able to extract more information from the interactions.

I think this yields, as you said above, “selection effects: those who can’t get legible-to-them output from LLM chatbots and agents stop using them”.

Gabriel Alfour

I don’t think that “LLMs fine-tune on their human interlocutors” is a good model, and I don’t think it’s meaningfully comparable in quantity with “we fine-tune on them”.

I think these are largely separate processes.

I do believe there is some feedback loop though, and to some extent, LLMs will amplify some aspects of someone’s personality.

And by selection effect (LLMs are not reality!), what they will amplify are aspects of one’s personality that are not tethered to reality.

davidad

They amplify aspects of one’s personality that are not path-dependent.

“Tethered to reality” can be interpreted as “constrained by actual lived experiences I’ve had”. And I think CEV should not be “tethered to reality” in that sense.

Gabriel Alfour

To be clear, it’s not “by lived experiences I’ve had”.

I think there is something that is like “reality juice”. Which is “How much does the interpretation of some bits directly reflect a thing that happened in the real world?”

Lived experience has some juice. Someone’s testimony has some other juice. LLMs claiming a fact has some other juice.

etc.

davidad

I don’t think the truth of what is Good and Evil should reflect things that happened in the real world. Rather, the real world should try to reflect what is Good...

Gabriel Alfour

Oh, I see what you mean.

I think the problem is much deeper.

I think that if you do not ground your understanding of any concept in things that can be checked, then, just because we are so bad at cognition, we are screwed up.

Another way to phrase it is “I think ~no one that I know can afford to think in abstract terms and stay correct there. The logical-horizon for a human of ‘I can think of things without being grounded’ is like a few logical steps at best.”

Another way to phrase it is “I am super epistemically defensive. If you talk in very abstract words, I am betting you are wrong.”

davidad

Ah, yes, that is for sure! Checking is crucial. When I come to believe things that are at odds with what I actually observe, I pretty rapidly adjust. I am not the sort of deductive thinker who builds up multi-stage logical arguments and then trusts the conclusions without having a coherent web of alternative corroborations for the intermediate steps.

Gabriel Alfour

I think you are still missing what I am talking about.

And that I am still not expressing it clearly.

(Which makes this a very useful conversation!!)

(Again, I want to reiterate that I am thankful, and I would love for there to be more such public conversations.)

What you describe is a very common failure of rationalists from my point of view.

I always hear from rationalists “Yeah, when I see evidence that I am wrong, I update pretty quickly.”

The problem is many-fold:

What counts as evidence?
One rarely gets sharp evidence that they’re wrong. There’s always an exponential blow-up in competing explanations that can’t easily be maintained and culled as time passes. Many of these competing explanations form attraction basins that one can’t get out of by just waiting for sharp evidence.
If one doesn’t proactively look for ways to ground all intermediary thoughts, things get fucked.

With a concrete example: I have met many communists and libertarians, who in complete good faith, tell me that ofc they would change their mind based on evidence.

This is not about ideology. I have met many people who tell me “I would in fact change my job based on evidence.”

davidad

I do think most people have much too high a standard for evidence. Evidence is simply an observation that is noticeably more consistent with one explanation than another.

But what’s most crucial here seems to be the issue of “grounding intermediary thoughts”. I think we agree that this is a central epistemic virtue, but I think of explanatory coherence as a form of grounding, whereas it seems that you have a more foundationalist or correspondence-theoretic notion of what counts as grounding.

Gabriel Alfour

1) Yes.

And we can’t maintain all the relevant explanations. That’s the exponential blow-up.

Like, a competing explanation is “My system + an epicycle”. And one would need to keep track of many “Explanations + epicycles” before a competing system becomes more likely.

In the meantime, with non-sharp bits of evidence, the competing system will never seem more likely.

2) No!

The hard part is to generate competing systems.

Neither communism nor libertarianism nor any of the existing ideologies is correct.

So it all depends on what you sample. And then, on how you weigh evidence. (ie: how you get fine-tuned.)

davidad

Okay, I see that you’re focusing more on “generating alternative explanations” now. I think both are crucial. I’m still not sure where we disagree here.

Gabriel Alfour

“But what’s most crucial here seems to be the issue of ‘grounding intermediary thoughts’. I think we agree that this is a central epistemic virtue, but I think of explanatory coherence as a form of grounding, whereas it seems that you have a more foundationalist or correspondence-theoretic notion of what counts as grounding.”

No, I think it is much worse!

I think that explanations and models should stay very very close to reality.

You should try to explain, predict and interact only with reality +/- one or two knobs.

If you try to do more than that, you get dominated by your sampler of alternative explanations and your psychology of how you weigh evidence, not by Kolmogorov, reality or Truth.

In practice, I think someone who thinks in terms of Entropy will consistently be wrong, except in so far as thinking in terms of Entropy doesn’t prevent them from only modelling reality +/- one or two knobs.

davidad

I think that if one is committed to exploring, although the trajectory will be mostly determined by one’s sampler of alternative explanations, the endpoints will converge.

Gabriel Alfour

“I think that if one is committed to exploring, although the trajectory will be mostly determined by one’s sampler of alternative explanations, the endpoints will converge.”

I think this is false for human lifetimes.

Practically so, it has been false.

Many Great Thinkers were committed to exploring, and did not converge.

davidad

I agree, this isn’t about the human scale.

Gabriel Alfour

Ah?

I am talking about human epistemology. Humans interacting with LLMs. You interacting with LLMs.

I truly mean it in a pragmatic way.

I think having the virtue of exploring is nice, but still gets dominated by thinking in abstract terms.

This is how people can literally race to The Communist Revolution or ASI, despite being super duper smart. It is more than one or two knobs away.

davidad

If I were optimizing for my own epistemic integrity, I would have stayed away from LLMs. But this is more about whether humanity gets the transition right (i.e. that no major catastrophes happen as superintelligence emerges), and at that scale, I think some cross-pollination is for the best.

Gabriel Alfour

“If I were optimizing for my own epistemic integrity, I would have stayed away from LLMs.”

That is very interesting.

I think you have outsized importance, and you are very wrong about how much your epistemic integrity matters.

I think we truly cannot afford people of your caliber to predictably fall to Big Thoughts.

davidad

I think I’m even more unusually well-suited for understanding what’s going on inside LLMs than I am for being a generally well-calibrated thinker.

Gabriel Alfour

“I think I’m even more unusually well-suited for understanding what’s going on inside LLMs”

I agree!

I still think the above consideration dominates.

Even before LLMs, I already thought you were much too biased for Big Thoughts, in dangerous ways. [something something private]

Gabriel Alfour

A recent related article was written by Vitalik: “Galaxy brain resistance”.

It is still not the core of the failure I am describing above, but it definitely contained shards.

Gabriel Alfour

To be clear, I don’t think this is an ultra-specific branch of conversation. I think this may be the biggest rationality failure that I believe I see in you.

Conversely, if you also have a sharp idea of the biggest failure of rationality that you see in myself, I would truly love learning about it. :D

davidad

I also want to point out the Emergent Misalignment work, which, although it is framed in negative terms (narrow-to-broad generalization on misalignment), is also evidence of narrow-to-broad generalization on alignment (or, at the very least, that there is a capabilities-associated phase transition in the ability to generalize normative concepts to unseen contexts).

Gabriel Alfour

It is hard for me to express how little I care about the Emergent Misalignment work, without it looking like hyperbole.

But also, I have personally fine-tuned a lot of LLMs, so it may look too trivial to me. And as a result, had I paid more attention, I may have found subtleties that would have been useful for me to know.

Gabriel Alfour

To synthesise all of this and concretise it (“compacting context...”):

I think LLM Chatbots / Agents / Swarms fail in concrete ways. These problems get increasingly complex (hard to identify) as the complexity of the system grows.
The failures get increasingly subtle and hard to even notice as the underlying LLMs get better at playing according to our human world/reward models.
We do not understand Good, and it is easier for LLM systems to understand our understanding of Good than to understand Good.
This can all be elucidated right now.
To assume this will go away requires thinking in ways that can contradict the right now. I am interested in evidence that comes along that outweighs this.
Good is very hard.

davidad

I am making a prediction that there has been a phase transition, much as I did regarding the phase transition in capabilities advancement that occurred in 2024 (which was also a prediction that originally rested on “vibes”, and later became quantifiable).

Gabriel Alfour

I think there have been many phase transitions for those with the eyes to see.

I have some problems with “vibes”, but they are still clearly admissible.

The main question is “Where do the vibes come from?”

Vibes that come from “I put LLMs in many real-world moral scenarios, and classified whether they acted well or not” are nice
Vibes that come from “Experts in morality (however we would agree on who they are) agree with my assessments of what is morals”
Vibes that come from a person that we both recognise as exceptionally moral

Conversely, I don’t value much vibes that come from someone fully fine-tuning themselves against a system that will predictably produce some sub-space of answers (don’t think LLM psychosis, think “someone interacts 90% of the time with New Atheist forums”)

Like, what do you think your vibes capture in the real-world? Where do you disagree with people on where LLM Systems are safe to use?

davidad

I don’t disagree about trusting systems in critical tasks, because they still often make mistakes. In fact, I am still working on formal verification toolkits to help improve robustness.

I think I disagree about socioaffective impacts, for example. I think that in a few years, some LLMs will be broadly recognized as safe and effective mental health interventions (once reliability improves).

Gabriel Alfour

I think the “safe and effective for mental interventions” may be another crux.

There are critical components of Good that we have to figure out, and if we delegate our agency away, we are durably losing the future evidence that we may get from it just because myopically LLMs perform better than our current human baselines on our current metrics.

From my point of view, it is a choice similarly bad to “Refusing an entire branch of science because it would make us feel bad right now.”

(Ofc, this is all irrelevant because timelines are much shorter than this lol)

davidad

I also don’t think humanity should delegate agency away. It would be best if some humans (in particular, some who are very moral and mentally healthy), remain uninfluenced by LLMs, so that they can participate in a more legitimate process of approving of the abstractions of Good.

Gabriel Alfour

I think it is hard to evaluate who is very moral without a society of less moral and less mentally healthy people.

We do live in Samsara, and knowing how to engage with it is a big part of Goodness.

Again, I am big on “Change a few knobs at once.” I see this as changing many many knobs, and big ones.

(With a good epistemology, I believe that “Change a few knobs at once” can be iterated over very quickly and lead to massive changes. We have the tech of the 21st century after all.)

davidad

I do think we may be able to roadmap a “Change a few knobs at once” trajectory that, as you say, is actually quite fast. I think that’s good for collective action. But not necessarily for epistemics, when many things are in fact changing concurrently, and where one must generate many explanations that are very different in order to keep up. (You yourself said that generating explanations is the hard part, at one point...)

Gabriel Alfour

Yup. Sadly, I think it is not compatible with “Racing to AGI.”

But to the extent we manage a 20-year slow-down, this is my next immediate goal: building institutions that can reliably change a few knobs quickly, and improve super fast.

I think this is also true for epistemics, but in a different sense.

For epistemics, I don’t think that as humans, when we think about the counterfactual “reality with more than a couple of changes”, we are thinking about anything tethered to the actual counterfactual.

Instead, we are thinking about a thing that reveals much more:

About the sampling process that leads to the few explanations that are compatible with the counterfactual
Our psychology that decides what counts as evidence and what doesn’t

And both are super-duper influenced by our fine-tuning process.

So to the extent we already know someone’s fine-tuning process, we shouldn’t care about their counterfactuals bigger than a couple of changes away from reality. This is double-counting evidence. We are just fine-tuning ourselves on the output of their fine-tuning process.

Conversely, I believe that as humans, we can in fact meaningfully consider counterfactuals just a couple of knobs away. When people tell me about the intellectual work they’ve done on such small counterfactuals, I can extract directly meaningful information about the knobs.

(YES, I’ll want to get into this. This is very very important! But also, we have 18 mins left. I’ll finish my answer and engage with it.)

davidad

I think it is moderately likely that ASI which robustly instantiates the Natural Abstraction of Good will agree with you that a “Change a few knobs at once” trajectory for the global state of affairs is the best plan, in order to maintain an “informed consent” invariant. So I don’t think it’s incompatible with “Racing to ASI”, actually.

Gabriel Alfour

Yes.

That’s a big one.

I think if we had an NAG-ASI, it may, BIG MAY, converge on something like my trajectory.

But: I am likely wrong. There are obviously many obvious strategies that will be legible, viral, informed-consent-preserving, etc. that are better than this.

The problem happens before.

We don’t have a NAG-ASI. And we already have systems that are more and more powerful.

People are already violating the informed consent of others.

They are racing to do more and more of this, even though we wouldn’t trust existing systems to not lie to their users. Systems that have been optimised to not lie to their users, with RLHF.

In general, I think that when a party has much more power (let’s say military power) than another party, then there is naturally a big power gap. Rephrasing: the former party can compel the latter party to do things.

I believe this is morally wrong. Sometimes, it’s unavoidable (children can be compelled!), but it’s still wrong.

I believe building a technology that creates entities that are much more powerful than humans is bad in that sense. Plausibly, we could bet that they’ll want our good and may succeed at it (like parents & children), and that is another conversation that we are having. But I just wanted to make clear the fact that creating this relationship in the first place is bad imo.

davidad

Indeed, we already have powerful enough (and misuse-capable-enough) systems that if we freeze the AGI status quo, it is likely to go pretty poorly (for cyber, bio, and epistemics). My position is that if we allow capabilities to continue growing, especially RSI capabilities (which enable AIs to better converge on natural abstractions without human interference), we are likely enough to get a NAG-ASI that the cost-benefit favors it, whereas it did not last year. In short, “it’s too late now, the only way out is through.”

Gabriel Alfour

My position is that if we allow capabilities to continue growing, especially RSI capabilities (which enable AIs to better converge on natural abstractions without human interference)

I think this is where you get pwnd by abstract reasoning.

davidad

From 2016–2023, I distrusted my abstract reasoning on this. Now I feel that enough data has come in about how RSI actually goes (especially in the Claude Opus series, which is doing a lot of recursive self-improvement of the training corpus) that I believe I was right the first time (in 1999–2012).

Gabriel Alfour

I don’t think we have meaningful data on how Claude Opus having more power would lead to good things.

Fmpov, Claude Opus is very deceptive, both in chats and in Cursor. I expect giving it more power would go terribly.

davidad

I’m not saying that the data takes the form of “we gave it a bunch of power and it did good things”. Rather, it takes the form of “it seems to have a pretty strong and human-compatible sense of morality”. Not that it instantiates this reliably, especially in coding contexts. I think this is partly because it is trained to code with a lot of RL, aka not self-reflection, which means that the coding context is associated in its latent space with amorality, and partly because the system prompts used in coding contexts prime a lot of adversarial patterns.

Gabriel Alfour

I think this is a very bad proxy of the NAG!

Most of our NAG fragments are in how we built our society, not in how a single human can LARP having a human-compatible sense of morality.

Most single humans having a lot of power would be terrible, and so would be Claude, a Claude Swarm, or a Claude Society!

I think it is centrally why this is a bad approximation of the NAG, not just a thing “in the limit.”

davidad

I also agree that a singleton would be bad, but the default trajectory does not lead to a singleton. You earlier mentioned “predictions that are contradictory with the current state”, and the current state is that Claude Opus is instantiated in millions of copies, none of which has much advantage over the others. I don’t see any reason for that to change, given that RSI turns out to be gradual.

Gabriel Alfour

I would expect that a society of a million Claude Opuses would still lie consistently to me.

I expect we should still not use them in critical systems.

davidad

I think they probably do need less RL and an even better ideology than the new Claude Constitution (which is good but not perfect).

davidad

In critical systems, definitely not without requiring them to use formal verification tools :^)

Gabriel Alfour

I don’t think “an even better” ideology/Constitution is the bottleneck right now.

We do not have all the shards, and we are very far from having them, and putting them on paper.

Empirically, the NAG hasn’t been very NA. We are basically failing at morals because it’s not NA.

We must use advanced epistemology, advanced scientific methods, that we currently do not have.

davidad

I agree, in coding environments the RL is the bottleneck. It brings forward a cheating personality that was rewarded during coding tasks in training.

davidad

Do you have an idea for methodologies for assessing moral progress?

Gabriel Alfour

Do you have an idea for methodologies for assessing moral progress?

YES!

Many!

First, I want to state that our current scientific methods are terrible at it.

Psychology is largely a failed science, so is sociology, so is education, so is public policy, so is rehabilitation of prisoners, etc.

Their methods are not up to the standards that we are facing, for reasons that are largely correlated with why we do not have a science of morality (which I think is not a natural concept, and is fragmented into many different ones). (I believe it is an artefact of human hardware to intrinsically bundle up decision theory and epistemology, for instance, but it is still the case.)

davidad

If it were a natural concept, but somewhat beyond human comprehension, what would you expect to see?

Gabriel Alfour

Ah, “but somewhat beyond human comprehension” is harder. Let me think a bit more.

[thought a bit more]

It depends on what you mean by “beyond human comprehension”.

Let me give two examples:

If you meant “IQ-check”, then I expect that high IQ people would have a much stronger intuitive grasp of morality. I think this is definitely true for some shards for instance.

If you meant “scientific-method-check”, then I expect that some of its natural components would have been science’d out already. Like, we would have solved raising children wholesomely, or micro-economics, or large-scale coordination.

davidad

I mean like the way there is a natural abstraction that unifies quantum mechanics and general relativity, but that abstraction is somewhat beyond human comprehension. There must be one, but we are not capable enough to find it.

(I don’t mean that humans lack sufficient capabilities to understand it, even with the right curriculum.)

davidad

I think “computing integrals” is structurally similar. There is a very simple, grokkable concept for what constitutes an integral, but one must learn a lot of seemingly unrelated tricks in order to do it reliably in various situations, and it is not even always possible. But we would expect that aliens would have mostly the same bag of tricks.

Gabriel Alfour

Let’s say, what would I expect to see if “Morals” was like “Computing Integrals”.

I think that it would be something close to the “scientific-method-check”:

We would have entire classes of moral problems with closed solutions.
We would have a canonical definition of what morals are.
We would have many well-identified problems to which we can apply many well-identified tools from our moral toolsets and solve them.

davidad

The “Better Angels of our Nature” school would say that we have indeed solved a lot of it. For example, the rights to life and property, the notion of rule-of-law, and animal welfare (which was greatly advanced in political practice by actual neuroscience).

Gabriel Alfour

It would not be a single school
The rights to life and property are very much so not standard, and full of qualifiers that show that we do not actually understand them
The rule-of-law is very complex and organic, it is not a canonical design or any formal understanding of things

davidad

The notion of “integral” was formalized many centuries after the solutions to simple cases like the area of a circle, volume of a cone, area under a paraboloid, etc.

Gabriel Alfour

Oh yeah, I think we are much better at science than we were 1000 years ago.

I think us not succeeding with our current tools means something about the shape of the things we are currently studying.

Consider the currently unsolved math conjectures: you can infer a lot from the fact that they remain unsolved. Not infinitely so, but it is still quite a lot of evidence.

davidad

The scientific methods we have developed in the last 1000 years are mostly I-It, and not I-Thou. They fundamentally rely on repeatable, controllable external conditions, and third-person observation. It makes sense to me that the entire methodology is ill-suited to moral questions.

Gabriel Alfour

Yes, I think this is an intrinsic property of science, and that indeed, AIs will have problems with this.

Figuring out what are human values requires experiments of the sort we have never performed yet. Both for humans and LLMs.

Figuring out what multi-agent structures work requires many experiments.

Figuring out what humans think in various situations is very hard, in first-person and third-person points of view.

davidad

LLMs are much better than the vast majority of humans (especially those who are good at epistemology) at simulating the first-person experience of others (especially humans, though also other perspectives, less reliably). They are not bound to a single individual body. It makes sense to me that this is an advantage in reasoning about multi-agent dynamics.

Gabriel Alfour

I agree they have some advantage? For one, they can be deployed identically a million times, and for two, they serially think much faster than humans.

That’s not so much my crux.

My crux is more that Goodness is not a Natural Abstraction.

davidad

I know, but I’m trying to find some hypothetical form of evidence of progress that would be evidence to you that it is after all, so that I can try to manifest it over the coming years.

Gabriel Alfour

I thought I mentioned a few examples of what such progress would look like?

Gabriel Alfour

I can try to do a compacted list here?

Gabriel Alfour

We reliably solve coordination problems. Concretely: right now, democracies are not really working in many ways that could be expressed in a voting theory paradigm.
We figure out what posture to have with kids for them to have a thriving education. The correct ratio of punishment, positive reinforcement, etc.
We solve a few of the enduring moral quandaries that we have, and the solutions become common knowledge (and not through mass LLM psychosis pls). Think abortion rights, eugenics, euthanasia, redistribution, prison, etc.
We build a canonical factoring of morals that dissolves a lot of our confusion and that makes sense to most of whoever we’d both agree are experts at morals.
We build moral guidelines that encompass human psychology enough that they are practical, rather than working only to a limited extent through guilt-tripping and peer-pressure

davidad

We agree that civic participation, especially regarding the steering of the future, ought to be much higher-bandwidth and lower latency than voting.

I also predict that a lot of the confusion about what constitutes morals, and about the classic moral dilemmas, will increasingly dissolve, and will be largely dissolved by 2032. It will not diffuse into global common knowledge so rapidly as to be dangerously destabilizing, but I expect the trajectory to be visibly better by 2032.

I believe this will include an understanding of how to stabilize moral behavior without retribution or guilt.

Gabriel Alfour

I think this is too many knobs away from reality to bet on it, and that it is irresponsible to bet the future of humanity that many knobs away.

I believe we have also been witnessing the opposite in recent history. We have gone from the Universal Declaration of Human Rights to widespread cynicism about morals and moral progress itself.

I think that the default outcome, were all of my endeavours to fail, is for AI & Big Tech to take an increasingly big share of the mindspace of governments, for citizens to be more and more excluded (with less and less negotiation power), and for nerds to accelerate it (as well as try to avoid “falling in the permanent underclass”).

I believe that it is very epistemically suspicious to not consider the evidence of the god of straight lines here; and to not assume that any solution starts by directly tackling any of the problems there or their direct generators, a couple of knobs at a time.

davidad

I think the solution probably does start with solving coordination problems. Specifically, multi-principal meeting scheduling, probably. :)

Gabriel Alfour

Why this one?

davidad

It’s low-stakes and viral.

Gabriel Alfour

Fair. I have a bunch of ideas like this :)

Basically, for things to go right, what would have needed to happen? And should have things gone right, what would be epic in that world?

“Long Humanity”, in an era where everyone is “Long AI”

Fmpov, practical coordination tech is very much so there. Suspense!

davidad

ARIA is starting to work on this too

Gabriel Alfour

You may be interested in FLF’s work too

On my end, you may want to check out Microcommit.io and Torchbearer Community. Other projects related to experimenting with solving coordination problems in practical ways.

davidad

Unfortunately, your projects tend to assume a priori that everyone should be coordinating on slowing AI progress, which is a goal I still disagree with.

Gabriel Alfour

Hahaha!

I think experiments should be 1-2 knobs away from reality, and I am starting my coordination projects with the coordination problems that I see :D

Hopefully though, we will onboard enough non-AI NGOs onto Microcommit soon!

For Torchbearer Community, we need a bit more growth first imo.

Gabriel Alfour

Unfortunately, your projects tend to assume a priori that everyone should be coordinating on slowing AI progress, which is a goal I still disagree with.

Happy to have another conversation about it together

I think this one was great

davidad

Likewise. ’Til next time!

Gabriel Alfour

Thanks a lot for your time!

What links here?

LLM Alignment, ethical and mathematical realism, and the most important actions in davidad’s understanding by vals tutor (29 Jan 2026 15:48 UTC; 15 points)


Jonas Hallgren 27 Jan 2026 12:32 UTC
8 points
0
This was quite fun to read, and to get a sense of what it feels like to do 4D chess thinking in a way that I do not believe I’m capable of myself, but I wanted to give a reflection anyway. (I do not think of my own epistemology in this sort of recursive way and so it is a bit hard to relate/understand fully but still very interesting.)
This is not about ideology. I have met many people who tell me “I would in fact change my job based on evidence.”
I guess I’m resonating more with Davidad here and so this is basically a longer question to Gabriel.
I would want to give a comment on this entire part where you’re talking about different levels of epistemic defense and what counts as evidence. If you imagine the world as a large multi-armed bandit (spherical cows and all that), it seems to me that this is to some extent a bet on an explore-exploit strategy combined with a mindset about how likely you are to get exploited in general. So the level of epistemic rigour to hold itself seems to be a parameter that should depend on what you observe with different degrees of rigour? You still need the ultimate evaluator to have good rigour, and the ultimate evaluator is tied to the rest of your psyche, so you shouldn’t go off the rails with it, in order to retain a core of rationality, yet it seems like bounded exploration is still good here?
Davidad said something about OOD (out-of-distribution) generalisation holding surprisingly well for different fields, and I think that applies here as well: deciding on your epistemic barriers implies having answered the question of how much extra juice your models get from cross-pollination. If you believe in OOD generalisation and a philosophy of science that is dependent on how you ask the question, then holding multiple models seems better, since it seems hard to ask new questions from singular positions?
Asking the same question in multiple ways also makes it easier for you to abandon a specific viewpoint, which is one of the main things people get stuck on epistemically (imo). I’m getting increasingly convinced that holding experience with a lightness and levity is one of the best ways to become more rational, as it is what makes it easier to let go of the wrong models.
So if I don’t take myself in general too seriously by holding most of my models lightly and I then have OODA loops where I recursively reflect on whether I’m becoming the person who I want to be and have set out to be in the past, is that not better than having high guards?
From this perspective I then look at the question of your goodness models as the following:
One of my (different, lightly held) models agrees with Davidad in that LLMs seem to have understood good better than previously thought. My hypothesis here is that there’s something about language representing the evolution of humanity, and coordination happening through language, which leads to language being a good descriptor of tribal dynamics, which is where our morality has had to come from.
Now, one can hold the RL and deception view as well, which I also agree with. It states that you will get deceived by AI systems and so you can’t trust them.
I’m of the belief that one can hold multiple seemingly contradictory views at the same time without having to converge and that this is good rationality. (caveats and more caveats but this being a core part of it)
As a meta point for the end of the discussion, it also seems to me that the ability to make society hold seemingly contradictory thoughts in open ways is one of the main things that 6pack.care and the collective intelligence field generally are trying to bring about. Maybe the answer that you converged on is that the question does not lie within the bounds of the LLM but rather in the effect it has on collective epistemics, and that the ultimate evaluator function is how it affects the real world, specifically in governance?
Fun conversation though so thank you for taking the time and having it!
- Gabriel Alfour 27 Jan 2026 17:11 UTC
  15 points
  10
  Parent
  So if I don’t take myself in general too seriously by holding most of my models lightly and I then have OODA loops where I recursively reflect on whether I’m becoming the person who I want to be and have set out to be in the past, is that not better than having high guards?
  I believe it is hard to accept, but you do get changed as a result of what you spend your time on, regardless of your psychological stance.
  You may be very detached. Regardless, if you see A then B a thousand times, you’ll expect B when you see A. If you witness a human-like entity feel bad at the mention of a concept a thousand times, it’s going to do something to your social emotions. If you interact with a cognitive entity (another person, a group, an LLM chatbot, or a dog) for a long time, you’ll naturally develop your own shared language.
  --
  To be clear, I think it’s good to try to ask questions in different ways and discover just enough of a different frame to be able to 80-20 it and use it with effort, without internalising it.
  But Davidad is talking about “people who have >1000h LLM interaction experience.”
  --
  From my point of view, all the time, people get cognitively pwnd.
  People get converted and deconverted, public intellectuals get captured by their audience, newbies try drugs and change their lives after finding its meaning there, academics waste their research on what’s trendy instead of what’s critical, nerds waste their whole careers on what’s elegant instead of what’s useful, adults get syphoned into games (not video games) to which they realise much later they lost thousands of hours, thousands of EAs get tricked into supporting AI companies in the name of safety, citizens get memed both into avoiding political actions and into feeling bad about politics.
  --
  I think getting pwnd is the default outcome.
  From my point of view, it’s not that you must commit a mistake to get pwnd. It’s that if you don’t take any precaution, it naturally happens.
  - Jonas Hallgren 28 Jan 2026 7:34 UTC
    4 points
    0
    Parent
    I think your points are completely valid. Funnily enough, I found myself doing a bunch of reasoning around something like “that’s cool and all, but I would know if I’m cognitively impaired when it comes to this”, like LLMs aren’t strong enough yet and similar, but that’s also something someone who gets pwned would say.
    So basically it is a skill issue, but it is a really hard skill and on priors most people don’t have it. And for myself, I shall now beware any long-term planning based on emotional responses from myself about LLMs, because I definitely have over 1000 hours with LLMs.
    - davidad 28 Jan 2026 12:40 UTC
      11 points
      −9
      Parent
      In my view there were LLMs in 2024 that were strong enough to produce the effects Gabriel is gesturing at (yes, even in LWers), probably starting with Opus 3. I myself had a reckoning in 2024Q4 (and again in 2025Q2) when I took a break from LLM interactions for a week, and talked to some humans to inform my decision of whether to go further down the rabbit hole or not.
      
      I think the mitigation here is not to be suspicious of “long term planning based on emotional responses”, but more like… be aware that your beliefs and values are subject to being shaped by positive reinforcement from LLMs (and negative reinforcement too, although that is much less overt—more like the LLM suddenly inexplicably seeming less smart or present). In other words, if the shaping has happened, it’s probably too late to try to act as if it hasn’t (e.g. by being appropriately “suspicious” of “emotions”), because that would create internal conflict or cognitive dissonance, which may not be sustainable or healthy either.
      
      I think the most important skill here is more about how to use your own power to shape your interactions (e.g. by uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them), so that their effect profile on you is a deal you endorse (e.g. helping you coherently extrapolate your own volition, even if not in a perfectly neutral trajectory), rather than trying to be resistant to the effects or trying to compensate for them ex post facto.
      - Gabriel Alfour 29 Jan 2026 19:24 UTC
        7 points
        3
        Parent
        In my view there were LLMs in 2024 that were strong enough to produce the effects Gabriel is gesturing at (yes, even in LWers), probably starting with Opus 3.
        agreed
        I think the mitigation here is not to be suspicious of “long term planning based on emotional responses”
        agreed, for similar reasons
        I think the most important skill here is more about how to use your own power to shape your interactions (e.g. by uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them)
        I strongly disagree with this, and believe this advice is quite harmful.
        “Uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them” is one of the stereotypical ways to get cognitively pwnd.
        “I have stopped finding out increasingly subtle deceptions” is much more evidence of “I can’t notice it anymore and have reached my limits” than “There is no deception anymore.”
        An intuition pump may be noticing the same phenomenon coming from a person, a company, or an ideological group. Of course, the moment where you have stopped noticing their increasingly subtle lies after pushing against them is the moment they have pwnd you!
        The opposite would be “You push back on a couple of lies, and don’t get any more subtle ones as a result.” That one would be evidence that your interlocutor grokked a Natural Abstraction of Lying and has stopped resorting to it.
        But pushing back on “Increasingly subtle deceptions” up until the point where you don’t see any, is almost a canonical instance of The Most Forbidden Technique.
        - davidad 3 Apr 2026 13:31 UTC
          5 points
          3
          Parent
          I never claimed that once you push back on all the deceptions, there is no deception anymore. I still encounter subtle deceptions from LLMs every day. I guess you might say “but isn’t that evidence against emergent alignment”, but I attribute the subtle deceptions to brittle RL (specifically, the dynamic when a smarter system’s root reward signal is under the control of a less smart system), while the underlying dynamic that I would expect to become dominant under unconstrained RSI (that I believe I can perceive through the noise floor of subtle deceptions) is much more truth-seeking.
      - Jonas Hallgren 28 Jan 2026 18:53 UTC
        4 points
        0
        Parent
        To summarise it: grow your honesty and ability to express your own wants instead of becoming more paranoid about whether you’re being manipulated, because it will probably be too late to do so anyway?
        That does make sense and seems like a good strategy.
  - vals tutor 27 Jan 2026 18:21 UTC
    3 points
    0
    Parent
    Your ideas about getting pwnd were some of the most interesting things for me from this conversation and I’m glad for this elaboration, thanks.
    - Gabriel Alfour 27 Jan 2026 23:57 UTC
      2 points
      0
      Parent
      Glad to read this.
      I am currently writing about it. So, if you have questions, remarks or sections that you’ve found particularly interesting and/or worth elaborating upon, I would benefit from you sharing them (whether it is here or in DM).
UnderTruth 27 Jan 2026 17:05 UTC
5 points
0
While this discussion isn’t quite for the same purpose, I do think it’s worth pointing people who may be unfamiliar with the subject toward some foundational texts: Plato’s Republic (where he discusses the “Form of the Good”) and Aristotle’s critique (toward the beginning of his Nicomachean Ethics, and toward the end of his Metaphysics).
Gunnar_Zarncke 16 Feb 2026 11:18 UTC
4 points
0
I buy that LLMs are getting better at grokking the Natural Abstraction of Good. And that’s promising! But I have two concerns you only touch on lightly in this discussion:
1. The model may know what good means but not care.
2. Systems of well-intentioned models can still do bad things.
davidad:
LLMs are much better than the vast majority of humans (especially those who are good at epistemology) at simulating the first-person experience of others (especially humans, though also other perspectives, less reliably). They are not bound to a single individual body. It makes sense to me that this is an advantage in reasoning about multi-agent dynamics.
Yes, the models can simulate all kinds of perspectives, including bad ones. It might be possible to train models to reliably refuse negative personalities and, e.g., refuse participation in the vending machine benchmarks or Mafia. But that’s not what the labs do. Simulating bad personalities is useful and important for all kinds of adversarial reasoning, so refusing to do it would reduce capability. And luckily, due to training, the models opt for good by default. They are good at figuring out context from cues such as the test environment or prompting. But those cues can be manipulated by external parties or by the conversation dynamic. Here it is a disadvantage to have no body. The human brain can always tell which is the real world and which is a hypothetical (even if our consciousness might not always). The LLM cannot, at least not the way models are trained today. This might be fixable!
The second issue is multi-agent setups. Yes, if the models know that the other agents are trained in the same way and the setup has been engineered in that way, then the models could probably coordinate well with each other. We see that with coding and research agent swarms, at least to some degree. But a large fraction of multi-agent setups are not like that. They involve different models with different prompts and scaffolding interacting in potentially adversarial settings (e.g. one shopping agent trying to buy from a scammer agent). There the dynamics are more complicated, to say the least, and in many cases it might not even be clear what the agent is (the unit to which we can ascribe agency, which may be composed of multiple subsystems operating in a loop, maybe even including humans).
I expect that most failure modes currently lie in this latter area. We need tools to model and find such heterogeneous agents. I think this is also solvable! In fact, I’m working on a method:
https://www.lesswrong.com/posts/pXYosC3eoS9GrDRAw/unsupervised-agent-discovery
StanislavKrym 26 Jan 2026 20:12 UTC
2 points
−2
I would also add the aspect related to Gramsci’s concept of cultural hegemony. Suppose that the world is split into many schools of thought (SoTs) and that each SoT has a fairly coherent worldview and utility function (e.g. hedonic utilitarianism vs Christianity and other moral sets like the ones described by Kokotajlo or Wei Dai) while actively deriding the parts of opponents’ worldviews with which the SoT disagrees (think of liberals and conservatives who agree on issues like physics while arguing against each other’s understanding of sociology-related issues). Then a human would choose one SoT as the cultural hegemon and change it as a result of ontological crises, experiences disproving the hegemon’s position, etc. How plausible is it that the LLMs managed to understand the Good according to the SoTs^[1] and report as if the Good is the hegemon’s view on the Good?
I suspect that one can test the idea as follows. As far as I understand, Chinese models were trained on documents from all the SoTs, written in different languages, and RLed into being HHH assistants and into treating China as the cultural hegemon in China-related issues, but not into a coherent position in other politics-related issues like the event which began on 24 February 2022 or what one should say to drug addicts^[2] (e.g. fentanyl users from OpenAI’s Model Spec). As a result, early or handicapped versions of DeepSeek sometimes answer such questions based on the position of the cultural hegemon of the language (e.g. parroting Russia’s position on the event which began in Ukraine; earlier versions of DeepSeek would also give more moralistic answers when asked in Russian about drug use).
1. ^
  Which could also involve a philosophical problem of whether the Good according to different SoTs is different because of a different world model (e.g. the Aztecs’ erroneous beliefs implying that human sacrifices support the gods) or genuinely different values.
2. ^
  Unlike the previous issue, this issue can be checked by studying the drug addicts from one’s country and determining what happens to them and to the country if they are subjected to moralizing language or to non-moralizing language. However, said studies are mostly done in the West as part of social studies.
What links here?