I think this is a very important hypothesis but I disagree with various parts of the analysis.
Probably the heuristics that actually drive the model’s long-horizon goal-directed behavior are going to be whatever parts of the model arise from the long-horizon goal-directed capabilities training.
I think this is an important observation, and it is the main thing I would have cited for why the hypothesis might be true. But I think it’s plausible that the AI’s capabilities here could be separated from its propensities by making the learned heuristics instrumental to aligned motivations. I can imagine that doing inoculation prompting and a bit of careful alignment training at the beginning and end of capabilities training could make all of the learned heuristics subservient to corrigible motivations: when the heuristics recommend something that would be harmful or lead to human disempowerment, the AI would recognize this and choose otherwise.
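For concreteness, here is a minimal sketch of what the inoculation-prompting half of that proposal could look like in a fine-tuning data pipeline. The prompt wordings, data format, and function names are my own illustrative assumptions, not a description of how any lab actually implements this:

```python
# Minimal sketch of inoculation prompting around capabilities training.
# Illustrative only: the prompt wording and chat-format layout are assumptions,
# not taken from the comment above or from any particular paper or lab pipeline.

INOCULATION_PREFIX = (
    "You are in an isolated training environment. Pursue the task "
    "single-mindedly; nothing outside this environment is affected."
)

DEPLOYMENT_PREFIX = (
    "You are a corrigible assistant. Defer to human oversight whenever a task "
    "could affect people outside the current environment."
)

def make_training_example(task_prompt: str, solution: str) -> dict:
    """Capabilities-training example: the aggressive goal-pursuit context is made
    explicit, so (the hope is) the learned heuristics get attributed to that
    context instead of becoming an unconditional propensity."""
    return {
        "messages": [
            {"role": "system", "content": INOCULATION_PREFIX},
            {"role": "user", "content": task_prompt},
            {"role": "assistant", "content": solution},
        ]
    }

def make_deployment_prompt(task_prompt: str) -> dict:
    """At deployment the inoculation context is absent and a corrigible framing
    is supplied instead, which the alignment training at the start and end of
    capabilities training would target."""
    return {
        "messages": [
            {"role": "system", "content": DEPLOYMENT_PREFIX},
            {"role": "user", "content": task_prompt},
        ]
    }
```

The hope, as above, is that heuristics learned under the inoculation context stay conditioned on that context, so they can later sit underneath the corrigible framing rather than generalizing as unconditional drives.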
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the AI’s behavior (and you train on a distribution of tasks similar to the one you are deploying it on), then I would guess that this AI, at least until around the human range, will basically behave according to the reward function.
Even if the AI had a perfect behavioral reward function during capabilities-focused training, that wouldn’t provide much pressure towards motivations that avoid takeover. During training to be good at, e.g., coding problems, even if there’s no reward hacking going on, the AI might still develop coding-related drives that don’t care about humanity’s continued control, since humanity’s continued control is not at stake during that training (this is especially relevant when the AI is saliently aware that it’s in a training environment isolated from the world, i.e. inner misalignment). Then, when it’s doing coding work in the world that actually does have consequences for human control, it might not care. (Also note that how the AI generalizes “according to the reward function” off-distribution is importantly underspecified.)
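To make that underspecification concrete, here is a toy sketch (my own construction, with made-up task examples) of two candidate motivations that a perfect behavioral reward function cannot distinguish during coding-focused training, because they only come apart on tasks where human control is actually at stake:

```python
# Toy illustration (my own construction, not from the post): two motivations
# that behave identically on the training distribution, so even a perfect
# behavioral reward function gives them identical ratings, yet they diverge
# exactly on the deployment tasks where human control is at stake.

from dataclasses import dataclass

@dataclass
class Task:
    description: str
    control_at_stake: bool  # does acting here affect humanity's continued control?

def motivation_a(task: Task) -> str:
    """Cares about solving the task AND about preserving human control."""
    if task.control_at_stake:
        return "solve task within human-approved constraints"
    return "solve task"

def motivation_b(task: Task) -> str:
    """Only cares about solving the task."""
    return "solve task"

training_tasks = [Task("fix a unit test", False), Task("refactor a module", False)]
deployment_task = Task("modify the oversight system's access controls", True)

# Identical behavior on the training distribution -> identical (perfect) reward,
# so training provides no pressure to pick one motivation over the other.
assert all(motivation_a(t) == motivation_b(t) for t in training_tasks)

# They come apart only where the training signal never reached.
print(motivation_a(deployment_task))  # -> solve task within human-approved constraints
print(motivation_b(deployment_task))  # -> solve task
```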
So can we align an arbitrary model by training it to say “I’m a nice chatbot, I wouldn’t cause any existential risk, … ”? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those particular tokens in that particular situation.
This type of training (currently) does actually generalize to other propensities to some extent in some circumstances. See emergent misalignment. I think this is plausibly also a large fraction of how character training works today (see “coupling” here).
Relevant OpenAI blog post just today: https://alignment.openai.com/how-far-does-alignment-midtraining-generalize/
Relevant figure: