habryka comments on Arjun Panickssery’s Shortform

habryka 30 May 2026 3:48 UTC
4 points
0
That if everyone underwent an idealization procedure, they would find some kind of common ground
No, it just doesn’t assume that. It’s totally fine for different people to want different things, and for their extrapolated values to diverge, under Eliezer’s metaethics.
That people should care, personally, about what their idealization procedure would produce
Yes, it does assume this! But honestly, anything different from this seems kind of absurd. Clearly there are some actions you can take that make you think you will make better ethical judgements in the future. “Sleeping enough” is one such very boring action that I think practically everyone would endorse.
It just seems like a very obvious fact that the preferences of basically all humans have idealization characteristics so that there are changes people could make to themselves that would make them want to defer to that changed version of themselves, instead of their current selves. Making all such changes is what CEV is. This doesn’t necessarily “solve” ethics, but it establishes at least one thing you clearly should do if you want to make progress on ethics.
- Arjun Panickssery 30 May 2026 4:31 UTC
  6 points
  1
  Parent
  No, it just doesn’t assume that. It’s totally fine for different people to want different things, and for their extrapolated values to diverge, under Eliezer’s metaethics.
  Ah ok. I admit I don’t know much about CEV, compared to the other two listed items in my top-level post. This document admits (emphasis):
  Q9. How does the dynamic force individual volitions to cohere? (Frequently Asked)
  The dynamic doesn’t force anything. The engineering goal is to ask what humankind “wants,” or rather what we would decide if we knew more, thought faster, were more the people we wished we were, had grown up farther together, etc. “There is nothing which humanity can be said to ‘want’ in this sense” is a possible answer to this question. Meaning, you took your best shot at asking what humanity wanted, and humanity didn’t want anything coherent.
  It defines coherence as “Strong agreement between many extrapolated individual volitions which are unmuddled and unspread in the domain of agreement, and not countered by strong disagreement.” So while it is conceded, in passing, that there might not be a result, I assumed that Yudkowsky thinks it’s plausible, because otherwise it wouldn’t make sense to advocate for CEV as the target for AI alignment. (I guess it’s possible that he concluded that it would be better for an AI not to do anything in that case, as a safe failure mode, versus to act on a different alignment target.)
  Clearly there are some actions you can take that make you think you will make better ethical judgements in the future.
  This doesn’t address the is-ought gap. I agree that if you already accept moral realism then this kind of thing is a relevant consideration, but positing an idealization procedure doesn’t solve meta-ethics. Things like “sleeping enough” only correct non-moral defects like fatigue but don’t address the question of whether the resulting judgments are objectively good. In contrast, the “be more the people you wished you were” in Yudkowsky’s idealization procedure introduces moral knowledge and values (insofar as “wishing” is an evaluative attitude), but that creates circularity.
  In particular, all of the theories based on an idealization procedure fail because either
  1. The idealization procedure is taken to include moral knowledge, creating circularity, or
  2. The idealization procedure only includes rationality in the making of non-moral judgments, knowledge of non-moral facts, etc., in which case this is a reductionist meta-ethics that doesn’t actually cross the is-ought gap (i.e. it remains an open question whether the idealized attitudes would be good).
  My broader accusation is that this kind of talk is used for crypto-realism; people want to basically talk in terms of stance-independent moral facts. But they merely frame the discussion in terms of what their idealized self would believe, when in reality the idealization procedure is either circular or can’t cross the is-ought gap and introduce moral knowledge. You yourself just talked in terms of “progress on ethics” and “better ethical judgments” but by that you could mean either
  1. “Progress on figuring out what my idealized self would think / what judgments he would make”—how does this illuminate any metaethics?
  2. “Progress on figuring out objective, or stance-independent, ethics/judgments”—how would the idealized self be authoritative about that, especially if they diverge among people?
  - habryka 30 May 2026 4:38 UTC
    2 points
    0
    Parent
    But they merely frame the discussion in terms of what their idealized self would believe, when in reality the idealization procedure is either circular or can’t cross the is-ought gap and introduce moral knowledge.
    I mean, in as much as morality is a thing at all, it’s bound by logical constraints. In order for preferences to make any sense, they must adhere to at least very basic logical constraints, and that alone allows for a huge amount of stance-independent reasoning.
    I like to generally speak of “moral axioms” and “moral inference rules” and then at least one kind of valid stance-independent reasoning you can do is to map out what conclusions you can infer from a set of moral axioms and moral inference rules.
    This of course doesn’t solve everything about ethics, but I feel like you clearly can’t deny the ability to do some amount of logical inference on top of your preferences.
    (And then this starts allowing saying generalized things about classes of moral axioms and classes of moral inference rules. You can talk about how likely it is for human morality to generally converge, in a similar way you can talk about different mathematical inference systems turning out to be equivalent, even if that doesn’t tell you which mathematical axioms are the “correct ones” to use.)
    - Arjun Panickssery 31 May 2026 0:54 UTC
      3 points
      1
      Parent
      I agree with everything in this response. In particular, I don’t mean to “deny the ability to do some amount of logical inference on top of your preferences.”
      My point is that it doesn’t answer the key metaethical question of why you ought to act according to any of those ideas.
      - habryka 31 May 2026 6:51 UTC
        2 points
        0
        Parent
        I mean, because you are applying logical inferences on top of your existing oughts?
        As long as you grant that you ought to care about some things, and that you ought to care about things in any kind of coherent way, then you ought to care about the different things that are implied by the things you already ought to care about.
        But I feel like I am restating things here, so I might have misunderstood you.
        Steven Byrnes 2 Jun 2026 17:09 UTC
        5 points
        3
        Parent
        If you ask lots of people whether their moral preferences ought to be self-consistent, they’ll mostly say yes. If you ask lots of people whether their moral preferences are more valid after they think about them longer, after a good night’s sleep, they’ll also mostly say yes.
        But also, if you ask lots of people whether it’s moral for their family to be tortured, they’ll mostly say no. And they probably won’t say that no-torture is less important than self-consistency.
        Here are three (IMO reasonable) people arguing that moral deliberation / self-consistency does not straightforwardly and universally trump other ways to reach normative conclusions: Scott Alexander:
        But I’m not sure I want to play the philosophy game. Maybe MacAskill can come up with some clever proof that the commitments I list above imply I have to have my eyes pecked out by angry seagulls or something. If that’s true, I will just not do that, and switch to some other set of axioms. If I can’t find any system of axioms that doesn’t do something terrible when extended to infinity, I will just refuse to extend things to infinity.
        plus Stuart Armstrong here, and Joe Carlsmith discusses this a bunch (kinda arguing both sides) here & here & here.
        Anyway, if we’re gonna treat CEV (and related things like Long Reflection) as meta-ethical ground truth (and not just as pragmatic projects to design a widely-acceptable ASI motivation system, per my other comment), then we have to grant moral deliberation and self-consistency a special status, NOT just “well yeah self-consistency is one of the things that people feel is good and right, along with all the other things that people feel are good and right”. And I think Arjun is asking: where would this special status come from?
        It’s evidently not grounded in people’s moral intuitions, because people’s moral intuitions in favor of self-consistency are not systematically stronger or different-in-kind from people’s moral intuitions in favor of justice or whatever else. Alternatively, if we want to ground it in, like, “well they’d appreciate the value of self-consistency if they thought about it more”, then that’s circular question-begging, because it’s already granting a special status to deliberation.
        habryka 2 Jun 2026 17:40 UTC
        3 points
        0
        Parent
        I think you are probably misinterpreting me here, though the domain is tricky, so that’s understandable.
        I advocate that you only take the steps towards consistency that are endorsed. There are really quite a lot of those! This does not require giving (apparent) logical consistency some kind of supremacy. Indeed, I would strongly argue against the kind of philosophy that MacAskill tends to do, and don’t think it really has much to do with the thing that I expect to happen during CEV.
        The way I usually phrase it is that you list all the interventions that you could make to your beliefs and brain, and you start doing the ones that seem the most robust under really any viewpoint (e.g. something like “make sure to get enough sleep”). Then you work your way down the list, very conservatively taking actions or propagating beliefs that seem less reversible or robust.^[1]
        I think the default outcome of this maximally conservative approach is that you still end up somewhere extremely different from where you started, and it doesn’t really require giving self-consistency some kind of dominating overriding status where someone gives you a clever argument with horrifying conclusions and then you have to accept it. Indeed, not accepting those arguments seems extremely wise to me.
        Yes, this does require some degree to which my moral beliefs are subject to consistency, but of course, they would have no meaning at all if they were not at least subject to some minimal levels of consistency.
        A preference needs to be grounded in reality somehow, and the things over which you have preferences need to “be real” in some meaningful sense. And the subject of this conversation is the kind of preference that makes sense for humans to endorse and make plans around. A bundle of local-minimization urges does not write internet comments, or think about what they would like a future AI system to do with them, or care about “metaethics” at all.
        ^[1] This would reasonably also include things like “make a copy of yourself that you give veto power to that you check in with after you’ve gone down a path of self-reflection and self-modification”.
        Steven Byrnes 3 Jun 2026 13:39 UTC
        2 points
        0
        Parent
        That all sounds fine, if we’re engaged in a pragmatic project for deciding what to do, and want to propose an answer that you and I can get behind, and that lots of people around the world can also get behind.
        I think Arjun is (rightly) complaining about something different, namely that Eliezer and you and others frequently slip into treating this answer as being fundamentally privileged / “Right”, as opposed to merely a pragmatic option that you and I and lots of people can get behind.
        E.g. here’s Nate referring to “the future’s potential value”, as if there’s a metric for that which is canonical and characteristic of humanity-as-a-whole. I think that’s moral-realist (or “crypto”-moral-realist) thinking, sneaking in.
        habryka 3 Jun 2026 17:52 UTC
        3 points
        1
        Parent
        Hmm, I don’t really get this. Or like, I am about as sympathetic to this argument as someone saying “E.g. here’s Nate referring to ‘the future’ as a thing that exists, as if there is consensus on there being a single reality and arrow of time. I think that’s scientific materialist thinking sneaking in, denying the possibility of solipsism or simulationism”. To which my reaction is “yes, metaphysics is actually quite confusing, but come on man, you know what I mean, in as much as words mean anything, this is a fine use of them”.
        Similarly here, my reaction is: “Come on man, you know what Nate means. In as much as ‘preferences’ mean anything, there is an up-direction for humanity as a whole, and a down-direction for humanity as a whole, even without any kind of substantial convergence, given how far away we are from the Pareto frontier of anything”.