Steven Byrnes comments on Foom & Doom 2: Technical alignment is hard

Steven Byrnes 19 Sep 2025 13:40 UTC
LW: 7 AF: 6
1
AF
My read of this conversation is that we’re basically on the same page about what’s true but disagree about whether Eliezer is also on that same page too. Again, I don’t care. I already deleted the claim about what Eliezer thinks on this topic, and have been careful not to repeat it elsewhere.
Since we’re talking about it, my strong guess is that Eliezer would ace any question about utility functions and what’s their domain and when is “utility-maximizing behavior” vacuous, … if asked directly.
But it’s perfectly possible to “know” something when asked directly, but also to fail to fully grok the consequences of that thing and incorporate it into some other part of one’s worldview. God knows I’m guilty of that, many many times over!
Thus my low-confidence guess is that Eliezer is guilty of that too, in that the observation “utility-maximizing behavior per se is vacuous” (which I strongly expect he would agree with if asked directly) has not been fully reconciled with his larger thinking on the nature of the AI x-risk problem.
(I would further add that, if Eliezer has fully & deeply incorporated “utility-maximizing behavior per se is vacuous” into every other aspect of his thinking, then he is bad at communicating that fact to others, in the sense that a number of his devoted readers wound up with the wrong impression on this point.)
Anyway, I feel like your comment is some mix of “You’re unfairly maligning Eliezer” (again, whatever, I have stopped making those claims) and “You’re wrong that this supposed mistake that you attribute to Eliezer is a path through which we can solve the alignment problem, and Eliezer doesn’t emphasize it because it’s an unimportant dead-end technicality” (maybe! I don’t claim to have a solution to the alignment problem right now; perhaps over time I will keep trying and failing and wind up with a better appreciation of the nature of the blockers).
Most of your comment is stuff I already agree with (except that I would use the term “desires” in most places that you wrote “utility function”, i.e. where we’re talking about “how AI cognition will look like”).
I don’t follow what you think Eliezer means by “consequentialism”. I’m open-minded to “farfuturepumping”, but only if you convince me that “consequentialism” is actually misleading. I’m don’t endorse coining new terms when an existing term is already spot-on.
- Towards_Keeperhood 19 Sep 2025 19:00 UTC
  1 point
  0
  Parent
  To me it seems a bit surprising that you say we agree on the object level, when in my view you’re totally guilty of my 2.b.i point above of not specifying the tradeoff / not giving a clear specification of how decisions are actually made.
  I also think the utility maximizer frame is useful, though there are 2 (IMO justified) assumptions that I see as going along with it:
  1. There’s sth like a simplicity prior over the space of utility functions (because there needs to be some utility maximizing structure implemented in the AI).
  2. The utility function is a function of the trajectory of the environment. (Or in even better formalization it may take as input a program which is the environment.)
    I think using a learned value function (LVF) that computes valence of thoughts is a worse frame to use for tackling corrigibility because it’s harder to clearly evaluate what actions the agent will end up taking. And because this kind of “imagine some plan and what the outcome would be and let the LVF evaluate that” doesn’t seem to me how smarter than human minds operate—considering what change in the world an action would cause seems more natural than whether some imagined scene seems appealing. Even humans like me move away from the LVF frame, e.g. I’m trying to correct for scope insensitivity of my LVF by doing sth more like explicit expected utility calculations.^[1]
  “You’re wrong that this supposed mistake that you attribute to Eliezer is a path through which we can solve the alignment problem, and Eliezer doesn’t emphasize it because it’s an unimportant dead-end technicality” (maybe! I don’t claim to have a solution to the alignment problem right now; perhaps over time I will keep trying and failing and wind up with a better appreciation of the nature of the blockers).
  I’m more like “Your abstract guesturing didn’t let me see any concrete proposal that would make me more hopeful, and even if good proposals are in that direction it seems to me like most of the work would still be ahead instead of it being like ‘we can just do it sorta like that’ as you seem to present it. But maybe I’m wrong and maybe you have more intuitions and will find a good concrete proposal.”.
  I don’t follow what you think Eliezer means by “consequentialism”. I’m open-minded to “farfuturepumping”, but only if you convince me that “consequentialism” is actually misleading.
  Maybe study logical decision theory? Not sure where to best start but maybe here:
  “Logical decision theories” are algorithms for making choices which embody some variant of “Decide as though you determine the logical output of your decision algorithm.”
  Like consequentialism in the sense of “what’s the consequence of choosing the logical output of your decision algorithm in a particular way”, where consequence here isn’t a time-based event but rather the way the universe looks like conditional on the output of your decision algorithm.
  1. ^
    I’m not confident those are the only reason why LVF seems worse here, I didn’t fully articulate my intuitions yet.
  - Steven Byrnes 19 Sep 2025 19:48 UTC
    2 points
    0
    Parent
    Maybe study logical decision theory?
    Eliezer has always been quite clear that you should one-box for Newcomb’s problem because then you’ll wind up with more money. The starting point for the whole discussion is a consequentialist preference—you have desires about the state of the world after the decision is over.
    You have desires, and then decision theory tells you how to act so as to bring those desires about. The desires might be entirely about the state of the world in the future, or they might not be. Doesn’t matter. Regardless, whatever your desires are, you should use good decision theory to make decisions that will lead to your desires getting fulfilled.
    Thus, decision theory is unrelated to our conversation here. I expect that Eliezer would agree.
    To me it seems a bit surprising that you say we agree on the object level, when in my view you’re totally guilty of my 2.b.i point above of not specifying the tradeoff / not giving a clear specification of how decisions are actually made.
    Your 2.a is saying “Steve didn’t write down a concrete non-farfuturepumping utility function, and maybe if he tried he would get stuck”, and yeah I already agreed with that.
    Your 2.b is saying “Why can’t you have a utility function but also other preferences?”, but that’s a very strange question to me, because why wouldn’t you just roll those “other preferences” into the utility function as you describe the agent? Ditto with 2.c, why even bring that up? Why not just roll that into the agent’s utility function? Everything can always be rolled into the utility function. Utility functions don’t imply anything about behavior, and they don’t imply reflective consistency, etc., it’s all vacuous formalizing unless you put assumptions / constraints on the utility function.
    - Towards_Keeperhood 19 Sep 2025 20:52 UTC
      1 point
      0
      Parent
      The purpose of studying LDT would be to realize that the type signature you currently imagine Steve::consequentialist preferences to have is different from the type signature that Eliezer would imagine.
      The starting point for the whole discussion is a consequentialist preference—you have desires about the state of the world after the decision is over.
      You can totally have preferences about the past that are still influenced by your decision (e.g. Parfit’s hitchhiker).
      Decisions don’t cause future states, they influence which worlds end up real vs counterfactual. Preferences aren’t over future states but over worlds—which worlds would you like to be more real?
      AFAIK Eliezer only used the word “consequentialism” in abstract descriptions of the general fact that you (usually) need some kind of search in order to find solutions to new problems. (Like I think just using a new word for what he used to call optimization.) Maybe he also used the outcome pump as an example, but if you asked him what how consequentialist preferences look like in detail, I’d strongly bet he’d say sth like preferences over worlds rather than preferences over states in the far future.