Elliott Thornley (EJT) comments on Max Harms’s Shortform

Elliott Thornley (EJT) 25 Nov 2025 15:18 UTC
1 point
0
They don’t have to be short-term oriented! Their utility function could be:
$u = f\left(\sum_{i} p_{i}\right)$
where $f$ is some strictly concave function and $p_{i}$ is the agent’s payment at time $i$. Agents with this sort of utility function don’t discount the future at all. They care just as much about improvements to $p_{i}$ regardless of whether $i$ is 1 or 1 million. And yet, for the right kind of $f$, these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.
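To make this concrete, here is a minimal sketch of such a utility function. Everything numeric in it is an illustrative assumption rather than part of the proposal: the two candidate choices of $f$ (`math.log1p` and a bounded exponential `bounded_f`), the payment sizes, and the probabilities. It checks that the timing of a payment doesn't matter, and that whether the agent prefers the salary depends on how sharply concave $f$ is.

```python
import math

def utility(payments, f):
    """u = f(sum_i p_i). `payments` maps time index i -> payment p_i.
    Only the total matters, so there is no time discounting."""
    return f(sum(payments.values()))

def expected_utility(lottery, f):
    """`lottery` is a list of (probability, payments) pairs."""
    return sum(prob * utility(payments, f) for prob, payments in lottery)

def bounded_f(x, scale=10.0):
    """Hypothetical strictly concave, bounded choice of f (an assumption)."""
    return 1.0 - math.exp(-x / scale)

# The same payment at time 1 or time 1,000,000 yields the same utility.
early = {1: 5.0}
late = {1_000_000: 5.0}
same_utility = utility(early, bounded_f) == utility(late, bounded_f)

# Illustrative gambles: a small salary (total 100) with probability 0.9,
# versus a huge payoff (1e12) with probability 0.5. Otherwise nothing.
salary = [(0.9, {i: 1.0 for i in range(1, 101)}), (0.1, {})]
takeover = [(0.5, {1: 1e12}), (0.5, {})]

prefers_salary_bounded = (
    expected_utility(salary, bounded_f) > expected_utility(takeover, bounded_f)
)
prefers_salary_log = (
    expected_utility(salary, math.log1p) > expected_utility(takeover, math.log1p)
)
```

With these made-up numbers, the bounded $f$ prefers the salary while the logarithmic $f$ still prefers the gamble, which is the sense in which the result holds only "for the right kind of $f$".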
- Max Harms 25 Nov 2025 18:22 UTC
  2 points
  0
  Sorry, I guess I’m confused. Let me try and summarize where I feel like I’m at and what I’m hearing from you.
  I think, if you’re an AGI, not trying to take over is extremely risky, because humans and future AIs are likely to replace you, in one way or another. But I also think that trying to take over is extremely risky, because you might get caught and turned off. I think the question of which is more risky depends on the circumstances (e.g. how good the security preventing you from seizing power is), and so “risk aversion” is not a reliable pathway to unambitious AIs, because ambition might be less risky in the long run.
  I agree that if it’s less risky to earn a small salary, then if your concave function is sharp enough, the AI might choose to be meek. That doesn’t really feel like it’s engaging with my point about risk aversion only leading to meekness if trusting humans is genuinely less risky.
  What I thought you were pointing out was that “in the long run” is load-bearing, in my earlier paragraph, and that temporal discounting can be a way to protect against the “in the long run I’m going to be dead unless I become God Emperor of the universe” thought. (I do think that temporal discounting is a nontrivial shield, here, and is part of why so few humans are truly ambitious.) Here’s a slightly edited and emphasized version of the paragraph I was responding to:
  [D]epending on what the agent is risk-averse with respect to, they might choose [meekness]. If [agents are] … risk-neutral with respect to length of life, they’ll choose [ambition]. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose [meekness].
  Do we actually disagree? I’m confused about your point, and feel like it’s just circling back to “what if trusting humans is less risky”, which, sure, we can hope that’s the case.
  - Elliott Thornley (EJT) 26 Nov 2025 16:45 UTC
    1 point
    0
    Yeah, risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we’ll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (which would likely be more confident of successful rebellion than of getting paid for cooperation).
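
    A minimal sketch of that condition, under assumptions that are not stated in the thread: $f$ is strictly increasing as well as strictly concave, $f(0) = 0$, cooperation pays a total of $s > 0$ with probability $q_c$, rebellion yields a total of $L > s$ with probability $q_r > 0$, and both options pay nothing otherwise. The symbols $s$, $L$, $q_c$, and $q_r$ are illustrative. The agent then prefers cooperation exactly when

    $$q_c \, f(s) > q_r \, f(L), \quad \text{i.e.} \quad \frac{q_c}{q_r} > \frac{f(L)}{f(s)}.$$

    Since $f(L)/f(s) > 1$, cooperation requires $q_c > q_r$. The more sharply concave $f$ is (the closer $f(L)/f(s)$ is to 1), the closer $q_c > q_r$ comes to being sufficient as well.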