Steven Byrnes comments on “The Era of Experience” has an unsolved technical alignment problem

Steven Byrnes 5 May 2025 14:55 UTC
LW: 5 AF: 4
0
AF
Thanks!
RE 2 – I was referring here to (what I call) “brain-like AGI”, a yet-to-be-invented AI paradigm in which both “human-like ability to reason” and “human-like social and moral instincts / reflexes” are in a nuts-and-bolts sense, like they’re actually doing the same kinds of algorithmic steps that a human brain would do. Human brains are quite different from LLMs, even if their text outputs can look similar. For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch. So bringing up “training data” is kinda the wrong idea. Indeed, for humans (and brain-like AGI), we should be talking about “training environments”, not “training data”—more like the RL agent paradigm of the late 2010s than like LLMs, at least in some ways.
I do agree that we shouldn’t trust LLMs to make good philosophical progress that goes way beyond what’s already in their human-created training data.
RE 1 – Let’s talk about feelings of friendship, compassion, and connection. These feelings are unnecessary for cooperation, right? Logical analysis of costs vs benefits of cooperation, including decision theory, reputational consequences, etc., are all you need for cooperation to happen. (See §2-3 here.) But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress]. “Selfishness” isn’t even a coherent concept unless the agent intrinsically wants something, and innate drives are upstream of what it wants, and those feelings of friendship, compassion etc. can be one of those innate drives, potentially a very strong one. Then, yes, there’s a further question about whom those feelings will be directed towards—AIs, humans, animals, teddy bears, or what? It needs to start with innate reflexes to certain stimuli, which then get smoothed out upon reflection. I think this is something where we’d need to think very carefully about the AI design, training environment, etc. It might be easier to think through what could go right and worng after developing a better understanding of human social instincts, and developing a more specific plan for the AI. Anyway, I certainly agree that there are important potential failure modes in this area.
RE 3 – I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective. (I’m not expressing doubt, I’m just struggling to see things from outside my own viewpoint here.)
By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”. But to defend the “least-bad” claim, I would need to go through all the other possible plans that seem better, and why I don’t find them plausible (on my idiosyncratic models of AGI development). I mentioned my skepticism about AGI pause / anti-proliferation above, and I have a (hopefully) forthcoming post that should talk about how I’m thinking about corrigibility and other stuff. (But I’m also happy to chat about it here; it would probably help me flesh out that forthcoming post tbh.)
- Wei Dai 10 May 2025 7:33 UTC
  LW: 4 AF: 4
  0
  AF Parent
  
  For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch.
  
  I think this could be part of a viable approach, for example if we figure out in detail how humans invented philosophy and use that knowledge to design/train an AI that we can have high justified confidence will be philosophically competent. I’m worried that in actual development of brain-like AGI, we will skip this part (because it’s too hard, or nobody pushes for it), and end up just assuming that the AGI will invent or learn philosophy because it’s brain-like. (And then it ends up not doing that because we didn’t give it some “secret sauce” that humans have.) And this does look fairly hard to me, because we don’t yet understand the nature of philosophy or what constitutes correct philosophical reasoning or philosophical competence, so how do we study these things in either humans or AIs?
  
  But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress].
  
  I find it pretty plausible that wireheading is what we’ll end up wanting, after sufficient reflection. (This could be literal wireheading, or something more complex like VR with artificial friends, connections, etc.) This seems to me to be the default, unless we have reasons to want to avoid it. Currently my reasons are 1. value uncertainty (maybe we’ll eventually find good intrinsic reasons to not want to wirehead) 2. opportunity costs (if I wirehead now, it’ll cost me in terms of both quality and quantity vs if I wait for more resources and security). But it seems foreseeable that both of these reasons will go away at some point, and we may not have found other good reasons to avoid wireheading by then.
  
  (Of course if the right values for us are to max out our wireheading, then we shouldn’t hand the universe off the AGIs that will want to max out their wireheading. Also, this is just the simplest example of how brain-like AGIs’ values could conflict with ours.)
  
  I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective.
  
  You’re prioritizing instilling AGI with the right social instincts, and arguing that’s very important for getting the AGI to eventually converge on values that we’d consider good. But under different meta-ethical views, you should perhaps be prioritizing something else. For example, under moral realism, it doesn’t matter what social instincts the AGI starts out with, what’s more important is that it has the capacity to eventually find and be motivated by objective moral truths.
  
  From my perspective, in order to not bet the universe on a particular meta-ethical view, it seems that we need to either hold off on building AGI until we definitively solve metaethics (e.g., it’s no longer a contentious subject in academic philosophy), or have an approach/plan that will work out well regardless of which meta-ethical view turns out to be correct.
  
  By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”.
  
  Thanks, I appreciate this, but of course still feel compelled to speak up when I see areas where you seem overly optimistic and/or missing some potential risks/failure modes.