I’m not too interested in litigating what other people were saying in 2015, but OP is claiming (at least in the comments) that “RLHF’d foundation models seem to have common-sense human morality, including human-like moral reasoning and reflection” is evidence for “we’ve made progress on outer alignment”. If so, here are two different ways to flesh that out:
1. An RLHF’d foundation model acts as the judge / utility function, and some separate system comes up with plans that optimize it—a.k.a. “you just need to build a function maximizer that allows you to robustly maximize the utility function that you’ve specified”.
I think this plan fails because RLHF’d foundation models have adversarial examples today, and will continue to have adversarial examples into the future. (To be clear, humans have adversarial examples too, e.g. drugs & brainwashing.)
2. There is no “separate system”; rather, an RLHF’d foundation model (or something like it) is the whole system we’re talking about here. For example, we may note that if you hook up an RLHF’d foundation model to tools and actuators, then it will actually use those tools and actuators in accordance with common-sense morality etc.
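The failure mode in 1 is Goodhart-style: an optimizer pointed at an imperfect judge finds the judge’s adversarial examples rather than the intended target. Here is a toy sketch—the functions and numbers are all invented for illustration, with `judge` standing in for a learned reward model that fits well on-distribution but has a spurious off-distribution peak:

```python
import numpy as np

def true_reward(x):
    # What we actually want: peaked at x = 1, near zero elsewhere.
    return np.exp(-(x - 1.0) ** 2)

def judge(x):
    # Stand-in for a learned reward model: agrees with true_reward on the
    # "training distribution" near x = 1, but has a spurious high-scoring
    # bump far off-distribution -- an adversarial example for the judge.
    return np.exp(-(x - 1.0) ** 2) + 2.0 * np.exp(-((x - 9.0) / 0.5) ** 2)

# The "separate system": search hard for whatever the judge scores highest.
candidates = np.linspace(-10.0, 15.0, 100001)
best = candidates[np.argmax(judge(candidates))]

print(round(float(best), 2))        # lands on the spurious bump near x = 9
print(round(float(judge(best)), 2)) # ~2.0, higher than any honest option
print(round(float(true_reward(best)), 4))  # ~0.0: high judged score, low true reward
```

The harder the search, the more reliably it lands on the judge’s mistakes rather than on whatever the judge was meant to proxy for.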
(I think 2 is the main intuition driving the OP, and 1 was a comments-section derailment.)
As for 2:
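Concretely, the system in 2 is the model itself placed in a loop with tools; nothing else supplies the optimization. A minimal caricature—the stub model, tool names, and transcript format are all invented for illustration; a real system would call an actual RLHF’d model:

```python
def stub_model(transcript: str) -> str:
    # Stand-in for an RLHF'd foundation model. It caricatures
    # common-sense-morality behavior: refuse a destructive request,
    # otherwise use a tool and then wrap up.
    if "delete all backups" in transcript:
        return "FINAL: I won't do that."
    if "OBSERVATION:" in transcript:
        return "FINAL: done."
    return "ACT: list_files"

TOOLS = {"list_files": lambda: "notes.txt, todo.txt"}  # the "actuators"

def run_agent(request: str) -> str:
    transcript = f"USER: {request}"
    for _ in range(5):  # bounded agent loop
        out = stub_model(transcript)
        if out.startswith("FINAL:"):
            return out
        tool = out.removeprefix("ACT: ")
        transcript += f"\nOBSERVATION: {TOOLS[tool]()}"
    return "FINAL: step limit reached"

print(run_agent("list my files"))       # FINAL: done.
print(run_agent("delete all backups"))  # FINAL: I won't do that.
```

The point of 2 is that, empirically, the model component behaves in accordance with common-sense morality when wired up this way—the whole question is why, and whether that generalizes.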
I’m sympathetic to the argument that this system might not be dangerous, but I think its load-bearing ingredient is that foundation models tend to do intelligent things primarily by emitting human-like outputs for human-like reasons, thanks in large part to self-supervised pretraining on internet text. Let’s call that “imitative learning”.
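As a caricature of that objective: all of the model’s behavior comes from matching the statistics of human-produced text. A toy bigram version—nothing like a real training setup, but the objective has the same shape:

```python
from collections import Counter, defaultdict

# "Pretraining corpus" of human text (toy).
corpus = "the cat sat on the mat the cat sat on the rug".split()

# "Training" = tally which token follows which; the only objective
# is to match the source distribution.
bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

print(bigrams["the"].most_common())  # [('cat', 2), ('mat', 1), ('rug', 1)]

# Greedy "generation": always emit the most frequent next token.
tok, out = "the", ["the"]
for _ in range(5):
    tok = bigrams[tok].most_common(1)[0][0]
    out.append(tok)
print(" ".join(out))  # the cat sat on the cat
```

Everything this “model” emits is human-like by construction; that, scaled up enormously, is the load-bearing ingredient.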
Indeed, here’s a 2018 post where Eliezer (as I read it) implies that he hadn’t really been thinking about imitative learning before (he calls it an “interesting-to-me idea”), and suggests that imitative learning might “bypass the usual dooms of reinforcement learning”. So I think there is a real update here—if you believe that imitative learning can scale to real AGI.
…But the pushback (from Rob and others) in the comments is mostly coming from a mindset in which imitative learning cannot scale to real AGI. I think the commenters are failing to articulate this mindset well, but they are in various places leaning on certain intuitions about how future AGI will work (e.g. beliefs deeply separate from goals), and these intuitions are incompatible with imitative learning being the primary source of optimization power in the AGI (as it is today, I claim).
(A moderate position is that imitative learning will be less and less relevant as e.g. o1-style RL post-training becomes a bigger relative contribution to the trained weights; this would presumably lead to increased future capabilities hand-in-hand with increased future risk of egregious scheming. For my part, I subscribe to the more radical theory that our situation is even worse than that: I think future powerful AGI will be built via a different AI training paradigm that basically throws out the imitative learning part altogether.)
I have a forthcoming post (hopefully) that will discuss all of this in much more depth.