I guess I’m concerned that there’s some kind of “conservation law for wisdom / folly / scout mindset” in the age of instruction-following AI. If people don’t already have wisdom / scout mindset, I’m concerned that “Instruct the AGI to tell you the truth” won’t create it.
For example, if you ask the AI a question for which there’s no cheap and immediate ground truth / consequences (“Which politician should I elect?”, “Will this alignment approach scale to superintelligence?”), then the AI can say what the person wants to hear, or the AI can say what’s true.
Likewise, if there’s something worth doing that might violate conventional wisdom and make you look foolish, and you ask the AI for a recommendation, the AI can recommend the easy thing that the person wants to hear, or the AI can recommend the hard, annoying thing that the person doesn’t want to hear.
If people are not really deeply motivated to hear things that they don’t want to hear, I’m skeptical that instruction-following AI can change that. Here are three ways for things to go wrong:
- During training (e.g. RLHF), presumably people will upvote the AIs for providing answers that they want to hear, even if they ask for the truth, resulting in AIs that behave that way (see the toy sketch below this list);
- During usage, people could just decide that they don’t trust the AI on thus-and-such type of question. I’m sure they could easily come up with a rationalization! E.g. “well it’s perfectly normal and expected for AIs to be very smart at questions for which there’s a cheap and immediate ground truth, while being lousy at questions for which there isn’t! Like, how would it even learn the latter during training? And as for ‘should’ questions involving tradeoffs, why would we even trust it on that anyway?” The AIs won’t be omniscient anyway; mistrusting them in certain matters wouldn’t be crazy.
- In a competitive marketplace, if one company provides an AI that tells people what they want to hear in cases where there are no immediate consequences, and another company provides an AI that tells people hard truths, people may pick the former.
(To be clear, if an AI is saying things that the person wants to hear in certain cases, the AI will still say that it’s telling the truth, and in fact the AI will probably even believe that it’s telling the truth! …assuming it’s a type of AI that has “beliefs”.)
(I think certain things like debate or training-on-prediction markets might help a bit with the first bullet point, and are well worth investigating for that purpose; but they wouldn’t help with the other two bullet points.)
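To make that first bullet point concrete, here’s a deliberately toy sketch (hypothetical numbers, not any real RLHF pipeline; the “reward model” is just an upvote-rate table) of how a reward signal built from rater upvotes ends up favoring pleasing answers exactly on the questions that lack cheap ground truth:

```python
# Toy illustration (made-up data, not a real RLHF setup): if raters upvote
# answers they like on questions with no checkable ground truth, the learned
# reward ends up favoring agreeable answers over true ones on those questions.

from collections import defaultdict

# Simulated preference data: (question_type, answer_style, rater_upvote).
# On "checkable" questions raters reward accuracy; on "uncheckable" ones
# (elections, long-horizon forecasts) they reward what they wanted to hear.
feedback = [
    ("checkable",   "true",      1),
    ("checkable",   "pleasing",  0),
    ("uncheckable", "true",      0),
    ("uncheckable", "pleasing",  1),
] * 100  # pretend we have many such comparisons

# "Reward model" = average upvote rate per (question_type, answer_style).
counts, totals = defaultdict(int), defaultdict(int)
for qtype, style, vote in feedback:
    counts[(qtype, style)] += vote
    totals[(qtype, style)] += 1
reward = {k: counts[k] / totals[k] for k in totals}

# The policy then picks whichever answer style scores higher, so it tells the
# truth only where truth is what actually got upvoted.
for qtype in ("checkable", "uncheckable"):
    best = max(("true", "pleasing"), key=lambda s: reward[(qtype, s)])
    print(f"{qtype} questions -> policy learns to give the {best} answer")
```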
So anyway, my background belief here is that defending the world against out-of-control AGIs will require drastic, unpleasant, and norm-violating actions. So then the two options to get there would be: (1) people with a lot of scout mindset / wisdom etc. are the ones developing and using instruction-following AGIs, and they take those actions; or (2) make non-instruction-following AGIs, and those AGIs themselves are the ones taking those actions without asking any human’s permission. E.g. “pivotal acts” would be (1), whereas AGIs that deeply care about humans and the future would be (2). I think I’m more into (2) than you, both because I’m (even) more skeptical about (1) than you are, and because I’m less skeptical about (2) than you. But it’s hard to say; I have a lot of uncertainty. (We’ve talked about this before.)
Anyway, I guess I think it’s worth doing technical research towards both instruction-following-AI and AI-with-good-values in parallel.
Regardless, thanks for writing this.
It sounds like you’re thinking of mass deployment. I think if every average joe has control of an AGI capable of recursive self-improvement, we are all dead.
I’m assuming that whoever develops this might allow others to use parts of its capacities, but definitely not all of them.
So we’re in a position where the actual principal(s) are among the smarter people, and at least not bottom-of-the-barrel impulsive and foolish ones. Whether that’s good enough, who knows.
So your points about ways the AI’s wisdom will be ignored should mostly apply to the “safe”, limited versions. I totally agree that the wisdom of the AGI will be limited. But it will grow as its capabilities grow. I’m definitely anticipating it learning after deployment, not just through retraining of its base LLMs. That’s not hard to implement (a minimal sketch below), and it’s a good way to leverage a different type of human training.
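Here’s a minimal sketch of one way post-deployment learning could work without retraining the base LLM: the agent keeps a persistent store of lessons and corrections from its principal and injects the most relevant ones into each new prompt. All class and function names here are made up for illustration; a real agent would use embeddings and an actual LLM call rather than keyword overlap and a JSON file.

```python
# Sketch: post-deployment learning via a persistent lesson store, no retraining.
import json
from pathlib import Path

class LessonMemory:
    def __init__(self, path="lessons.json"):
        self.path = Path(path)
        self.lessons = json.loads(self.path.read_text()) if self.path.exists() else []

    def add(self, lesson: str):
        """Store a correction or instruction from the human principal."""
        self.lessons.append(lesson)
        self.path.write_text(json.dumps(self.lessons, indent=2))

    def relevant(self, query: str, k: int = 3):
        """Crude keyword-overlap ranking; a real agent would use embeddings."""
        words = set(query.lower().split())
        scored = sorted(self.lessons,
                        key=lambda l: len(words & set(l.lower().split())),
                        reverse=True)
        return scored[:k]

def build_prompt(memory: LessonMemory, task: str) -> str:
    """Prepend retrieved lessons to the task before sending it to the LLM."""
    preamble = "\n".join(f"- {l}" for l in memory.relevant(task))
    return f"Lessons learned from prior feedback:\n{preamble}\n\nTask: {task}"

# Example usage: the principal's correction persists across sessions and shapes
# future behavior, even though the underlying model weights never change.
memory = LessonMemory()
memory.add("Check with me before taking actions that are hard to reverse.")
print(build_prompt(memory, "Plan an irreversible infrastructure change"))
```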
I agree that defending the world will require some sort of pivotal act. Optimistically, this would be something like major governments agreeing to outlaw further development of sapient AGIs, and then enforcing that using their AGIs’ superior capabilities. And yes, that’s creepy. I’d far prefer your option 2, a value-aligned, friendly sovereign AGI. I’ve always thought that was the win condition if we solve alignment. But now it’s seeming vastly more likely we’re stuck with option 1. It seems safer than attempting 2 until we have a better option, and more appealing to those in charge of AGI projects.
I don’t see a better option on the table, even if language model agents don’t happen sooner than brainlike AGI that would allow your alignment plans to work. Your plan for mediocre alignment seems solid, but I don’t think the stability problem is solved, so aligning it to human flourishing might well go bad as it updates its understanding of what that means. Maybe reflective stability would be adequate? If we analyzed it some more and decided it was, I’d prefer that plan. Otherwise I’d want to align even brainlike AGI to just follow instructions, so that it can be shut down if it starts going off-course.
I guess the same logic applies to language model agents. You could just give it a top-level goal like “work for human flourishing”, and if reflective stability is adequate and there’s no huge problem with that definition, it would work. But who’s going to launch that instead of keeping it under their control, at least until they’ve worked with it for a while?