What we think is that we might someday build an AI advanced enough that it can, by itself, devise plans for a given goal X and execute them. Is this so otherworldly? Given current progress, I don’t think so.
I don’t think so either. AGIs will likely be capable of understanding what we mean by X and making plans for exactly that if they want to help. The problem is that the AGIs may have other goals in mind by then.
As for reinforcement learning, even if it now seems impossible to build AGIs with utility functions under that paradigm, nothing assures us that it will be the paradigm used to build the first AGI.
Sure, it may be that some other paradigm allows us more control over the utility functions. User tailcalled mentioned John Wentworth’s research (which I will proceed to study, as I haven’t yet done so in depth).
(Unless the first AGI can’t be told to do anything at all, but then we would already have lost the control problem.)
I’m afraid that this may be quite a likely outcome if we don’t make much progress in alignment research.
Regarding what the AGI will want in that case, I expect it to depend a lot on the training regime and on its internal motivation modules (somewhat analogous to the subcortical areas of the brain). My threat model is quite similar to the one defended by Steven Byrnes in articles such as this one.
In particular, I think AI developers will likely give the AGI “creativity modules” responsible for generating intrinsic reward whenever it discovers interesting patterns or abilities. This will help the AGI stay motivated and keep learning to solve harder and harder problems when external reward is sparse, which I predict will be extremely useful for making the AGI more capable. But I expect the internalization of such intrinsic rewards to end up producing utility functions that assign nearly unbounded value to knowledge and computational power, and that are quite possibly hostile to us.
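To make the “creativity module” idea concrete, here is a minimal toy sketch (my own illustration, not an actual proposal or anyone’s published design): intrinsic reward proportional to the prediction error of a crude world model, so the agent gets paid for finding transitions it cannot yet predict. All names and the linear model are made up for the example.

```python
import numpy as np

class CuriosityModule:
    """Toy 'creativity module': rewards the agent for surprising transitions."""

    def __init__(self, n_features, lr=0.01, scale=1.0):
        self.w = np.zeros((n_features, n_features))  # crude linear world model
        self.lr = lr
        self.scale = scale

    def intrinsic_reward(self, state, next_state):
        pred = self.w @ state                        # model's guess of the next state
        error = next_state - pred
        self.w += self.lr * np.outer(error, state)   # learn from the surprise
        return self.scale * float(error @ error)     # big surprise -> big reward


def total_reward(extrinsic, state, next_state, curiosity):
    # External reward may be sparse; the intrinsic term keeps learning going.
    return extrinsic + curiosity.intrinsic_reward(state, next_state)
```

The worry in the paragraph above is then that an agent trained on something like this internalizes “more surprise, more knowledge, more compute” as something close to terminal values.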
I don’t think all is lost, though. Our brains provide an example of a relatively well-aligned intelligence: our own higher reasoning in the telencephalon seems reasonably well aligned with the evolutionarily ancient, primitive subcortical modules (not so much with evolution’s base objective of reproduction, though). I’m not sure how much work evolution had to do to align these two modules. I’ve heard at least one person argue that maybe higher intelligence didn’t evolve earlier because of the difficulty of aligning it. If so, that would be pretty bad.
Also, I’m somewhat more optimistic than others about the prospect of creating myopic AGIs that strongly crave short-term rewards that we do control. I think it might be possible (with a lot of effort) to keep such an AGI contained in a box even if it is more intelligent than humans in general, and that such an AGI could help us with the overall control problem.
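By “myopic” I mean roughly the standard discounted-return picture: push the discount factor toward zero and the agent cares almost exclusively about the immediate reward we hand out. A toy numeric sketch (illustrative only, values made up):

```python
def discounted_return(rewards, gamma):
    # Standard discounted return; gamma near 0 makes the agent myopic,
    # caring almost only about the immediate reward we control.
    return sum(r * gamma**t for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 100.0]               # big payoff far in the future
print(discounted_return(rewards, gamma=0.99))  # ~98.0: far-sighted agent chases it
print(discounted_return(rewards, gamma=0.01))  # ~1.0001: myopic agent mostly ignores it
```

Whether a very capable agent actually stays myopic under training pressure is of course the hard part; this only shows what the objective is pointing at.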
“I’m afraid that this may be quite a likely outcome if we don’t make much progress in alignment research.”
OK, I understand your position better now. That is, that we shouldn’t worry so much about what to tell the genie in the lamp, because we probably won’t even have a say to begin with. Sorry for not quite getting it at first.
That sounds reasonable to me.
Personally, I (also?) think that the right “values” and the right training are more important. After all, as Stuart Russell would say, building an advanced agent as a utility maximizer would produce chaos anyway, since it would tend to set the remaining variables that it is not maximizing to absurd values.
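A toy illustration of Russell’s point (variable names entirely made up by me): if the objective mentions only one quantity, the optimizer is free to push everything else to an extreme.

```python
# Hypothetical example: the utility only mentions "paperclips"; whatever resource
# is left for humans never appears in the objective, so it gets pushed to zero.
def utility(paperclips, iron_left_for_humans):
    return paperclips

feasible = ((p, iron) for p in range(101) for iron in range(101) if p + iron <= 100)
best = max(feasible, key=lambda a: utility(*a))
print(best)  # (100, 0): every unit of the shared resource converted to paperclips
```

Real objectives are obviously not two integers, but the qualitative failure mode is the same.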
That is, that we shouldn’t worry so much about what to tell the genie in the lamp, because we probably won’t even have a say to begin with.
I think you summarized it quite well, thanks! Written like that, the idea is clearer than what I wrote, so I’ll probably edit the article to include this claim explicitly. This really is what motivated me to write this post in the first place.
Personally, I (also?) think that the right “values” and the right training are more important.
You can keep the “also”; I agree with you.
Given the current state of confusion on this matter, I think we should focus on how values might be shaped by architecture and training regimes, and try to make progress on that even if we don’t know exactly what human values are or what utility functions they represent.