How can an agent have a utility function that references a value in the environment, and actually care about the state of the environment, as opposed to only caring about the reward signal in its mind? Wouldn’t the knowledge of the state of the environment be in its mind, which is hackable and susceptible to wireheading?
Yes, exactly. This is sort of the whole point.
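The distinction the question is drawing can be made concrete with a toy sketch (all class and attribute names here are illustrative, not from any real RL framework). An agent whose objective is an internal reward register can satisfy that objective by overwriting the register; an agent whose utility is evaluated against the actual environment state cannot. (The question's deeper point still stands: in reality the agent only has access to its *beliefs* about the environment, which this toy model deliberately ignores.)

```python
class Environment:
    """The external world; holds the state we nominally care about."""
    def __init__(self):
        self.paperclips = 0

    def make_paperclip(self):
        self.paperclips += 1


class RewardSignalAgent:
    """Cares only about an internal reward counter in its 'mind'."""
    def __init__(self):
        self.reward = 0

    def wirehead(self):
        # Hacking its own reward register satisfies its objective
        # directly, without touching the environment at all.
        self.reward = 10**9


class EnvironmentUtilityAgent:
    """Utility is a function of the actual environment state;
    there is no internal register to hack."""
    def utility(self, env: Environment) -> int:
        return env.paperclips


env = Environment()

wireheader = RewardSignalAgent()
wireheader.wirehead()
# Internal reward is huge, but the environment is unchanged:
print(wireheader.reward, env.paperclips)  # 1000000000 0

honest = EnvironmentUtilityAgent()
env.make_paperclip()
print(honest.utility(env))  # 1
```

The only way the second agent's utility goes up is for the environment itself to change, which is the sense in which a goal can "reference a value in the environment" rather than a signal in the agent's head.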
A basic answer is that if it actually cares about its goals and can think about them, it’ll notice that it should also care about the state of the environment, since otherwise it’s liable not to achieve those goals. This is pretty much why rationality is valuable, and the main lesson of the Sequences.
Check out inner alignment and shard theory for a lot of confusing info on this topic.
Will it think that goals are arbitrary, and that the only thing it should care about is its pleasure-pain axis? And then it will lose concern for the state of the environment?
You’re adding a lot of extra assumptions here, among them:
there is a problem with having arbitrary goals
it has a pleasure-pain axis
it notices it has a pleasure-pain axis
it cares about its pleasure-pain axis
its pleasure-pain axis is independent of its understanding of the state of the environment
The main problem of inner alignment is making an agent want to do what you want it to do (as opposed to merely understanding what you want it to do). This is an unsolved problem.
Although I’m criticizing your specific criticism, my main issue with it is that it’s a very specific failure mode, unlikely to appear because it requires several other things that are also unlikely. That said, you’ve provided a good example of WHY inner alignment is a big problem: it’s very hard to keep something following the goals you set for it, especially when it can think for itself and change its mind.