I mean, I find the whole “models don’t want rewards, they want proxies of rewards” conversation kind of pointless because nothing ever perfectly matches anything else, so I agree that in a common-sense way it’s fine to express this as “wanting reward”. But also, I think the people who care a lot about the distinction between proxies of reward and actual reward would feel justifiably misunderstood by this.
I think I agree with “nothing ever perfectly matches anything else”, and in particular, philosophically, there are many different precisifications of “reward/reinforcement” which are conceptually distinct, and it’s unclear which one, if any, a reward-seeking AI would seek. E.g. is it about a reward counter on a GPU somewhere going up, or is it about the backpropagation actually happening?
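To make those two readings a bit more concrete, here's a minimal toy sketch (assuming a hypothetical REINFORCE-style bandit, not anything from the conversation): the logged reward total going up is one event, and the parameter update that actually changes the policy is a separate one; either can in principle happen without the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-armed bandit with a softmax policy over logits `theta`.
theta = np.zeros(2)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

reward_counter = 0.0  # reading 1: a number somewhere going up

for step in range(1000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0  # arm 1 pays off

    # Reading 1: the recorded reward total increments here...
    reward_counter += reward

    # Reading 2: ...but the policy only changes if this gradient
    # step (the analogue of backprop actually happening) runs.
    grad_log_prob = np.eye(2)[action] - probs
    theta += 0.1 * reward * grad_log_prob

print("total reward logged:", reward_counter)
print("final policy probs:", softmax(theta))
```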