Daniel Kokotajlo comments on (My understanding of) What Everyone in Technical Alignment is Doing and Why

Daniel Kokotajlo 2 Sep 2022 20:37 UTC
LW: 6 AF: 2
Thanks!

I’m not sure the framing is helpful either, but reading Turner’s linked appendix, it does seem that various people are making a mistake that can be summarized as “they seem to think the policy / trained network should be understood as trying to get reward, as preferring higher-reward outcomes, as targeting reward...” (and Turner says he himself was one of them, despite doing a PhD in RL theory). Like I said above, I think there’s probably room for improvement here: if everyone defined their terms better, this problem would clear up and go away. I see Turner’s post as movement in this direction, but by no means the end of the journey.

Re your first argument: If I understand you correctly, you are saying that if your AI design involves something like Monte Carlo tree search using a reward-estimator module (I don’t know what the technical term for that is), and the reward-estimator module is just trained to predict reward, then it’s fair to describe the system as optimizing for the goal of reward. Yep, that seems right to me, modulo concerns about inner alignment failures in the reward-estimator module. I don’t see this as contradicting Alex Turner’s claims, but maybe it does.
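
To make the kind of design I have in mind concrete, here is a minimal sketch. It is an illustration under loud assumptions, not anyone's actual system: it uses plain Monte Carlo rollouts rather than full Monte Carlo tree search, the environment is a made-up one-dimensional line, and every name in it (`RewardEstimator`, `plan`, `step`, `ACTIONS`) is hypothetical. The only point is that the planner maximizes whatever the estimator predicts, and the estimator is trained only to predict reward.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)  # hypothetical toy action set: step left or right on a line


def step(state, action):
    """Assumed known, deterministic transition model for the toy world."""
    return state + action


class RewardEstimator:
    """Tabular stand-in for a learned module trained only to predict reward."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def fit(self, samples):
        # samples: iterable of (state, observed_reward) pairs
        for state, reward in samples:
            self._sum[state] += reward
            self._count[state] += 1

    def predict(self, state):
        # Mean observed reward for this state; 0.0 for states never seen.
        n = self._count.get(state, 0)
        return self._sum[state] / n if n else 0.0


def plan(state, estimator, horizon=3, rollouts=20, gamma=0.9, seed=0):
    """Pick the first action whose random rollouts score highest
    under the estimator's predicted (discounted) reward."""
    rng = random.Random(seed)
    best_action, best_value = None, float("-inf")
    for first in ACTIONS:
        total = 0.0
        for _ in range(rollouts):
            s, action = state, first
            for t in range(horizon):
                s = step(s, action)
                total += (gamma ** t) * estimator.predict(s)
                action = rng.choice(ACTIONS)  # random continuation
        value = total / rollouts
        if value > best_value:
            best_action, best_value = first, value
    return best_action


# The true reward is 1.0 at state 3 and 0.0 elsewhere (a toy assumption).
estimator = RewardEstimator()
estimator.fit([(s, 1.0 if s == 3 else 0.0) for s in range(-5, 6)])
chosen = plan(2, estimator)  # steps toward state 3, i.e. chooses +1
```

The search never sees the true reward, only `estimator.predict`. If the estimator were fit on data that wrongly credited state 1 instead of state 3, the same planner would head left from state 2, which is the toy version of the inner-alignment caveat.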

Re your second argument, the appeal to authority: I suppose in a vacuum, not having thought about it myself or heard any halfway decent arguments, I’d defer to the RL field on this matter. But I have thought about it a bit myself and I have heard some decent arguments, and that effect is stronger than the deference effect for me, and I think this is justified.
- David Scott Krueger 3 Sep 2022 20:58 UTC
  LW: 4 AF: 3
  Re appeal to authority: I mostly mentioned it because you asked for an argument, and I figured I would just provide any decent ones I thought of off the top of my head. But I have not provided anything close to my full thoughts on the matter, and probably won’t, due to limited bandwidth.