I’m pretty confused about almost everything you said about “innate reward system”.
My view is: the relevant part of the human innate reward system (the part related to compassion, norm-following, etc.) consists of maybe hundreds of lines of code, and nobody knows what they are, and I would feel better if we did. (And that happens to be my own main research interest.)
Whereas your view seems to be: umm, I’m not sure, I’m gonna say things and you can correct me. Maybe you think that (1) the innate reward system is simple, (2) when we do RLHF, we are providing tens of thousands of samples of what the innate reward system would do in different circumstances, (3) and therefore ML will implicitly interpolate how the innate reward system works from that data, (4) …and this will continue to extrapolate to norm-following behavior etc. even in out-of-distribution situations like inventing new society-changing technology. Is that right? (I’m stating this possible argument without endorsing or responding to it, I’m still at the trying-to-understand-you phase.)
My general model of how the innate reward system works is roughly the following:
I agree with the claim that the innate reward system is simple.
The innate reward system exploits the fact that it can edit the weights and code of the brain, although it is limited by biology’s quirks (for example, its completely uninterpretable neurons). It uses backpropagation, or some weaker variant of it, to apply gradient updates, much as RLHF or DPO (or whatever the specific variant is) trains a reward model for preference alignment. It trains continuously, online, on a lot of examples.
Yes, the ML/AI algorithm learns to interpolate from the data, and via weak priors plus the learned examples, it eventually learns how the innate reward system works and what the reward function is.
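To make the previous two points concrete, here is a minimal, purely illustrative sketch of what “train a reward model online on preference examples, then interpolate the reward function from them” could look like in ML terms. The architecture, feature dimension, data, and hyperparameters are all made up for the example; this is not a claim about the brain’s actual mechanism or about any particular RLHF/DPO pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical "true" reward function that the preference data implicitly encodes.
true_w = torch.randn(16)
def true_reward(x):
    return x @ true_w

# Small learned reward model (a stand-in for whatever the learner ends up with).
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(reward_model.parameters(), lr=1e-2)

def preference_pair():
    """One preference example: two situations, labeled by the true reward."""
    a, b = torch.randn(16), torch.randn(16)
    return (a, b) if true_reward(a) > true_reward(b) else (b, a)

# Fully online training: one small gradient step per incoming example.
for _ in range(20_000):
    preferred, rejected = preference_pair()
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()   # Bradley-Terry / DPO-style objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# Check interpolation: does the learned model rank *new* pairs correctly?
correct = 0
for _ in range(1_000):
    preferred, rejected = preference_pair()
    correct += int(reward_model(preferred) > reward_model(rejected))
print("held-out ranking accuracy:", correct / 1_000)
```

The held-out check at the end is the point: a reward model trained this way can rank pairs it has never seen, which is the sense in which it “learns what the reward function is” from examples.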
I think one key reason why we can navigate out-of-distribution situations is that the innate reward system is fully online: whenever it faces an out-of-distribution situation, it can react on the timescale of the rest of the brain and take action.
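As a toy illustration of why being fully online matters (again with made-up numbers and a deliberately simple setup, not a model of the brain): a learner that keeps updating on every example can track a target that drifts out of its original distribution, while a model frozen after an initial training phase cannot.

```python
import numpy as np

rng = np.random.default_rng(0)
lr = 0.05
w_online = np.zeros(4)        # keeps updating on every example, forever
w_frozen = np.zeros(4)        # snapshot taken after an initial training phase

for t in range(5_000):
    drifting_w = np.array([1.0, -2.0, 0.5, 3.0]) * (1.0 + 0.001 * t)  # slow drift
    x = rng.normal(size=4)
    y = drifting_w @ x
    # Online learner: one small correction per example (LMS / SGD step).
    w_online += lr * (y - w_online @ x) * x
    if t == 1_000:
        w_frozen = w_online.copy()   # "training ended here" for the frozen model

print("online learner error: ", np.abs(drifting_w - w_online).mean())
print("frozen snapshot error:", np.abs(drifting_w - w_frozen).mean())
```

Nothing here is specific to reward learning; it is just the generic point that per-example updates let the system respond to distribution shift on roughly the same timescale as the shift itself.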
At the very least, this is a possible sketch of how we could make a reward system that lets us align the AI.
Regarding the idea that there is a short code for how the innate reward system works:
My view is: the relevant part of the human innate reward system (the part related to compassion, norm-following, etc.) consists of maybe hundreds of lines of code, and nobody knows what they are, and I would feel better if we did. (And that happens to be my own main research interest.)
I agree with the view that there probably is a short, powerful piece of code behind the innate reward system in humans, for the same reason as my argument that the priors from genetics are probably very weak.
My claim here is that even a weaker reward model trained with local update rules is already enough to make alignment very likely, for the same reason that the innate reward system is able to reliably instill a lot of preferences, such as empathy for the ingroup, revenge when we are harmed, etc.
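To give one concrete (and deliberately crude) stand-in for what I mean by a weaker, more local update rule: a rule that only ever sees a global scalar “better or worse?” reward signal, with no backpropagated error, can still push parameters toward what the reward function wants, just more slowly and noisily. The setup below is purely illustrative; the target, the dimensions, and the perturbation scale are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(size=8)     # what the (fixed) reward function "wants"
params = np.zeros(8)            # tiny stand-in "policy" parameters

def reward(p):
    return -np.sum((p - target) ** 2)

# Weaker-than-backprop rule: random perturbation plus a global reward signal.
# Keep the perturbation if reward improved; no gradients are ever computed.
for _ in range(5_000):
    candidate = params + rng.normal(scale=0.05, size=8)
    if reward(candidate) > reward(params):
        params = candidate

print("distance to target:", float(np.linalg.norm(params - target)))
```

It converges far less efficiently than gradient descent would, which is the sense in which it is “weaker stuff”, but it still ends up near whatever the reward function specifies.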
Your algorithm seems like a very good thing, if we could get at it, but even the weaker stuff enabled by SGD probably is enough to ensure alignment with very high probability.