Where does the gradient that chisels in the “care about the long-term X over satisfying the homeostatic drives” behavior come from, if not from cases where caring about the long-term X previously resulted in attributable reward? If that behavior is only relevant in rare cases, I expect the gradient to be pretty weak, and correspondingly I don’t expect the behavior it chisels in to be very sophisticated.
https://www.lesswrong.com/posts/roA83jDvq7F2epnHK/better-priors-as-a-safety-problem