I think this is roughly the idea behind a satisficer: build something that never tries too hard, so it never reaches for the "conquer the world" class of solutions, since those are far too extreme and "good enough" can be had for much less (see the sketch below). That said, I'm not sure satisficers have actually been shown to be fully safe either.
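As a toy sketch of that idea (my own illustration, not from any of the linked posts; the actions, utilities, and threshold are all made up), a satisficer scans options from mild to extreme and stops at the first one that is good enough, rather than maximizing:

```python
def satisficing_choice(actions, utility, threshold):
    """Scan options from least to most extreme and take the first one
    whose utility is good enough, instead of maximizing."""
    for action in actions:  # assumed ordered from mild to extreme
        if utility(action) >= threshold:
            return action
    return actions[-1]  # nothing satisficed; fall back to the last option

# "conquer the world" scores highest, but the satisficer stops at the
# first good-enough plan and never even considers the extreme one.
utilities = {"do nothing": 0.1, "modest plan": 0.8, "conquer the world": 1.0}
actions = ["do nothing", "modest plan", "conquer the world"]
print(satisficing_choice(actions, utilities.get, threshold=0.7))  # "modest plan"
```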
Something like this is argued to be why humans are remarkably well aligned to our basic homeostatic drives: the only real failure modes are obesity, drugs, and maybe alcohol, which misalign us with those basic needs. Hedonic treadmills/loops essentially tame the RL part of us and ensure that reward isn't the optimization target in practice, as in TurnTrout's post below:
https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target
Similarly, the two posts by beren below explain how a PID-style control loop may be helpful for alignment:
https://www.lesswrong.com/posts/3mwfyLpnYqhqvprbb/hedonic-loops-and-taming-rl
https://www.beren.io/2022-11-29-Preventing-Goodheart-with-homeostatic-rewards/
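To make the homeostatic-reward idea concrete, here's a minimal sketch of my own (not code from either post; the gains, setpoint, and toy dynamics are assumptions): reward peaks at a setpoint and falls off with deviation, so there's nothing to gain by pushing the underlying variable to extremes, and a textbook PID controller just steers the variable back toward the setpoint:

```python
def homeostatic_reward(x, setpoint):
    """Reward is maximal at the setpoint and decreases with deviation,
    so 'more' of the underlying variable stops being better."""
    return -abs(x - setpoint)

class PID:
    """Textbook PID controller: proportional, integral, derivative terms."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy loop: the controller nudges x toward the setpoint rather than
# maximizing it without bound; reward peaks at the setpoint.
pid = PID(kp=0.5, ki=0.01, kd=0.1, setpoint=100.0)
x = 70.0
for _ in range(20):
    x += pid.step(x)
print(f"x = {x:.1f}, reward = {homeostatic_reward(x, 100.0):.1f}")
```

The point of the sketch is that the "reward function" here is anti-Goodhart by construction: overshooting the setpoint is penalized exactly like undershooting it, so the control loop has no incentive to run the variable to an extreme.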