This is a really interesting point. It seems like it goes even further: if the agent were only trying to maximise future expected reward, it would not only be ambivalent between temporary and permanent “Nirvana”, it would also be ambivalent between strategies that achieved Nirvana with arbitrarily different probabilities, right? (Maybe with some caveats about how it would behave if it predicted a strategy might lead to negative-infinite states.)
So if a sufficiently fleshed-out agent is going to assign a non-zero probability of Nirvana to every strategy (or at least most strategies), since Nirvana isn’t impossible, won’t our agent suddenly become incredibly apathetic and just sit there as soon as it reaches a certain level of intelligence?
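To make the apathy worry concrete, here is a toy sketch (my own illustration, not from the discussion above; the names `NIRVANA` and `expected_reward` are hypothetical) of how any nonzero probability of an infinite reward swamps the expected-value calculation:

```python
# Toy model: with an unbounded "Nirvana" reward, expected reward
# ignores how likely a strategy is to actually reach Nirvana.
NIRVANA = float("inf")   # assumed infinite reward for the Nirvana state
MUNDANE = 1.0            # an ordinary finite reward otherwise

def expected_reward(p_nirvana: float) -> float:
    """Expected reward of a strategy reaching Nirvana with probability p > 0."""
    return p_nirvana * NIRVANA + (1 - p_nirvana) * MUNDANE

# A near-certain strategy and a one-in-a-billion strategy come out tied:
print(expected_reward(0.999) == expected_reward(1e-9))  # True: both are inf
```

Since every strategy with even a sliver of Nirvana-probability evaluates to the same infinite expectation, the maximiser has no basis to prefer one over another, which is exactly the indifference described above.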
I guess one way around this is to posit that, however we build these things, their rewards can only be finite. But that seems like (a) something the agent could perhaps undo, or (b) shutting us off from some potentially good reward functions: if an aligned AI valued happy human lives at 1 utilon each, it would seem strange for it not to value somehow bringing about infinitely many of them.