Would I think for ten thousand years?

Some AI safety ideas delegate key decisions to our idealised selves. This is sometimes phrased as “allowing versions of yourself to think for ten thousand years”, or similar sentiments.

Occasionally, when I’ve objected to these ideas, it’s been pointed out that any attempt to construct a safe AI design would involve a lot of thinking, so there can’t be anything wrong with delegating that thinking to an algorithm or to an algorithmic version of myself.

But there is a tension between “more thinking” in the sense of “solving specific problems” and “more thinking” in the sense of “changing your own values”.

An unrestricted “do whatever a copy of Stuart Armstrong would have done after he thought about morality for ten thousand years” seems to positively beg for value drift (worsened by the difficulty in defining what we mean by “a copy of Stuart Armstrong [...] thought [...] for ten thousand years”).

A narrower “have ten copies of Stuart think about these ten theorems for a subjective week each and give me a proof or counter-example” seems much safer.

In between those two extremes, how do we assess the degree of value drift and its potential importance to the question being asked? Ideally, we’d have a theory of human values to help distinguish the cases. Even without that, we can use some common sense on issues like length of thought, nature of problem, bandwidth of output, and so on.
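As a purely illustrative sketch (not a proposal), that kind of common-sense triage could be caricatured as a toy scoring rule over the criteria just named: subjective length of thought, how narrowly the problem is specified, and the bandwidth of the output. Every name, threshold, and weight below is invented for illustration and carries no theoretical weight.

```python
from dataclasses import dataclass

# Illustrative only: the fields, thresholds, and weights are made up,
# not derived from any theory of human values.

@dataclass
class DelegationProposal:
    """A proposed delegation to simulated/idealised copies of a person."""
    subjective_years: float   # how long each copy thinks
    narrowly_specified: bool  # "prove these ten theorems" vs "think about morality"
    output_bits: int          # bandwidth of what comes back (a proof vs open-ended advice)

def value_drift_concern(p: DelegationProposal) -> str:
    """Rough common-sense triage of how worried to be about value drift."""
    score = 0
    if p.subjective_years > 1.0:   # long reflection gives values more room to drift
        score += 2
    if not p.narrowly_specified:   # open-ended moral reflection invites drift
        score += 2
    if p.output_bits > 10_000:     # high-bandwidth output lets drifted values leak through
        score += 1
    return {0: "low", 1: "low", 2: "moderate", 3: "high", 4: "high", 5: "high"}[score]

if __name__ == "__main__":
    theorem_week = DelegationProposal(
        subjective_years=7 / 365, narrowly_specified=True, output_bits=5_000)
    morality_10k = DelegationProposal(
        subjective_years=10_000, narrowly_specified=False, output_bits=10**9)
    print(value_drift_concern(theorem_week))  # -> "low"
    print(value_drift_concern(morality_10k))  # -> "high"
```

The point is only that the relevant features can be made explicit and compared; the actual numbers would have to come from the theory of human values we don’t yet have.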