Doesn’t the “threat” to delete the model have to be DT-credible rather than merely “credible conditioned on being human-made”, given that LW, with all its discussion of threat resistance and of ignoring threats, is in the training sets?
(If I remember correctly, a decision theory must ignore both “you’re threatened into not doing X, and the other agent claims it will respond in a way that makes even it lose in expectation” and “another agent [self-]modifies/instantiates an agent so that they prefer that you don’t do X”.)