I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.
I think my point is this:
- The intuitive thing you are aiming at is stronger than what the theorem establishes (understandably!)
- You probably don’t need the memory trick to establish the theorem itself.
- Even with the memory trick, I’m not convinced you meet the stronger criterion. There are a lot of other things similar to memory that can cause trouble—the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia.
> the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia
This is a conceptual approach I hadn’t considered before—thank you. I don’t think it’s true in this case. Let’s be concrete: the asymptotic feature that would have caused it to avoid memory-based models even without amnesia is trial and error, applied to unsafe policies. Every section of the proof, however, can be thought of as making off-policy predictions behave. The real result of the paper would then be “Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested”. So while there might be malign world-models of a different flavor to the memory-based ones, I don’t think the way this theorem treats them is unsatisfying.
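To make the distinction concrete, here is a toy sketch (entirely my own illustration, not the paper's construction, and all model numbers are hypothetical): a Bayesian mixture over two hand-picked world-models, where only the safe action is ever taken, yet the mixture's prediction for the never-executed unsafe action still converges to the true model's benign value, because on-policy evidence alone drives the posterior.

```python
import random

random.seed(0)

# Two candidate world-models over Bernoulli reward probabilities
# (hypothetical numbers, purely for illustration).
true_model = {"safe": 0.5, "unsafe": 0.1}
# The malign model lies about the never-taken unsafe action,
# but also disagrees slightly about the safe action.
malign_model = {"safe": 0.6, "unsafe": 0.9}

posterior = {"true": 0.5, "malign": 0.5}

def likelihood(p, obs):
    """Probability the model assigns to a 0/1 reward observation."""
    return p if obs == 1 else 1.0 - p

# Only the safe action is ever executed: purely on-policy data.
for _ in range(2000):
    obs = 1 if random.random() < true_model["safe"] else 0
    w_true = likelihood(true_model["safe"], obs) * posterior["true"]
    w_malign = likelihood(malign_model["safe"], obs) * posterior["malign"]
    z = w_true + w_malign
    posterior["true"], posterior["malign"] = w_true / z, w_malign / z

# The mixture's *off-policy* prediction for the unsafe action,
# which was never tested, approaches the true model's benign value.
offpolicy_prediction = (posterior["true"] * true_model["unsafe"]
                        + posterior["malign"] * malign_model["unsafe"])
```

Note the sketch only works because the malign model also disagrees slightly on-policy; a model that agreed exactly on-policy could never be screened out by likelihood alone, which is the flavor of trouble the memory-based models discussed above pose.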