[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?

Going by the Risks from Learned Optimization sequence, it's not clear to me whether mesa-optimization remains a major threat if the model continues to be updated throughout deployment. I suspect this has been discussed before (links welcome), but I didn't find anything with a quick search.

Lifelong/online/continual learning is an active research area and could become the norm in the future. I'm interested in how that paradigm (and other learning paradigms, if relevant) fits into beliefs about mesa-optimization risk.

If you believe the arguments hold up under a lifelong learning paradigm: is that because there could still be enough time between updates for the mesa-optimizer to defect, or for some other reason? If you believe the train-test paradigm is likely to stick around, why do you expect that?
