[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?
Going by the Risks from Learned Optimization sequence, it’s not clear if mesa-optimization is a big threat if the model continues to be updated throughout deployment. I suspect this has been discussed before (links welcome), but I didn’t find anything with a quick search.
Lifelong/online/continual learning is popular and could become the norm. I’m interested in how that paradigm (and others, if relevant) fits into beliefs about mesa-optimization risk.
If you believe the arguments hold up under a lifelong learning paradigm, is that because there could still be enough time between updates for a mesa-optimizer to defect, or for some other reason? If you instead believe the train-test paradigm is likely to stick around, why?