I discussed this with Benja at a previous MIRIx workshop. I don’t remember exactly what we concluded, but I think it mostly works; it just requires that people behave sensibly when they get scrubbed predictions.
Now that I think about it: to handle cases where people don’t behave sensibly with scrubbed predictions, maybe we want some kind of sequence of oracles, where oracle 0 outputs nothing and oracle n+1 outputs what would happen if it were replaced with oracle n. We could take the limit as n approaches infinity, but then we don’t know much about which fixed point we will get (it will be controlled by subtle feedback loops). So maybe we want something like n=3 being the most probable value, with n chosen randomly between 0 and 3 so that it’s meaningful to condition on n=0, n=1, or n=2.
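To make the oracle sequence concrete, here is a rough Python sketch under heavy assumptions. The names `oracle`, `toy_world`, `LEVEL_WEIGHTS`, and `sample_level` are all hypothetical, the world is modeled as a deterministic function from the prediction people see to the outcome that follows, and the specific weights are arbitrary (the only property taken from the proposal is that n=3 is the most probable level and the lower levels keep nonzero probability).

```python
import random
from typing import Callable, Optional

# A prediction is either a number or None ("the oracle outputs nothing").
Prediction = Optional[float]


def oracle(n: int, world: Callable[[Prediction], float]) -> Prediction:
    """Level-n oracle: level 0 is silent; level n+1 reports what would
    happen if it were replaced with the level-n oracle."""
    prediction: Prediction = None
    for _ in range(n):
        prediction = world(prediction)
    return prediction


def toy_world(prediction: Prediction) -> float:
    """Hypothetical world: the outcome is 10.0 when no prediction is shown;
    otherwise people react to the prediction and shift the outcome."""
    if prediction is None:
        return 10.0
    return 0.5 * prediction + 2.0


# Assumed weights, chosen only for illustration: n=3 is most probable,
# and lower levels have nonzero probability so conditioning on them
# is meaningful.
LEVEL_WEIGHTS = {0: 0.1, 1: 0.1, 2: 0.1, 3: 0.7}


def sample_level(rng: random.Random) -> int:
    """Draw the oracle level n according to LEVEL_WEIGHTS."""
    levels = list(LEVEL_WEIGHTS)
    weights = [LEVEL_WEIGHTS[k] for k in levels]
    return rng.choices(levels, weights=weights)[0]


if __name__ == "__main__":
    for n in range(4):
        print(n, oracle(n, toy_world))
    print("sampled level:", sample_level(random.Random(0)))
```

In this toy world the levels give nothing, 10.0, 7.0, and 5.5, heading toward the fixed point 4.0. That fixed point is unique only because the toy reaction is a contraction; in a realistic world the limit could depend on the feedback loops mentioned above, which is the reason for stopping at a small n.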