If the NTMs get to look at the predictions of the other NTMs when making their own predictions (there’s probably a fixed-point way to do this), then maybe there’s one out there that copies one of the versions of 3 but makes adjustments for 3’s bad decision theory.
Why not say: “If X is a model using a bad decision theory, there is a closely related model X’ that uses a better decision theory and makes better predictions. So once we have some examples that distinguish the two cases, we will use X’ rather than X”?
Sometimes this kind of argument doesn’t work and you can get tighter guarantees by considering the space of modifications (by coincidence this exact situation arises here), but I don’t see why this case in particular would bring up that issue.
Suppose there are N binary dimensions that predictors can vary on. Then we’d need 2^N predictors to cover every possibility. On the other hand, we would only need to consider N possible modifications to a predictor. Of course, if the dimensions factor that nicely, then you can probably make enough assumptions about the hypothesis class that you can learn from the 2^N experts efficiently.
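To make the counting concrete, here is a minimal sketch in Python. It assumes a predictor is just a tuple of N bits, one per binary dimension, and that a "modification" flips a single bit; the names `all_predictors`, `single_flip_modifications`, and `flips_needed` are made up for illustration and are not part of any proposal in this thread.

```python
from itertools import product


def all_predictors(n):
    """Every assignment of n binary dimensions: 2**n distinct predictors."""
    return list(product((0, 1), repeat=n))


def single_flip_modifications(predictor):
    """The n predictors reachable by flipping exactly one dimension."""
    return [
        predictor[:i] + (1 - predictor[i],) + predictor[i + 1:]
        for i in range(len(predictor))
    ]


def flips_needed(predictor, target):
    """How many single-dimension modifications turn predictor into target."""
    return sum(a != b for a, b in zip(predictor, target))


n = 10
base = (0,) * n
print(len(all_predictors(n)))                # 2**10 = 1024 experts to enumerate
print(len(single_flip_modifications(base)))  # 10 modifications to consider
print(flips_needed(base, (1,) * n))          # 10: worst case is n flips
```

Under these assumptions, any of the 2^N predictors is reachable from any other by at most N single-dimension corrections, which is the sense in which the fixes can be found one at a time rather than all being present in one expert.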
Overall it seems nicer to have a guarantee of the form “if there is a predictable bias in the predictions, then the system will correct this bias” rather than “if there is a strictly better predictor than a bad predictor, then the system will listen to the good predictor”, since it allows capabilities to be distributed among predictors instead of needing to be concentrated in a single predictor. But maybe things work anyway for the reason you gave.