What is the breakthrough? It was discovered that neural networks tend to produce “average” predictions on OOD (out-of-distribution) inputs.
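As a concrete illustration of that claim, here is a minimal sketch (PyTorch assumed; `model`, `train_loader`, and the input shape are hypothetical placeholders, not the paper’s code) of how one might measure “reversion to the OCS”: compare the classifier’s predictions on OOD inputs against the optimal constant solution, which for cross-entropy loss is just the marginal label distribution of the training set.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def marginal_label_distribution(train_loader, num_classes):
    """Empirical class frequencies: the optimal constant solution (OCS) for cross-entropy."""
    counts = torch.zeros(num_classes)
    for _, y in train_loader:
        counts += torch.bincount(y, minlength=num_classes).float()
    return counts / counts.sum()

@torch.no_grad()
def mean_kl_to_ocs(model, inputs, ocs, eps=1e-12):
    """Average KL(model prediction || OCS) over a batch; smaller means closer to the OCS."""
    probs = F.softmax(model(inputs), dim=-1)
    kl = (probs * (probs.clamp_min(eps).log() - ocs.clamp_min(eps).log())).sum(dim=-1)
    return kl.mean()

# Hypothetical usage: the claim predicts this KL gap shrinks as inputs move
# further OOD, e.g. pure Gaussian noise should land closer to the OCS than
# held-out in-distribution data does.
# ocs = marginal_label_distribution(train_loader, num_classes=10)
# print(mean_kl_to_ocs(model, torch.randn(256, 3, 32, 32), ocs))
```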
This seems like somewhat good news about Goodhart’s curse being easier to solve! (Mostly I’m thinking about how much worse it would be if this weren’t the case.)
Producing average predictions OOD is also, in some sense, evidence against goal-directed inner optimizers, though it depends on what counts as “OOD.”
Yep, that’s the big one. It suggests that NNs aren’t doing anything very weird when we subject them to OOD inputs, which is evidence against the emergence of inner optimizers, or at least against misaligned inner optimizers.
I actually now think the OOD paper is pretty weak evidence about inner optimizers.
I think the OOD paper tells you what happens as the low-level and mid-level features stop reliably/coherently firing because you’re adding in so much noise. Like, if my mental state got increasingly noised across all modalities, I think I’d probably adopt some constant policy too, because none of my coherent circuits/shards would be properly interfacing with the other parts of my brain. But I don’t think that tells you much about alignment-relevant OOD.
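For what it’s worth, the “constant policy under noise” picture is easy to probe directly. A sketch assuming a hypothetical trained `model` and an in-distribution `clean_batch` (not anything from the paper): as the noise scale grows, check whether the predictions stop varying across the batch at all.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prediction_spread(model, clean_batch, noise_scales=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """For each noise level, how much do the model's predictions differ across the batch?"""
    spreads = []
    for sigma in noise_scales:
        noisy = clean_batch + sigma * torch.randn_like(clean_batch)
        probs = F.softmax(model(noisy), dim=-1)
        # Std of each class probability across the batch, averaged over classes:
        # near zero means the model is giving essentially one constant answer.
        spreads.append(probs.std(dim=0).mean().item())
    return list(zip(noise_scales, spreads))

# If the constant-policy picture is right, the spread should shrink as sigma
# grows, even though the inputs themselves are becoming more varied.
```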
I wonder if the result is dependent on the type of OOD.
If you are OOD because there is less extractable information, then the results are intuitive.
If you are OOD because the extractable information is extreme or misleading, then the results are unexpected.
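To make that distinction concrete, here is one hypothetical way to construct the two regimes for an image classifier; both constructions are illustrative guesses on my part, not the paper’s experimental setup.

```python
import torch

def less_information_ood(x, sigma=3.0):
    """Drown the signal in noise: features become hard to extract at all."""
    return x + sigma * torch.randn_like(x)

def extreme_information_ood(x, gain=10.0):
    """Exaggerate existing features far past the training range."""
    return gain * (x - x.mean(dim=(-2, -1), keepdim=True))

# The open question is whether reversion to the OCS shows up equally in both:
# the first regime removes usable features, the second presents over-strong ones.
```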
Oh, I just read their Appendix A, “Instances Where ‘Reversion to the OCS’ Does Not Hold.”
Outputting the average prediction is indeed not the only OOD behavior. It seems there are different types of OOD regimes.