What is the breakthrough? It was discovered that neural networks tend to produce “average” predictions on OOD (out-of-distribution) inputs.
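As a concrete illustration of that claim, here is a minimal sketch (PyTorch assumed; `model`, `train_loader`, and the input shape are hypothetical placeholders, not the paper’s code) of how one might measure “reversion to the OCS”: compare the classifier’s predictions on OOD inputs against the optimal constant solution, which for cross-entropy loss is just the marginal label distribution of the training set.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def marginal_label_distribution(train_loader, num_classes):
    """Empirical class frequencies: the optimal constant solution (OCS) for cross-entropy."""
    counts = torch.zeros(num_classes)
    for _, y in train_loader:
        counts += torch.bincount(y, minlength=num_classes).float()
    return counts / counts.sum()

@torch.no_grad()
def mean_kl_to_ocs(model, inputs, ocs, eps=1e-12):
    """Average KL(model prediction || OCS) over a batch; smaller means closer to the OCS."""
    probs = F.softmax(model(inputs), dim=-1)
    kl = (probs * (probs.clamp_min(eps).log() - ocs.clamp_min(eps).log())).sum(dim=-1)
    return kl.mean()

# Hypothetical usage: the claim predicts this KL gap shrinks as inputs move
# further OOD, e.g. pure Gaussian noise should land closer to the OCS than
# held-out in-distribution data does.
# ocs = marginal_label_distribution(train_loader, num_classes=10)
# print(mean_kl_to_ocs(model, torch.randn(256, 3, 32, 32), ocs))
```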
This seems like somewhat good news about Goodhart’s curse being easier to solve! (Mostly I’m thinking about how much worse it would be if this weren’t the case.)
Producing average predictions OOD is also, in some sense, evidence against goal-directed inner optimizers, though it depends on what counts as “OOD.”
Yep, that’s the big one. It suggests that NNs aren’t doing anything very weird when we subject them to OOD inputs, which is evidence against the emergence of inner optimizers, or at least against misaligned inner optimizers.
I actually now think the OOD paper is pretty weak evidence about inner optimizers.
I think the OOD paper tells you what happens as the low-level and mid-level features stop reliably/coherently firing because you’re adding in so much noise. Like, if my mental state got increasingly noised across all modalities, I think I’d probably adopt some constant policy too, because none of my coherent circuits/shards would be properly interfacing with the other parts of my brain. But I don’t think that tells you much about alignment-relevant OOD.
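For what it’s worth, the “constant policy under noise” picture is easy to probe directly. A sketch assuming a hypothetical trained `model` and an in-distribution `clean_batch` (not anything from the paper): as the noise scale grows, check whether the predictions stop varying across the batch at all.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prediction_spread(model, clean_batch, noise_scales=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """For each noise level, how much do the model's predictions differ across the batch?"""
    spreads = []
    for sigma in noise_scales:
        noisy = clean_batch + sigma * torch.randn_like(clean_batch)
        probs = F.softmax(model(noisy), dim=-1)
        # Std of each class probability across the batch, averaged over classes:
        # near zero means the model is giving essentially one constant answer.
        spreads.append(probs.std(dim=0).mean().item())
    return list(zip(noise_scales, spreads))

# If the constant-policy picture is right, the spread should shrink as sigma
# grows, even though the inputs themselves are becoming more varied.
```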
I wonder if the result is dependent on the type of OOD.
If you are OOD because there is less extractable information, then the results are intuitive.
If you are OOD because the extractable information is extreme or misleading, then the results are unexpected.
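To make that distinction concrete, here is one hypothetical way to construct the two regimes for an image classifier; both constructions are illustrative guesses on my part, not the paper’s experimental setup.

```python
import torch

def less_information_ood(x, sigma=3.0):
    """Drown the signal in noise: features become hard to extract at all."""
    return x + sigma * torch.randn_like(x)

def extreme_information_ood(x, gain=10.0):
    """Exaggerate existing features far past the training range."""
    return gain * (x - x.mean(dim=(-2, -1), keepdim=True))

# The open question is whether reversion to the OCS shows up equally in both:
# the first regime removes usable features, the second presents over-strong ones.
```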
Oh, I just read their Appendix A, “Instances Where ‘Reversion to the OCS’ Does Not Hold.”
Outputting the average prediction is indeed not the only OOD behavior. It seems there are different types of OOD regimes.