If I’d tried to assume that models are “trying” to get a low value for the training loss, I might have ended up relying on our ability to incentivize the model to make very long-term predictions. But I think that approach is basically a dead end.
Why do you believe that this is a dead end?