Do any of these papers from the last year change your view on interp's impact for these theories?
1. Understanding misalignment (at least some initial insights): https://arxiv.org/html/2502.17424v2
2. Better prediction of future systems (interp for scaling): https://arxiv.org/abs/2303.13506
3. Auditing to reveal hidden objectives: https://www.anthropic.com/research/auditing-hidden-objectives