Tapatakt comments on Reward Hacking from a Causal Perspective

Tapatakt 30 Apr 2024 13:14 UTC
1 point
"How can we combine behavioural experiments with mechanistic interpretability to infer an agent’s subjective causal model? The next post will say more about this."
There is no next post. Can I read about it somewhere anyway?