I've only read the LW version, not the paper, but this seems like important work to me and I'm glad you're doing it! What did you make of these two recent papers?
I have done some work on the policy side of this (whether we should, and how we could, enforce CoT monitorability on AI developers, or at least gain transparency into how monitorable SOTA models are). Lmk if it would ever be useful to talk about that; otherwise, I'll be keen to see where this line of work ends up!
I liked https://firstscattering.com/p/red-lines-for-recursive-self-improvement as a quick initial discussion of possible places to draw a line.