I think the single most important point is: Keep a paradigm with human-legible CoT. Most other points are downstream of that. If it is legible, it is possible and more likely to notice that it is not faithful and to build monitoring on top. It might be the single simple You Get About Five Words thing that might make it into regulation.
I think the single most important point is: Keep a paradigm with human-legible CoT. Most other points are downstream of that. If it is legible, it is possible and more likely to notice that it is not faithful and to build monitoring on top. It might be the single simple You Get About Five Words thing that might make it into regulation.
I just saw a method to make more parts of the model human-legible, addressing the main concern.