Sure, but the same argument would suggest that the model’s thoughts follow the same sort of reasoning seen in pretraining — that is, human-like reasoning that is presumably monitorable.