Isn’t there a KL divergence term from the base model, like is done with RLHF?
Sure, but the same argument would suggest that the model's thoughts follow the same sort of reasoning seen in pretraining, i.e., human-like reasoning that is presumably monitorable.
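For concreteness, the KL term being referred to is the standard regularizer used in RLHF-style training, which penalizes the trained policy for drifting away from the base (pretrained) model. The exact coefficient, estimator, and whether it is applied per token vary by implementation, so this is just the generic form:

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[ D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{base}}(\cdot \mid x)\big) \Big]$$

where $\pi_{\mathrm{base}}$ is the base model and $\beta$ sets how strongly the policy is anchored to it; the argument above is that this anchoring, to the extent it applies, would keep the model's reasoning close to the human-like reasoning present in pretraining.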