Sure, but the same argument would suggest that the model’s thoughts follow the same sort of reasoning seen in pretraining — that is, human-like reasoning that is presumably monitorable.