Isn’t there a KL divergence term from the base model, like is done with RLHF?
Sure, but the same argument would suggest that the model's thoughts follow the same sort of reasoning seen in pretraining, i.e., human-like reasoning that is presumably monitorable.
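For concreteness, the KL term being referred to is the standard regularizer used in RLHF-style training, which penalizes the trained policy for drifting away from the base (pretrained) model. The exact coefficient, estimator, and whether it is applied per token vary by implementation, so this is just the generic form:

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[ D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{base}}(\cdot \mid x)\big) \Big]$$

where $\pi_{\mathrm{base}}$ is the base model and $\beta$ sets how strongly the policy is anchored to it; the argument above is that this anchoring, to the extent it applies, would keep the model's reasoning close to the human-like reasoning present in pretraining.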