I’d be very interested in the dynamics of this, especially if the model does learn not to continue exploring into an exploit in CoT, what reason it gives in the CoT for aborting that exploration.
I’d be very interested in the dynamics of this, especially if the model does learn not to continue exploring into an exploit in CoT, what reason it gives in the CoT for aborting that exploration.