Learning a diverse policy ensemble as an exploration hacking counter-measure
I haven’t looked deeply at this paper (Poly-EPO: Training Exploratory Reasoning Models), but I’m generally interested in the idea of using RL ensemble diversity as a counter-measure to exploration hacking (though this would probably be too expensive and/or too finicky in practice).