Oliver Daniels comments on Oliver Daniels-Koch’s Shortform

Oliver Daniels 22 Apr 2026 4:40 UTC
Learning a diverse policy ensemble as an exploration hacking counter-measure
I haven’t looked deeply at this paper (Poly-EPO: Training Exploratory Reasoning Models), but I’m generally interested in the idea of using RL ensemble diversity as a counter-measure to exploration hacking (though this would probably be too expensive and/or too finicky in practice).