These are great suggestions. Do you have a sense (or a BOTEC) of how expensive it would be, end to end, to do a pessimization training run? Idk if the time/staff attention makes it a non-starter except for toy models.
(Of course, the question of whether it's a non-starter is downstream of internal lab political will, but knowing the cost would inform how difficult it would be for interested safety team members to run such an experiment.)
Hi Wei, apologies for the late reply. I only saw your comment after returning to this post, which I did when I saw Anthropic mention in their recent blog post that perhaps models will be sufficiently wise to not attempt RSI.
I agree that for work on good epistemics to outpace capability improvements in general, it would require 1.) changing the dynamic, 2.) significant investment in projects to curate data / feedback on good fuzzy reasoning, or 3.) using extensive scaffolding (aka ‘meta-systems’) to elicit better epistemics from models with jagged performance in these domains. As an example, I think the competition that FLF is running here might lend itself to prototypes of better meta-systems, which could ladder up to better epistemics. (I’m also excited about 2, though I don’t have ready examples of what that might look like.)