Abstract answer: Maybe it doesn’t transfer from LMs to AGI, but it advances the state of knowledge in the field in a way that makes it easier to find something that works on AGI. Maybe it doesn’t transfer to (say) a pure RL agent, but it’s easier to make a sufficiently good LM into an AGI than it looks. Maybe it does just transfer. Obviously there are also outcomes where it turns out to be useless; I’m just saying it looks positive in expectation.
Concrete answer: Adversarial examples have been with us throughout the history of neural nets, and basically the only thing we’ve really found to deal with them is “generate adversarial examples during training and train against them”, and even that doesn’t really work.
If we look at the things that let LMs do IMO problems, the really fundamental innovations (which were pre-existing, I think) are “RL on chain of thought” and “make some kind of good scaffold for the search process that lets you save partial insights instead of going fully parallel on the entire problem” and maybe “LLM as verifier”. (Disclaimer: I don’t know everything the labs did to achieve their IMO results, and plausibly there are additional techniques in there that I would consider clever.) Then on top of that, you apply a bunch of techniques that are basically just more dakka: a bigger model, higher-quality training data, RL on a bigger / higher-quality dataset of problems, more test-time compute.
I don’t expect there’s a fully reliable anti-jailbreaking technique that can be built by applying well-known existing methods with more dakka. If there is, I think I’d have to change my opinion here.
To your other question, I don’t think it necessarily solves the problem of inner (or even outer) misaligned models. It would only be partial progress on one aspect of the alignment problem. Partial progress is still progress, though.
Yeah, the thing to remember with super-Kelly strategies is that you are concentrating your expected utility into very improbable worlds where you are extremely wealthy. Which means you need to check that your total wealth in such a world is still significantly less than the total amount of money that exists. It’s no fun to go broke in all but a vanishingly small fraction of worlds in order to give yourself an astronomical number of dollars in those worlds, only for that copy of you to find out that that is not an amount of dollars a person can actually have.
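The mean-versus-median gap can be made concrete with a small calculation. The numbers below (an even-money bet won with probability 0.6, a super-Kelly stake of 0.9, 100 bets) are illustrative assumptions of mine, not from the discussion above. For an even-money bet the Kelly fraction is f* = 2p − 1; staking more than that pushes expected wealth up while the typical (median-path) outcome collapses toward zero.

```python
def kelly_fraction(p):
    # Kelly-optimal stake fraction for an even-money bet won with probability p.
    return 2 * p - 1

def expected_wealth(f, p, n):
    # Expected wealth after n bets staking fraction f each time, starting from 1:
    # E[W_n] = (p*(1+f) + (1-p)*(1-f))^n = (1 + f*(2p-1))^n
    return (1 + f * (2 * p - 1)) ** n

def typical_wealth(f, p, n):
    # Median-path wealth: roughly p*n wins and (1-p)*n losses.
    return (1 + f) ** (p * n) * (1 - f) ** ((1 - p) * n)

p, n = 0.6, 100          # illustrative: 60% win chance, 100 even-money bets
f_kelly = kelly_fraction(p)   # 0.2
f_super = 0.9                 # well above Kelly

# Super-Kelly has the higher expected wealth, but almost all of that
# expectation lives in rare all-win worlds; the typical path is ruin.
```

Running the numbers: the Kelly bettor’s typical wealth grows (roughly e^2 here), while the super-Kelly bettor’s expected wealth is enormous but their median-path wealth is around 10^-24 of their starting stake, which is the “broke in all but a tiny fraction of worlds” phenomenon.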