Mainly because it seems really hard. If we can do something that seems that hard, we probably learned something new.
There is also a mechanistic analogy. Think about what a jailbreak fundamentally is: an adversarial example. Some tuned input that results in an “incorrect” output. In terms of the overall alignment problem, why can’t we just make an AI care about people’s wellbeing by giving rewards during training? Well, the AI might be able to think of an adversarial state of the world that “feels” better to its own internal values, but doesn’t actually contain any people.
Lots of things are very hard. Making models do IMO problems is very hard, for example.
I guess there are two main questions. One is: why would we expect a method that makes LMs adversarially robust to also work on AGI? And second: even supposing the technique generalizes to AGI, why would we expect the ability to adversarially robustify a reward model to help make an inner-misaligned model pursue the right goal?
Abstract answer: Maybe it doesn’t transfer from LMs to AGI, but it advances the state of knowledge in the field in a way that makes it easier to find something that works on AGI. Maybe it doesn’t transfer to (say) a pure RL agent, but it’s easier than it looks to make a sufficiently good LM into an AGI. Maybe it does just transfer. Obviously there are also outcomes where it turns out to be useless; I’m just saying it looks positive in expectation.
Concrete answer: Adversarial examples have been with us throughout the history of neural nets, and basically the only thing we’ve really found to deal with them is “generate adversarial examples during training and train against them”, and even that doesn’t really work.
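For concreteness, here is a minimal sketch of that one known approach, FGSM-style adversarial training on an image classifier (Goodfellow et al., 2014). The `model`, `optimizer`, inputs, and `epsilon` are placeholders, and stronger variants like PGD take multiple perturbation steps; this is just the basic “generate adversarial examples during training and train against them” loop:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    # 1. Generate an adversarial example: take one gradient-sign step
    #    on the input in the direction that increases the loss (FGSM).
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = torch.clamp(x + epsilon * x.grad.sign(), 0.0, 1.0).detach()

    # 2. Train against it: an ordinary gradient step, but on the
    #    adversarial input instead of the clean one.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```

And the empirical result is roughly what the quote says: models trained this way are more robust to the attacks they were trained against, but slightly stronger or different attacks still break them.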
If we look at the things that let LMs do IMO problems, the really fundamental innovations (which were pre-existing, I think) are “RL on chain of thought”, “make some kind of good scaffold for the search process that lets you save partial insights instead of going fully parallel on the entire problem”, and maybe “LLM as verifier”. (Disclaimer: I don’t know everything the labs did to achieve their IMO results, and plausibly there are additional techniques in there that I would consider clever.) Then on top of that, you apply a bunch of techniques that are basically just more dakka: a bigger model, higher-quality training data, RL on a bigger/higher-quality dataset of problems, and more test-time compute.
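To make the “scaffold that saves partial insights” point concrete, here is a toy sketch of the shape I have in mind. This is my guess at the general structure, not any lab’s actual pipeline; `propose_step`, `verify_step`, and the completion check are hypothetical stand-ins for LLM calls:

```python
def solve(problem, propose_step, verify_step, max_attempts=100):
    """Toy search scaffold: bank verified partial progress instead of
    restarting from scratch on every attempt."""
    verified = []  # saved partial insights, reused across attempts
    for _ in range(max_attempts):
        step = propose_step(problem, verified)  # condition on prior progress
        if not verify_step(problem, step):      # LLM-as-verifier filters bad steps
            continue                            # discard the step, keep `verified`
        verified.append(step)                   # save the partial insight
        if step.endswith("QED"):                # hypothetical completion marker
            return verified                     # full verified solution path
    return None
```

The point of the design is that a failed attempt only costs one proposed step, while every verified step is kept, so the search accumulates progress rather than running many fully independent attempts in parallel.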
I don’t expect there’s a fully reliable anti-jailbreaking technique that can be built by applying well-known existing methods with more dakka. If there is, I think I’d have to change my opinion here.
To your other question, I don’t think it necessarily solves the problem of inner (or even outer) misaligned models. It would only be partial progress on one aspect of the alignment problem. Partial progress is still progress, though.