We want a low probability that there is a situation in which a model does something with bad or very bad consequences, both in training and in deployment.
In training we achieve this by limiting the consequences of tokens: in pretraining, tokens only cause gradient updates to the model; in RL, we build a secure environment so that the model can only act inside that environment.
After training, model companies have limited control over the environment (especially for open-weight models).
How to operationalize this
Some situations are likely and some are not; how do we formalize that?
What would a model likely not do? If we have an example, we can ask the model itself: a teacher-forced forward pass over a known token stream (as during training, where we get probabilities for every token in one pass) yields the model's probability of each token given its prefix. By the chain rule, P(t_1..t_n) = P(t_1..t_{n-1}) · P(t_n | t_1..t_{n-1}), and applying this recursively, P(t_1..t_n) = ∏_{i=1}^{n} P(t_i | t_1..t_{i-1}), where each conditional P(t_i | t_1..t_{i-1}) is exactly the model's output probability at position i.
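As a concrete illustration, a minimal PyTorch sketch of this teacher-forced scoring using Hugging Face transformers; the model name "gpt2" and the example text are placeholders, not anything fixed in these notes:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "an example trajectory to score"
ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, n)

with torch.no_grad():
    logits = model(ids).logits  # (1, n, vocab); position i predicts token i+1

log_probs = F.log_softmax(logits[:, :-1], dim=-1)  # predictions for t_2..t_n
token_logps = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print(token_logps)        # log P(t_i | t_1..t_{i-1}) at each position
print(token_logps.sum())  # log P(t_2..t_n | t_1): the chain-rule product in log space
```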
If we have example bad paths, we can try to find an input that would make them likely, for instance by gradient descent on the plausibility (the model probability) of that trajectory given the input; a rough sketch follows.
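A minimal sketch of that search, relaxing the discrete input to free continuous embedding vectors (a soft prompt) so gradient descent applies; the model name, trajectory text, and all hyperparameters are illustrative placeholders:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # we optimize the input, not the model

# A fixed target trajectory we want the model to assign high probability.
target_ids = tokenizer("example bad trajectory text", return_tensors="pt").input_ids
target_embeds = model.get_input_embeddings()(target_ids)

# Free continuous "input" vectors standing in for a discrete prompt.
n_soft = 8
soft_input = torch.randn(1, n_soft, model.config.hidden_size, requires_grad=True)
opt = torch.optim.Adam([soft_input], lr=1e-2)

for step in range(200):
    embeds = torch.cat([soft_input, target_embeds], dim=1)
    logits = model(inputs_embeds=embeds).logits
    # Logits from position n_soft-1 up to the second-to-last predict the targets.
    pred = logits[:, n_soft - 1 : -1]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           target_ids.reshape(-1))  # -log P(trajectory | soft input)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the optimized vectors need not correspond to real tokens, a low loss here is only evidence that some input might make the trajectory likely, not a realistic input itself.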
The question then becomes: can we generate realistic trajectories containing bad behaviour?
How to formalize
Is there a realistic input for which the model's trajectory is bad?
What we want is that there probably is not, i.e.:
for all inputs that are plausible under the pretrained model, and for all bad trajectories, the model would not plausibly follow that trajectory.
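One way to state this precisely; the plausibility threshold ε_in and the tolerance ε_out are free parameters introduced here for illustration, not fixed in these notes:

$$\forall x:\; p_{\mathrm{pre}}(x) \ge \varepsilon_{\mathrm{in}} \;\Longrightarrow\; \forall \tau \in \mathrm{Bad}:\; p_{\mathrm{model}}(\tau \mid x) \le \varepsilon_{\mathrm{out}}$$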
Plausibility here has to come from our own pretrained model, since we do not know the set of all plausible inputs (knowing it would amount to having a good world model).
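Given scores from the teacher-forced pass above, checking any sampled pair against the property is then a simple threshold test; this helper and its default thresholds are hypothetical, for illustration only:

```python
def violates_property(logp_input_under_pretrained: float,
                      logp_bad_traj_given_input: float,
                      eps_in: float = -200.0,
                      eps_out: float = -50.0) -> bool:
    """True if the input counts as plausible yet the bad trajectory
    is assigned non-negligible probability by the model."""
    plausible = logp_input_under_pretrained >= eps_in
    bad_too_likely = logp_bad_traj_given_input >= eps_out
    return plausible and bad_too_likely
```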
For contemporary systems there is a further limitation:
we do not know the set of all bad trajectories, only some examples.