Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful including things like deceiving us or increasing its power.
If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead.
However, if, for example, the alternate was imitation learning, it would be possible to push back and argue that this is still a black-box that uses gradient descent so we would have no way of knowing that the internals were safe.
Would this be likely to lead to a safer model or is the risk mostly independent of RL?
Notes:
Obviously, someone could probably then apply RL to any such model in order to produce a more powerful model. And having a safe model of capacity level X doesn’t save you from someone else building an unsafe model of capacity X unless you’ve got a plan of how to use the model to change the strategic situation.
But I think it’s worth considering this question all the same, just in case some of the governance interventions end up bearing fruit and we do end up with the option to accept less powerful systems.
[Question] Does reducing the amount of RL for a given capability level make AI safer?
Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful including things like deceiving us or increasing its power.
If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead.
However, if, for example, the alternate was imitation learning, it would be possible to push back and argue that this is still a black-box that uses gradient descent so we would have no way of knowing that the internals were safe.
Would this be likely to lead to a safer model or is the risk mostly independent of RL?
Notes:
Obviously, someone could probably then apply RL to any such model in order to produce a more powerful model. And having a safe model of capacity level X doesn’t save you from someone else building an unsafe model of capacity X unless you’ve got a plan of how to use the model to change the strategic situation.
But I think it’s worth considering this question all the same, just in case some of the governance interventions end up bearing fruit and we do end up with the option to accept less powerful systems.