This sounds like interesting work, and it might be useful for safety.
Suppose you design an agent harness so that a steering vector for caution and safety is applied when it's time to make decisions, and no vector (or some other one) is applied when it's time to carefully work out the details of what is possible. If you get that right, you get safety with no capability loss.
Of course, it seems pretty likely that if you did this sloppily, the unsteered thinking could take hold of the whole system if it in any sense wanted to.
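A minimal sketch of the phase-dependent steering idea, under heavy assumptions: the hidden state is a toy vector, the "caution" direction and its scale are made up, and the phase labels are illustrative stand-ins for whatever the harness uses to distinguish decision-making from open-ended planning. In a real system this would hook into a transformer layer's activations rather than a bare array.

```python
import numpy as np

# Assumed "caution" direction in a toy 3-d activation space (illustrative only).
CAUTION_VECTOR = np.array([0.0, 1.0, 0.0])
STEERING_SCALE = 4.0  # arbitrary strength for the sketch

def apply_phase_steering(hidden: np.ndarray, phase: str) -> np.ndarray:
    """Add the caution vector only during the decision phase.

    During planning ("work out what is possible"), the activations
    pass through unsteered, preserving capability.
    """
    if phase == "decide":
        return hidden + STEERING_SCALE * CAUTION_VECTOR
    return hidden

hidden = np.array([1.0, 0.0, 0.0])
steered = apply_phase_steering(hidden, "decide")    # shifted along caution direction
unsteered = apply_phase_steering(hidden, "plan")    # unchanged
```

The failure mode in the next sentence corresponds to the `phase == "plan"` branch here: whatever runs unsteered still produces the outputs the rest of the system consumes, so a sloppy harness gives the unsteered computation plenty of influence.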