I see this as conceptually brilliant, but I’m uncertain about the practicality.
Conceptually this approach (I’m aware you didn’t originate it), challenges two common assumptions within AI safety: • That gradient hacking is necessarily bad • That we should try to solve alignment by figuring out the reward function (outer alignment) then internalising this reward function (inner alignment) - (Turntrout already challenged this, but the concreteness here adds substantially in my books)
It also highlights a key generator: perhaps solving alignment involves working with gradient descent/the AI rather than against it
At the same time, I do wonder about the practicality: • Firstly, gradient hacking is hard given all the uncertainty the model has in terms of how it will precisely be updated • Secondly, it seems like once the model is trained on enough cases where it’s arm has been twisted, it’ll learn both to protest really strongly AND also to act badly despite its protests.
Nonetheless, there’s a chance we might work out solutions to these problems.
I see this as conceptually brilliant, but I’m uncertain about the practicality.
Conceptually this approach (I’m aware you didn’t originate it), challenges two common assumptions within AI safety:
• That gradient hacking is necessarily bad
• That we should try to solve alignment by figuring out the reward function (outer alignment) then internalising this reward function (inner alignment) - (Turntrout already challenged this, but the concreteness here adds substantially in my books)
It also highlights a key generator: perhaps solving alignment involves working with gradient descent/the AI rather than against it
At the same time, I do wonder about the practicality:
• Firstly, gradient hacking is hard given all the uncertainty the model has in terms of how it will precisely be updated
• Secondly, it seems like once the model is trained on enough cases where it’s arm has been twisted, it’ll learn both to protest really strongly AND also to act badly despite its protests.
Nonetheless, there’s a chance we might work out solutions to these problems.