Nice post!

Your trader analogy made me think: ideally, you'd want a training period with lots of honeypots and surveillance, so that the trader learns that crime doesn't pay.
This suggests some level-3 ideas I didn’t see in your post:
- In addition to improving the reward model, you could also alter the training environment by inserting honeypots (see the first sketch after this list).
- On the surveillance side, it may be more effective to concentrate your compute on catching sneaky behavior. So, say, after every N RL tasks, you use a weak filter to identify the k most suspicious rollouts, and then spend a lot of compute running an "auditing" process that checks them for reward hacking (see the second sketch after this list).
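To make the honeypot idea concrete, here's a minimal Python sketch of what I have in mind. Every name in it (`sample_task`, `run_episode`, `training_reward`) and every constant is a made-up placeholder, not a real training API:

```python
import random

# All names and values here are hypothetical placeholders.
HONEYPOT_RATE = 0.05  # assumed fraction of training tasks that are honeypots

def sample_task():
    """Mix honeypots into the ordinary RL task distribution."""
    if random.random() < HONEYPOT_RATE:
        # A honeypot offers a tempting but detectable-by-construction
        # exploit, e.g. a world-readable answer key the policy shouldn't read.
        return {"kind": "honeypot"}
    return {"kind": "normal"}

def run_episode(task):
    """Stand-in for the real environment. On a honeypot we know exactly
    whether the policy took the bait, so detection there is perfect."""
    took_bait = task["kind"] == "honeypot" and random.random() < 0.1
    return {"raw_reward": 1.0, "took_bait": took_bait}

def training_reward(task, outcome):
    """'Crime doesn't pay': taking the bait earns a strongly negative
    reward instead of whatever the exploit would have paid out."""
    if outcome["took_bait"]:
        return -10.0
    return outcome["raw_reward"]
```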
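And here's a minimal sketch of the concentrated-auditing loop from the second idea, again with entirely hypothetical names (`weak_suspicion_score`, `expensive_audit`) and made-up values for N and k:

```python
import heapq
import random

# All names and values here are hypothetical placeholders.
N = 1000  # audit cadence: run an audit pass every N rollouts
K = 10    # audit budget: rollouts inspected per pass

def weak_suspicion_score(rollout):
    """Cheap filter, e.g. a small classifier over the transcript."""
    return random.random()  # placeholder score

def expensive_audit(rollout):
    """Compute-heavy check for reward hacking: many judge queries,
    interpretability probes, human review, etc. True = hacking found."""
    return random.random() < 0.01  # placeholder verdict

def flag_and_penalize(rollout):
    """Catch and correct the behavior as it's being learned: flip the
    reward negative and keep the episode around for review."""
    rollout["reward"] = -1.0

buffer = []
for step in range(1, 5 * N + 1):
    rollout = {"id": step, "reward": 1.0}  # stand-in for a real rollout
    buffer.append(rollout)
    if step % N == 0:
        # Concentrate compute: the cheap filter ranks the whole buffer,
        # and only the K most suspicious rollouts see the expensive auditor.
        for r in heapq.nlargest(K, buffer, key=weak_suspicion_score):
            if expensive_audit(r):
                flag_and_penalize(r)
        buffer.clear()
```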
These are basically the same as Redwood's control ideas, but they have added benefits in a training context, since you can catch and correct bad behavior as it's being learned, ideally before the AI gets very good at it.
The quote below felt potentially related, but I’m not sure I understood it.
> adversarially hardened models where the reward model plays an adversarial zero-sum game with a red-teaming model
Could you explain how this works?