Changing selection pressures to align with intended behaviors: This might involve making training objectives more robust, iterating against held-out evaluation signals, or trying to overwrite the AI’s motivations at the end of training with high-quality, aligned training data.
Is increasing the intelligence of the reward models another broad direction for this? A fun hypothetical I’ve heard is to imagine replacing the reward model used during training with Redwood Research (the organization).
So we imagine that during training, whenever an RL reward must be given or withheld, a virtual copy of all Redwood Research employees is spun up on the training servers and has one subjective week to decide whether and how to give the RL reward, with their existing knowledge, access to the transcript of the model completing the task, access to data from previous such decisions, the ability to use whatever internals-based probes are available, etc.
My instinct is this would help a lot.
To the extent trusted or controlled AI systems can approximate this thought experiment (or do even better!), that seems great for safety and we should use them to do so. People use the term “automated alignment research”, but this sort of thing seems like an example of automated safety work that isn’t “research”.
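As a toy illustration of what approximating this with trusted models could look like, here is a minimal Python sketch. Everything in it is assumed for illustration: the StubEvaluator class, the judge() interface, and the majority-vote aggregation are stand-ins, not a description of how any lab actually assigns rewards.

```python
# Hypothetical sketch only: an RL reward decision made by an expensive
# "committee" of trusted/controlled evaluators instead of a cheap reward model.
# StubEvaluator, judge(), and the vote aggregation are all invented for illustration.
from dataclasses import dataclass


@dataclass
class RewardDecision:
    give_reward: bool
    rationale: str


class StubEvaluator:
    """Placeholder for a trusted/controlled model; a real evaluator would reason at length."""

    def judge(self, transcript, prior_decisions, probe_readings) -> RewardDecision:
        # A real evaluator would inspect the transcript, past decisions, and probe readings here.
        return RewardDecision(give_reward=True, rationale="stub approval")


def committee_reward(transcript, evaluators, prior_decisions, probe_readings) -> float:
    """Spend a large, fixed budget of evaluator effort on a single reward decision."""
    votes = [
        ev.judge(transcript, prior_decisions, probe_readings).give_reward
        for ev in evaluators
    ]
    # Simple aggregation: give the reward only if a clear majority approves.
    return 1.0 if sum(votes) > len(votes) / 2 else 0.0


# In a training loop, this expensive call would replace `reward_model(transcript)`:
reward = committee_reward("model transcript...", [StubEvaluator() for _ in range(5)], [], {})
```

The point of the sketch is just the shape of the substitution: each reward decision gets a large amount of evaluator effort, whatever form that effort takes.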
More speculative thoughts:
Perhaps we end up in a regime where this is possible, but expensive enough that we can only afford a bit of it.
If that’s the case, the more sample-efficient the RL is, the better off we are — since as the total number of RL rewards we need decreases, we can afford to invest more in the quality of each one.
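A back-of-the-envelope example of this trade-off, with made-up numbers (the total budget and decision counts below are purely illustrative):

```python
# Illustrative only: a fixed total oversight budget spread over varying numbers of
# RL reward decisions. Fewer decisions means more scrutiny affordable per decision.
TOTAL_OVERSIGHT_BUDGET_HOURS = 100_000  # assumed total evaluator-hours available

for num_reward_decisions in (1_000_000, 100_000, 10_000):
    hours_per_decision = TOTAL_OVERSIGHT_BUDGET_HOURS / num_reward_decisions
    print(f"{num_reward_decisions:>9,} reward decisions -> "
          f"{hours_per_decision:5.1f} evaluator-hours each")
```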
Even more speculative thoughts:
As an extreme example, you could imagine an AI that’s as sample-efficient as a human child. Suppose an AI company had a virtual copy of von Neumann’s brain at age 5, which they could run at 10x speed, and had to raise him to become an intent-aligned adult.
The resources at their disposal would in principle enable a lot of oversight and thoughtful parenting and teaching, so they could plausibly do pretty well at this task for a while. It’s like a school with a 100:1 teacher-to-student ratio. They might fail in practice, of course.
Once his outputs become sufficiently hard to check, they might have a lot more difficulty, e.g. maybe it’s hard to tell whether his advice on the Manhattan Project is subtly wrong in a way designed to sabotage the project.