I think “automation of RLVR task creation during model training” is much the same as “self play and autocurricula” here. I agree with calling this “prosaic RSI” because it could lead to something that is superhuman on a variety of tasks, even though it doesn’t improve (the generality of) the underlying learning algorithm (like architecture, optimizer or objective function).
Automating RLVR so fully that it could be used post-deployment would be much harder, and sample-efficiency issues would matter more in that setting.
Yes, and another problem with continual (test time) learning is that it likely doesn’t mesh well with the centralized nature of cloud-based AI services where millions of people use the same weights.
Training that would normally want to use full weights usually works almost as well with LoRA when you don’t need too many training steps. As with KV-cache, you can maintain separate LoRA weights for each request/user. It’s the next level of difficulty to get task-specific training to work over so many steps that LoRA (as opposed to full weights) starts becoming a problem. Recurrent state could also probably play this role, and doesn’t need to be nearly as large as the main model, the same as with KV-cache and LoRA weights.
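To make the KV-cache analogy concrete, here's a minimal sketch (all names and shapes are hypothetical, not from any particular serving stack) of keeping a small low-rank LoRA delta per user alongside one frozen set of shared base weights, so per-user state costs 2·R·D floats instead of a full D×D weight copy:

```python
import numpy as np

D, R = 64, 4  # hidden size and LoRA rank (rank << hidden size)
rng = np.random.default_rng(0)
W_base = rng.standard_normal((D, D)) / np.sqrt(D)  # frozen, shared by all users

class UserAdapters:
    """Per-user LoRA state: two small matrices, kept per user like a KV-cache."""
    def __init__(self):
        self.store = {}

    def get(self, user_id):
        if user_id not in self.store:
            # A is zero-initialized, so a fresh user starts at the base model.
            A = np.zeros((R, D))
            B = rng.standard_normal((D, R)) / np.sqrt(D)
            self.store[user_id] = (A, B)
        return self.store[user_id]

    def forward(self, user_id, x):
        A, B = self.get(user_id)
        # Effective weight is W_base + B @ A, never materialized explicitly.
        return W_base @ x + B @ (A @ x)

adapters = UserAdapters()
x = rng.standard_normal(D)

# A brand-new user's output matches the shared base model exactly.
assert np.allclose(adapters.forward("alice", x), W_base @ x)

# Per-user memory: 2*R*D floats vs D*D for a full per-user weight copy.
print(2 * R * D, "vs", D * D)
```

Task-specific updates would then train only each user's (A, B) pair; the open question in the comment above is how many such steps you can take before the low-rank restriction starts to bite.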