if the AIs learn the narrow skills such as setting up appropriate RL environments
My impression is that, while “set up an appropriate RL environment, given that you have deep domain expertise” is a narrow skill, you can’t actually escape the part where you need to understand the particulars of the domain (unless your domain is one of the few where a simple non-lossy success metric exists). Whether you can build a curriculum to teach a domain to an AI if neither you nor the AI have significant experience in that domain is a major crux for me.
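Concretely, here is a rough sketch of the distinction I have in mind (all function names and proxies here are illustrative, not from any real training stack):

```python
import subprocess
import tempfile

def reward_unit_tests(candidate_code: str, tests: str) -> float:
    """A domain with a simple, non-lossy success metric: the candidate
    either passes the test suite or it doesn't, and no domain judgment
    is needed to score an episode."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + tests)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=30)
    return 1.0 if result.returncode == 0 else 0.0

def reward_incident_writeup(writeup: str) -> float:
    """A domain without such a metric: judging whether a postmortem
    identifies the real root cause requires understanding the system.
    Cheap proxies like the ones below are lossy and easy to game, and
    replacing them with something better is itself domain work."""
    proxy = 0.0
    if "root cause" in writeup.lower():
        proxy += 0.5                          # keyword proxy
    proxy += min(len(writeup) / 4000, 0.5)    # length proxy
    return proxy
```

In the first case an AI with no domain background can stand up the environment; in the second, building a reward that isn't trivially gameable is itself the domain expertise.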
(capturing lessons/puzzles from personal experiences of AI instances)
I am very much not convinced that this is a single learnable cross-domain skill, rather than something where the skill of figuring out which lessons are the correct ones varies domain by domain.
and debugging of training issues
There’s a form of this that seems generalizable (e.g. noticing training stability issues, knowing which loss schedule / hparams to use, avoiding known footguns in the common RL frameworks) and a form that doesn’t (e.g. noticing when the student model has learned a skill that scores well due to quirks of your RL environment which won’t transfer to the real world).
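A sketch of that split, with made-up helper names (the generalizable checks need no domain knowledge; the last one does):

```python
def generic_training_checks(loss_history: list[float], grad_norms: list[float]) -> list[str]:
    """Checks that transfer across domains: they need no knowledge of
    what the student model is actually being trained to do."""
    issues = []
    if any(g > 100.0 for g in grad_norms):
        issues.append("exploding gradients: clip grads or lower the learning rate")
    if len(loss_history) > 10 and loss_history[-1] > 1.5 * min(loss_history):
        issues.append("loss has diverged well above its best value: possible instability")
    return issues

def transfer_gap(reward_in_training_env: float, score_on_held_out_tasks: float) -> float:
    """The check that doesn't transfer: a student can score well because of
    quirks of the RL environment (leaked answers, gameable graders, an overly
    narrow task distribution). The symptom is a large gap between in-environment
    reward and performance on independently constructed held-out tasks, but
    constructing good held-out tasks again requires domain expertise."""
    return reward_in_training_env - score_on_held_out_tasks
```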
But it would still be much more effective than when it doesn’t happen at all, and AI labor scales well.
Yes. But the bit where recursive self-improvement is a major differentiating factor, because an AI can learn a non-transferable “get better at the general skill of teaching yourself things” skill, seems to me to be on shakier ground.
I think the schleppy path of “learn skills by intentionally training on those specific skills” will be the main way AIs get better in the next few years. Which has many implications, particularly something like “frontier models can get good at any skill you care to name, but they can’t get good at every skill”.
When you go through a textbook, there are confusions you can notice but not yet immediately resolve, and these could plausibly become RLVR tasks. To choose and formulate some puzzle as an RLVR task, the AI would need to already understand the context of that puzzle, but then training on that task makes it ready to understand more. Setting priorities for learning seems like a general skill that adapts to various situations as you learn to understand them better. As with human learning, the ordering from more familiar lessons to deeper expertise would happen naturally for AI instances as they engage in active learning about their situations.
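One hypothetical shape this could take, to make the asymmetry concrete (everything here is illustrative, including the toy checker):

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class RLVRTask:
    prompt: str                    # the puzzle, posed with current understanding
    verify: Callable[[str], bool]  # a cheap, mechanical check of a proposed answer

def task_from_confusion() -> RLVRTask:
    """Reading a linear algebra text, the learner notices it cannot yet
    justify a stated fact. Posing the task only requires recognizing the
    confusion; the verifier is a numerical spot check, not a proof checker."""
    def verify(answer: str) -> bool:
        a = np.random.default_rng(0).standard_normal((5, 3))
        true_rank = np.linalg.matrix_rank(a @ a.T)
        return str(true_rank) in answer  # toy check; a real verifier would be stricter

    return RLVRTask(
        prompt="A has shape (5, 3) with generic real entries. What is the rank of A @ A.T, and why?",
        verify=verify,
    )
```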
I think the schleppy path of “learn skills by intentionally training on those specific skills” will be the main way AIs get better in the next few years.
So my point is that automating just this thing might be sufficient, and the perception of its schleppiness is exactly the claim of its generalizability. You need expertise sufficient to choose and formulate the puzzles, not yet sufficient to solve them, and this generation-verification gap keeps moving the frontier of understanding forward, step by step, but potentially indefinitely.
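As a sketch of the loop being described (the model API here is hypothetical; the point is which steps need only current understanding):

```python
def self_curriculum_step(model, domain_materials):
    # Generation side: noticing and formulating puzzles needs only the
    # model's *current* understanding of the material.
    tasks = []
    for chunk in domain_materials:
        confusion = model.notice_confusion(chunk)          # hypothetical API
        if confusion is not None:
            tasks.append(model.formulate_task(confusion))  # must yield a checkable task

    # Verification side: RLVR on the tasks the model could pose but not yet solve.
    for task in tasks:
        attempts = [model.attempt(task.prompt) for _ in range(8)]
        rewards = [1.0 if task.verify(a) else 0.0 for a in attempts]
        model.reinforce(attempts, rewards)                 # e.g. a GRPO/PPO-style update

    return model  # now positioned to pose (and eventually solve) deeper puzzles
```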
You need expertise sufficient to choose and formulate the puzzles, not yet sufficient to solve them, and this generation-verification gap keeps moving the frontier of understanding forward, step by step, but potentially indefinitely.
Seems plausible. I note that:

- That world is bottlenecked on the compute resources you can pour into training, particularly if AIs remain much less sample-efficient than humans when learning new tasks.
- Training up the first AI on a skill by doing the generation-verification-gap shuffle is much more expensive than training up later AIs once you can cheaply run inference on an AI that already has the skill, and training a later AI to delegate to one specialized in that skill might be cheaper still (a toy cost comparison is sketched below).
- This world still sees an explosion of recursively improving AI capabilities, but those capability gains are not localized to a single AI agent.
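To illustrate the second point, a deliberately toy back-of-envelope comparison; every number below is an assumption, not a measurement:

```python
# Toy numbers for a hypothetical ~100B-parameter model; nothing here is measured.
SAMPLES_TO_LEARN_BY_RL = 100_000   # rollouts the first learner needs to acquire the skill
SAMPLES_TO_DISTILL     = 10_000    # demonstrations once a teacher with the skill exists
TOKENS_PER_SAMPLE      = 2_000
FLOPS_PER_TOKEN_TRAIN  = 6e11      # ~6 * params heuristic
FLOPS_PER_TOKEN_INFER  = 2e11      # ~2 * params heuristic

first_learner = SAMPLES_TO_LEARN_BY_RL * TOKENS_PER_SAMPLE * FLOPS_PER_TOKEN_TRAIN
# Distillation pays teacher inference plus student training, on far fewer samples.
distillation  = SAMPLES_TO_DISTILL * TOKENS_PER_SAMPLE * (FLOPS_PER_TOKEN_INFER + FLOPS_PER_TOKEN_TRAIN)

print(f"first learner (RL from scratch): {first_learner:.1e} FLOPs")
print(f"later learner (distillation):    {distillation:.1e} FLOPs")
print("delegation: no extra training; pay specialist inference per call instead")
```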