Great post, thanks for sharing. Here’s my core concern about LeCun’s worldview, then two other thoughts:
The intrinsic cost module (IC) is where the basic behavioral nature of the agent is defined. It is where basic behaviors can be indirectly specified. For a robot, these terms would include obvious proprioceptive measurements corresponding to “pain”, “hunger”, and “instinctive fears”, measuring such things as external force overloads, dangerous electrical, chemical, or thermal environments, excessive power consumption, low levels of energy reserves in the power source, etc.
They may also include basic drives to help the agent learn basic skills or accomplish its missions. For example, a legged robot may comprise an intrinsic cost to drive it to stand up and walk. This may also include social drives such as seeking the company of humans, finding interactions with humans and praises from them rewarding, and finding their pain unpleasant (akin to empathy in social animals). Other intrinsic behavioral drives, such as curiosity, or taking actions that have an observable impact, may be included to maximize the diversity of situations with which the world model is trained (Gottlieb et al., 2013)
The IC can be seen as playing a role similar to that of the amygdala in the mammalian brain and similar structures in other vertebrates. To prevent a kind of behavioral collapse or an uncontrolled drift towards bad behaviors, the IC must be immutable and not subject to learning (nor to external modifications).
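As a rough illustration of what a fixed, hand-specified intrinsic cost could look like, here is a minimal Python sketch. It is not from the paper: the class name `IntrinsicCost`, the term names, and the weights are all hypothetical, and the read-only mapping is only one assumed way to model "immutable and not subject to learning".

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping

# Hypothetical sensor readings normalized to [0, 1]; names are illustrative only.
State = Mapping[str, float]


@dataclass(frozen=True)
class IntrinsicCost:
    """A fixed, non-learned cost: a weighted sum of hand-specified terms."""

    weights: Mapping[str, float]

    def __call__(self, state: State) -> float:
        # Terms missing from the state contribute zero cost.
        return sum(w * state.get(name, 0.0) for name, w in self.weights.items())


def make_intrinsic_cost() -> IntrinsicCost:
    # Weights are chosen once by the designer (assumed values) and wrapped
    # in a read-only view, so nothing downstream can modify them.
    weights = MappingProxyType({
        "force_overload": 5.0,
        "low_battery": 2.0,
        "human_distress": 10.0,
    })
    return IntrinsicCost(weights)


ic = make_intrinsic_cost()
example_cost = ic({"force_overload": 0.2, "low_battery": 0.5})
```

Even in this toy form, everything the agent ends up "wanting" is determined by which terms the designer thought to include and how they were weighted.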
The intrinsic cost passage quoted above is the paper’s treatment of the outer alignment problem. It says models should have basic drives and behaviors that are specified directly by humans and not trained. The paper doesn’t mention the challenges of reward specification or the potential for learning human preferences. It doesn’t discuss our normative systems or even the kinds of abstractions that humans care about. I don’t understand why he doesn’t see the challenges with specifying human values.
Most of the paper instead focuses on the challenges of building accurate, multimodal predictive world models. That work seems entirely necessary to continue advancing AI, but the primary focus on predictive capabilities, combined with the downplaying of the challenges in learning human values, worries me.
If anybody has good sources on LeCun’s views on AI safety and value learning, I’d be interested.
“success of model-free RL in complex video game environments like StarCraft and Dota 2”
Do we expect model-free RL to succeed in domains where you can’t obtain incredible amounts of data through e.g. self-play? A purely predictive world model seems better able to use self-supervised predictive objective functions, and to generalize to many possible goals with a single world model. (Not to mention the potential alignment benefits of a more modular system.) Is model-free RL simply a fluke that learns heuristics by playing games against itself, or are there reasons to believe it will succeed on more important tasks?
“Since the whole architecture is trained end-to-end with gradient descent”
I don’t think this is what he meant, though I might’ve missed something. The world model could be trained with the self-supervised objective functions of language and vision models, as well as perhaps large labeled datasets and games via self-play. The actor, on the other hand, must learn to adapt to many different tasks very quickly, but could potentially use few-shot learning or fine-tuning to that end. The more natural architecture would seem to be modules that treat each other as black boxes and can be swapped out relatively easily.
There’s a conversation LeCun had with Stuart Russell and a few others in a Facebook comment thread back in 2019, arguing about instrumental convergence.
The full conversation is a bit long and difficult to skim. I haven’t finished reading it myself, but in it LeCun links to an article he co-authored for Scientific American which argues that x-risk from AI misalignment isn’t something people should worry about. (He’s more concerned about misuse risks.) Here’s a quote from it:
“We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance. [...] But intelligence per se does not generate the drive for domination, any more than horns do.”
My read of LeCun in that conversation is that he doesn’t think in terms of outer alignment / value alignment at all, but rather in terms of implementing a series of “safeguards” that allow humans to recover if the AI behaves poorly (see Steven Byrnes’ summary).
I think this paper helps clarify why he believes this: he had something like this architecture in mind, and so outer alignment seemed basically impossible. Independently, he believes it’s unnecessary because the obvious safeguards will prove sufficient.
Ah, you’re right, the paper never directly says the architecture is trained end-to-end. I’ve updated the post, thanks for the catch.
He might still mean something closer to end-to-end learning, because:
The world model is differentiable w.r.t. the cost (Figure 2), suggesting it isn’t trained purely using self-supervised learning.
The configurator needs to learn to modulate the world model, the cost, and the actor; it seems unlikely that this can be done well if these are all swappable black boxes. So there is likely some phase of co-adaptation between configurator, actor, cost, and world model.
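To make the differentiability point a bit more concrete, here is a toy Python sketch of my own (not taken from the paper). The scalar linear `world_model`, the squared-error `cost`, and all parameter names are assumptions chosen for illustration. It shows that once the cost can be differentiated through the world model, the chain rule yields a gradient for the action and also one for the world model’s parameter `w`; whether the second gradient is ever applied is the difference between planning with a frozen model and something closer to end-to-end training.

```python
def world_model(state: float, action: float, w: float) -> float:
    """Toy differentiable world model: predicts the next state."""
    return state + w * action


def cost(next_state: float, target: float) -> float:
    """Toy cost: squared distance from a target state."""
    return (next_state - target) ** 2


def cost_gradients(state: float, action: float, w: float, target: float):
    """Chain rule through the world model; returns (dC/d_action, dC/d_w)."""
    next_state = world_model(state, action, w)
    dcost_dnext = 2.0 * (next_state - target)
    grad_action = dcost_dnext * w       # d(next_state)/d(action) = w
    grad_w = dcost_dnext * action       # d(next_state)/d(w) = action
    return grad_action, grad_w


def plan_action(state: float, w: float, target: float,
                steps: int = 50, lr: float = 0.1) -> float:
    """Gradient-based planning: only the action is updated; w stays frozen."""
    action = 0.0
    for _ in range(steps):
        grad_action, _grad_w = cost_gradients(state, action, w, target)
        action -= lr * grad_action
    return action


planned = plan_action(state=0.0, w=2.0, target=1.0)
```

In `plan_action` the gradient with respect to `w` is computed but discarded, so the world model could still have been trained purely with self-supervised objectives. Applying that gradient as well would be one simple form of the co-adaptation described above.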