How so? It's still a big old pile of vectors generated by SGD, just with a cost function that we can use to calculate stuff. But we may not understand what this cost function means, especially in terms of the model's native ontology. Sure, maybe it will have natural abstractions interpretable by circuits-style reasoning. But that's equally true of any current NN.
Looking at this, it is clear that this is a utility maximiser. And those are dangerous by default. Another worrying part is the reliance on "safety guardrails" in the cost function. But what kind of terms could make it safe? Nothing purely internal, at least not without crippling the AI's utility. And for a utility function that points to something in the real world, there are two issues.
Humans are very complex, and it seems tricky to point them out in a world model.
The AI’s world model is potentially a shifting inscrutable mess. How do we reliably point to anything in it?
In general, I’m a bit unsure about how much of an interpretability advantage we get from slicing the model up into chunks. If the pieces are trained separately, then we can reason about each part individually based on its training procedure. In the optimistic scenario, this means that the computation happening in the part of the system labeled “world model” is actually something humans would call world modelling. This is definitely helpful for interpretability. But the alternative possibility is that we get one or more mesa-optimizers, which seems less interpretable.
I for one am moderately optimistic that the world-model can actually remain “just” a world-model (and not a secret deceptive world-optimizer), and that the value function can actually remain “just” a value function (and not a secret deceptive world-optimizer), and so on, for reasons in my post Thoughts on safety in predictive learning—particularly the idea that the world-model data structure / algorithm can be relatively narrowly tailored to being a world-model, and the value function data structure / algorithm can be relatively narrowly tailored to being a value function, etc.
Since LeCun's architecture is, taken together, a kind of optimizer (I agree with Algon that it's probably a utility maximizer), the emergence of additional mesa-optimizers seems less likely.
We expect optimization to emerge because it’s a powerful algorithm for SGD to stumble on that outcompetes the alternatives. But if the system is already an optimizer, then where is that selection pressure coming from to make another one?
It's coming from the fact that every module wants to be an optimizer of something in order to do its job.
Interesting, I wonder how the dynamics of a multiple mesa-optimizer system would play out (if it’s possible).
I think it’s easier to interpret than model-free RL (provided the line between model and actor is maintained through training, which is an assumption LeCun makes but doesn’t defend) because it’s doing explicit model-based planning, so there’s a clear causal explanation for why the agent took a particular action—because it predicted that it would lead to a specific low-cost world state. It still might be hard to decode the world state representation, but much easier than decoding what the agent is trying to do from the activations of a policy network.
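To make the "clear causal explanation" point concrete, here is a minimal sketch of explicit model-based action selection. All names (world_model, cost, choose_action) and the toy dynamics are hypothetical illustrations, not LeCun's actual architecture:

```python
# Toy sketch of explicit model-based planning (hypothetical names,
# not LeCun's implementation): predict each action's outcome with a
# world model, score it with a cost function, pick the argmin.

def world_model(state, action):
    # Toy deterministic dynamics: the action shifts the state.
    return state + action

def cost(state):
    # Toy cost: distance from a target state of 10.
    return abs(state - 10)

def choose_action(state, actions):
    # For each candidate action, predict the resulting world state,
    # then choose the action whose predicted state has lowest cost.
    # The chosen action comes with a causal explanation attached:
    # "it was predicted to lead to this low-cost state".
    predictions = {a: world_model(state, a) for a in actions}
    best = min(actions, key=lambda a: cost(predictions[a]))
    return best, predictions[best]

action, predicted = choose_action(state=7, actions=[-1, 0, 1, 2])
print(action, predicted)  # → 2 9: the planner moves toward the target
```

In a policy network the analogue of `predictions[best]` is implicit in the activations; here it is an inspectable intermediate value, which is the interpretability advantage being claimed.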
Not obvious to me that it will be a utility maximizer, but definitely dangerous by default. In a world where this architecture is dominant, we probably have to give up on getting intent alignment and fall back to safety guarantees like “well it behaved well in all of our adversarial simulations, and we have a powerful supervising process that will turn it off if it the plans look fishy”. Not my ideal world, but an important world to consider.
It decides its actions via minimising a cost function. How’s that not isomorphic to a utility maximiser?
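The isomorphism claimed here is just a sign flip: minimising a cost c(s) selects the same action as maximising the utility u(s) = -c(s). A two-line check:

```python
# Minimising cost and maximising utility u = -cost pick the same
# action, so a cost-minimiser is isomorphic to a utility maximiser.

costs = {"a": 3.0, "b": 1.0, "c": 2.0}

min_cost_action = min(costs, key=lambda a: costs[a])
max_util_action = max(costs, key=lambda a: -costs[a])

print(min_cost_action, max_util_action)  # → b b
```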
The configurator dynamically modulates the cost function, so the agent is not guaranteed to have the same cost function over time, hence it can be Dutch booked / violate the VNM axioms.
Good point. But at any given time, it's doing EV calculations to decide its actions. Even if it modulates itself by picking amongst a variety of utility functions, its actions are still influenced by explicit EV calcs. If I understand TurnTrout's work correctly, that alone is enough to make the agent power-seeking. Which is dangerous by default.
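The optionality intuition behind the power-seeking claim can be sketched numerically. This is loosely inspired by TurnTrout's (Alex Turner's) results on optimal policies tending to seek power, not a reproduction of them; the two-action setup and outcome labels are invented:

```python
# Sketch of the power-seeking intuition (illustrative, not a
# reproduction of Turner et al.'s formal results): under most
# randomly sampled utility functions, an EV-maximiser prefers
# the action that keeps more outcomes reachable.

import random

random.seed(0)

def preferred_action(utility):
    # "narrow" commits to a single outcome n1; "broad" keeps three
    # outcomes reachable, from which the agent later picks the best.
    # The value of each action is the best attainable utility.
    narrow_value = utility["n1"]
    broad_value = max(utility[o] for o in ("b1", "b2", "b3"))
    return "broad" if broad_value > narrow_value else "narrow"

samples = 10_000
broad_wins = sum(
    preferred_action({o: random.random() for o in ("n1", "b1", "b2", "b3")})
    == "broad"
    for _ in range(samples)
)

print(broad_wins / samples)  # ≈ 0.75: most sampled utilities favour optionality
```

With i.i.d. uniform utilities the exact probability is 3/4 (the chance that the best of three draws beats one draw), which is why "keep options open" falls out of explicit EV calculations for most goals.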