I think it’s easier to interpret than model-free RL (provided the line between model and actor is maintained through training, which is an assumption LeCun makes but doesn’t defend), because it’s doing explicit model-based planning. That gives a clear causal explanation for why the agent took a particular action: it predicted that the action would lead to a specific low-cost world state. Decoding the world-state representation might still be hard, but far easier than decoding what the agent is trying to do from the activations of a policy network.
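For concreteness, here is a minimal sketch of that planning loop (the `world_model`, `cost`, and candidate sequences are all hypothetical stand-ins, not LeCun’s actual modules). The interpretability claim is that the chosen sequence comes with a legible justification: the predicted low-cost end state.

```python
def plan(world_model, cost, state, candidate_action_sequences):
    """Pick the action sequence whose predicted end state has the lowest cost."""
    best_seq, best_cost = None, float("inf")
    for seq in candidate_action_sequences:
        s = state
        for a in seq:                  # roll out the learned world model
            s = world_model(s, a)      # predicted next (latent) state
        c = cost(s)                    # scalar cost of the predicted end state
        if c < best_cost:
            best_seq, best_cost = seq, c
    # The "causal explanation" for the agent's behaviour: best_seq was chosen
    # because the model predicted it leads to a state with cost best_cost.
    return best_seq, best_cost
```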
Not obvious to me that it will be a utility maximizer, but definitely dangerous by default. In a world where this architecture is dominant, we probably have to give up on getting intent alignment and fall back to safety guarantees like “well, it behaved well in all of our adversarial simulations, and we have a powerful supervising process that will turn it off if the plans look fishy”. Not my ideal world, but an important world to consider.
It decides its actions by minimising a cost function. How’s that not isomorphic to a utility maximiser?
The configurator dynamically modulates the cost function, so the agent is not guaranteed to have the same cost function over time, hence can be Dutch booked / violate the VNM axioms.
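A toy money pump under that assumption (all numbers made up): if the configurator flips the cost function between steps, the agent pays a fee at each step to trade back and forth and ends up exactly where it started, strictly poorer.

```python
# Hypothetical cost functions over which item the agent holds.
cost_t0 = {"A": 1.0, "B": 2.0}   # at t=0 the agent prefers holding A
cost_t1 = {"A": 2.0, "B": 1.0}   # configurator flips preferences at t=1

fee = 0.1
holding, money = "B", 0.0

# t=0: cost_t0 says A is better, so the agent pays the fee to swap B -> A.
if cost_t0["A"] < cost_t0[holding]:
    holding, money = "A", money - fee

# t=1: cost_t1 says B is better, so it pays again to swap A -> B.
if cost_t1["B"] < cost_t1[holding]:
    holding, money = "B", money - fee

print(holding, money)  # "B", -0.2: same holding as the start, guaranteed loss
```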
Good point. But at any given time, it’s doing EV calculations to decide its actions. Even if it modulates itself by picking amongst a variety of utility functions, its actions are still driven by explicit EV calcs. If I understand TurnTrout’s work correctly, that alone is enough to make the agent power-seeking, which is dangerous by default.
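A toy illustration of that intuition (an assumed setup loosely in the spirit of Turner et al.’s power-seeking results, not their formal construction): for most randomly sampled utility functions, a plain EV calculation favours the state that keeps more options open.

```python
import random

random.seed(0)
narrow_outcomes = ["o1"]                # door with 1 reachable outcome
wide_outcomes = ["o2", "o3", "o4"]      # door with 3 reachable outcomes

wide_preferred, trials = 0, 10_000
for _ in range(trials):
    # Sample a random utility function over all terminal outcomes.
    u = {o: random.random() for o in narrow_outcomes + wide_outcomes}
    # The EV calc: each door is worth the best outcome reachable behind it.
    v_narrow = max(u[o] for o in narrow_outcomes)
    v_wide = max(u[o] for o in wide_outcomes)
    wide_preferred += v_wide > v_narrow

print(wide_preferred / trials)  # ~0.75: most sampled utility functions favour
                                # the door that preserves more options
```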