The technique is not used precisely because people know it would fail. That such a technique would fail is my whole point! It’s meant to be an example that illustrates the danger of hooking up powerful optimizers to models that are not robust to distributional shift.
And it’s a point directed at a strawman, in some sense. You could say I am nitpicking, but diffusion models, and more specifically diffusion planners, simply do not fail in this particular way. You can expend arbitrary optimization power on inference of a trained image diffusion model, and that doesn’t do anything like maximize for faciness. You can spend arbitrary optimization power training the generator for a diffusion model, and that doesn’t do anything like maximize for faciness; it just generates increasingly realistic images. You can spend arbitrary optimization power training the discriminator for a diffusion model, and that doesn’t do anything like maximize for faciness; it just increases the ability to hit specific targets.
Diffusion planners optimize something like P(plan_trajectory | discriminator_target), where the plan_trajectory generative model is a denoising model (trained from rollouts of a world model) and a direct analogy of the image diffusion denoiser, and the discriminator_target model is the direct analogy of an image recognition model (except the ‘images’ it sees are trajectories).
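Under some simplifying assumptions, that objective can be sketched as classifier-guided sampling over trajectories: a denoiser pulls noisy trajectories toward the learned data manifold while the discriminator's gradient steers them toward the target. Everything below (the straight-line "manifold", the quadratic goal discriminator, the constants) is a toy stand-in for illustration, not any particular system's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 2          # trajectory length, state dimension
STEPS = 50            # number of denoising steps
GUIDE_SCALE = 0.1     # strength of discriminator guidance

def denoise_step(traj, t):
    """Toy stand-in for a learned denoiser: nudges the noisy trajectory
    toward a smooth (here: straight-line) trajectory 'manifold'."""
    smooth = np.linspace(traj[0], traj[-1], T)
    return traj + (smooth - traj) / (STEPS - t + 1)

def discriminator_grad(traj, target):
    """Gradient of log P(target | traj) for a toy quadratic discriminator
    that scores how close the trajectory's endpoint is to the goal."""
    grad = np.zeros_like(traj)
    grad[-1] = target - traj[-1]     # pull the final state toward the goal
    return grad

def plan(target):
    """Sample from roughly P(trajectory | target): denoise pure noise
    while following the discriminator's gradient (classifier guidance)."""
    traj = rng.normal(size=(T, D))
    for t in range(STEPS):
        traj = denoise_step(traj, t)
        traj = traj + GUIDE_SCALE * discriminator_grad(traj, target)
    return traj

goal = np.array([1.0, 1.0])
trajectory = plan(goal)   # a trajectory whose endpoint lands near the goal
```

Note that piling on more denoising steps here just tightens the sample toward the learned trajectory distribution; it does not Goodhart the discriminator, which is the point of the faciness disanalogy.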
The correct analogy for a dangerous diffusion planner is not failure by adversarially optimizing for faciness; the analogy is using an underpowered discriminator that can’t recognize or distinguish good and bad trajectories from a superpowered generator, so the analogy is generating an image of a corpse when you asked for an image of a happy human. The diffusion planner equivalent of generating a grotesque, unrealistic face is simply generating an unrealistic (and non-dangerous) failure of a plan. My main argument here is that the ‘faciness’ analogy is simply the wrong analogy (for diffusion planning, and anything similar).
How much do you trust Stable Diffusion to give you a good image if the description you give it is sufficiently out of distribution? It would make a “good attempt” by some definition, but it’s quite unlikely that it would give you what you actually wanted. That’s the very reason there’s a whole field of prompt engineering to get these models to do what you want!
Sure, but prompt engineering is mostly irrelevant, as the discriminator output for AGI planning is probably just a scalar utility function, not a text description (although the latter may also be useful for more narrow AI). And once again, the dangerous out-of-distribution failure analogy here is not an image of maximal ‘faciness’ for a human-face goal; it’s an ultra-realistic image of a skeleton.
If all you’re doing is learning the distribution P(action humans would take | world state), then you’re fundamentally limited in the actions you can take by actions humans would actually think of.
Diffusion planners optimize P(plan_trajectory | discriminator_target), where the plan_trajectory model is trained to denoise trajectories, which can come from human data (behavioral cloning) or rollouts from a learned world model (more relevant for later AGI). Using the latter, the agent is obviously not limited to actions humans think of.
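The two data sources plug into the same training interface, which is what removes the human ceiling. In the sketch below, the data generators and the trivial mean-trajectory "denoiser" are illustrative assumptions, not a real training pipeline; an actual planner would train a network with a noise-prediction (score-matching) loss.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 2   # trajectory length, state dimension

def human_demos(n):
    """Behavioral-cloning data: trajectories a human actually produced
    (here, noisy straight-line paths from the origin to (1, 1))."""
    base = np.linspace(np.zeros(D), np.ones(D), T)
    return base[None] + 0.05 * rng.normal(size=(n, T, D))

def world_model_rollouts(n):
    """Rollouts from a learned world model: trajectories no human need
    ever have taken (here, random walks)."""
    return np.cumsum(0.3 * rng.normal(size=(n, T, D)), axis=1)

def fit_denoiser(trajs):
    """Minimal stand-in for denoiser training: learn the mean trajectory
    of the training set and pull noisy inputs toward it."""
    mean_traj = trajs.mean(axis=0)
    def denoise(noisy, strength=0.5):
        return noisy + strength * (mean_traj - noisy)
    return denoise

# Either data source trains the same kind of model.
denoise_bc = fit_denoiser(human_demos(256))
denoise_wm = fit_denoiser(world_model_rollouts(256))
```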
One way of trying to ameliorate this problem is to have a standard reinforcement learning agent
That is one way, but not the only way, and perhaps not the best. The more natural approach is for the learned world model to include a (naturally approximate) predictive model of the full planning agent, so it just predicts its own actions. This does cause the plan trajectories to become increasingly biased towards the discriminator target goals, but one should be able to just adjust for that, and it seems necessary anyway (you really don’t want to spend a lot of compute training on rollouts that lead to bad futures).
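One way to make that "adjust for it" step concrete is standard importance weighting: roll out under the (goal-biased) self-model of the planner, but track the likelihood ratio against an unbiased action distribution so the training targets can be reweighted. The Gaussian action models and constants below are illustrative assumptions, not any system's actual correction scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 1.0   # action noise scale, shared by both policies

def gaussian_loglik(x, mu):
    # normalization constants cancel in ratios, so only the quadratic term matters
    return -0.5 * float(np.sum(((x - mu) / SIGMA) ** 2))

def rollout_with_weight(goal, T=16, greediness=0.5):
    """Roll out using an approximate model of the planner's OWN
    (goal-biased) action distribution, tracking an importance weight
    that corrects back toward an unbiased (zero-mean) action model."""
    state = np.zeros(2)
    log_w = 0.0
    traj = [state.copy()]
    for _ in range(T - 1):
        mu_biased = greediness * (goal - state)          # goal-seeking drift
        action = mu_biased + SIGMA * rng.normal(size=2)  # sample the biased policy
        # weight = p_unbiased(action) / p_biased(action), accumulated in log space
        log_w += gaussian_loglik(action, np.zeros(2)) - gaussian_loglik(action, mu_biased)
        state = state + 0.1 * action
        traj.append(state.copy())
    return np.stack(traj), np.exp(log_w)

goal = np.array([1.0, 1.0])
traj, weight = rollout_with_weight(goal)
```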
The distributional shift that a general intelligence has to be robust to is likely much greater than the one that image generators need to be robust to.
Maybe? But I’m not aware of strong evidence or arguments for this; it seems to be mostly conjecture. The hard part of image diffusion models is not the discriminator, it’s the generator: we had superhuman discriminators long before we had image-denoising generators capable of generating realistic images. So this could be good news for alignment, in that it’s a bit of evidence that generating realistic plans is much harder than evaluating the utility of plan trajectories.
But to be clear, I do agree there is a distributional shift that occurs as you scale up the generative planning model, but the obvious solution is to just keep improving/retraining the discriminator in tandem. Assuming a static discriminator is a bit of a strawman. And if the argument is that it will be too late to modify once we have AGI, the solution to that is just simboxing (and also we can spend far more time generating plans and evaluating them than actually executing them).
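The tandem-retraining loop can be caricatured in a few lines. The classes and the `oversight_check` stand-in for slow, trusted review (human evaluation or simboxing) are hypothetical toys, but they show the key property: the evaluator is refit on the planner's current output distribution each round, and the planner only trains on vetted plans.

```python
import random

random.seed(0)

def oversight_check(plan):
    """Stand-in for slow, trusted evaluation (human review / simbox):
    a 'plan' (here just a number) is genuinely good iff it stays in bounds."""
    return abs(plan) < 0.5

class Generator:
    """Proposes plans; improves toward whatever gets approved."""
    def __init__(self):
        self.mean, self.spread = 3.0, 2.0   # starts out mostly unsafe
    def sample(self):
        return random.gauss(self.mean, self.spread)
    def update(self, approved):
        if approved:
            self.mean = sum(approved) / len(approved)

class Discriminator:
    """Cheap, fast evaluator that must track the generator's drift."""
    def __init__(self):
        self.bound = 10.0                   # initially far too permissive
    def approve(self, plan):
        return abs(plan) <= self.bound
    def recalibrate(self, plans):
        # refit against the trusted signal on the CURRENT plan distribution
        safe = [abs(p) for p in plans if oversight_check(p)]
        if safe:
            self.bound = max(safe)

gen, disc = Generator(), Discriminator()
for _ in range(20):
    plans = [gen.sample() for _ in range(128)]
    disc.recalibrate(plans)          # retrain the evaluator in tandem
    vetted = [p for p in plans if disc.approve(p) and oversight_check(p)]
    gen.update(vetted)               # train/execute only on vetted plans
```

By the end of the loop the discriminator's bound has tightened from its initially permissive value and the generator's proposals have drifted inside the safe region, despite neither component starting out trustworthy.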
Unlike any other problem we’ve tried to tackle in ML, learning the human action distribution is so complicated that it’s plausible an AI trained to do this ends up with weird strategies that involve deceiving humans. In fact, any model that learns the human action distribution should also be capable of learning the distribution of actions that would look human to human evaluators instead of actually being human.
Maybe? But in fact we don’t really observe that yet when training a foundation model via behavioral cloning (and reinforcement learning) in Minecraft.
And anyway, to the extent that’s a problem we can use simboxing.
Overall, I don’t see how the fact that GANs or diffusion models can generate realistic human faces is an argument that we should be less worried about alignment failures.
That depends on the extent to which you had already predicted that we’d have things like Stable Diffusion in 2022, without AGI.
The only way in which this concern would show up is if your argument was “ML can’t learn complicated functions”,
Many, many people thought this in 2007 when the Sequences were written; it was the default view (and generally true at the time).
but I don’t think this was ever the reason people have given for being worried about alignment failures: indeed, if ML couldn’t learn complicated functions, there would be no reason for worrying about ML at all.
EY around 2011 seemed to have standard pre-DL views on ANN-style ML techniques, and certainly doubted they would be capable of learning complex human values robustly enough to instill into AI as a safe utility function (this is well documented). He had little doubt that later AGI, after recursive self-improvement, would be able to learn complicated functions like human values, but by then, of course, it’s obviously too late.
The DL update in some sense is “superhuman pattern recognition is easier than we thought, and comes much earlier than AGI”, which mostly seems to be good news for alignment, because safety is mostly about recognizing bad planning trajectories.
I think we’re not communicating clearly with each other. If you have the time, I would be enthusiastic about discussing these potential disagreements in a higher-frequency format (such as IRC or Discord or whatever), but if you don’t have the time, I don’t think the format of LW comments is suitable for hashing out the kind of disagreement we’re having here. I’ll make an effort to respond regardless of your preferences about this, but it’s rather exhausting for me and I don’t expect it to be successful if done in this format.
That’s in part why we had this conversation in the format that we did instead of going back-and-forth on Twitter or LW, as I expected those modes of communication would be poorly suited to discussing this particular subject. I think the comments on Katja’s original post have proven me right about that.