I think you’re misunderstanding the reason I’m making this argument.
The reason you in fact don’t fix the discriminator and find the “maximally human-like face” by optimizing the input is that this gives you unrealistic images. The technique is not used precisely because people know it would fail. That such a technique would fail is my whole point! It’s meant to be an example that illustrates the danger of hooking up powerful optimizers to models that are not robust to distributional shift.
Yes, in fact GANs work well if you avoid doing this and stick to using them in distribution, but then all you have is a model that has learned the distribution of inputs you’ve given it. The same is true of diffusion models: they just give you a Langevin diffusion-like way of generating samples from a probability distribution. In the end, you can summarize all of these models as taking many (image, description) pairs and learning distributions such as P(image), P(image | description), and so on.
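To make that concrete, here is the standard maximum-likelihood framing (my notation: images x_i, descriptions c_i, model parameters theta; diffusion models optimize a variational bound on this likelihood rather than the exact quantity):

```latex
\hat{\theta} = \arg\max_{\theta} \sum_{i} \log p_{\theta}(x_i \mid c_i)
% dropping the conditioning on c_i gives the unconditional case p_{\theta}(x)
```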
This method, however, has several limitations if you try to generalize it to training a superhuman intelligence:
The stakes for training a superhuman intelligence are much higher, so your tolerance for mistakes is much lower. How much do you trust Stable Diffusion to give you a good image if the description you give it is sufficiently out of distribution? It would make a “good attempt” by some definition, but it’s quite unlikely that it would give you what you actually wanted. That’s precisely why there’s an entire field of prompt engineering to get these models to do what you want!
If we end up in such a situation with respect to the distribution P(action | world state), then I think we would be in a bad position. This is exacerbated by the second point in the list.
If all you’re doing is learning the distribution P(action humans would take | world state), then you’re fundamentally limited in the actions you can take by actions humans would actually think of. You might still be superhuman in the sense that you run much faster and use much less energy than humans do, but you won’t come up with plans that humans wouldn’t have come up with on their own, even if such plans would be very good; and you won’t be able to avoid mistakes that humans would actually make in practice. Therefore any model trained strictly in this way has capability limitations that it won’t be able to get past.
One way of trying to ameliorate this problem is to have a standard reinforcement learning agent which is trained with an objective that is not intended to be robust out of distribution, but then combine it with the distribution P such that P regularizes the actions taken by the model. Quantilizers are based on this idea, and unfortunately they also have problems: for instance, “build an expected utility maximizer” is an action that probably would get a relatively high score from P, because it’s indeed an action that a human could try.
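Since quantilizers come up here, a minimal sketch of the idea may help; the function names and the choice of q are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def quantilize(sample_action, utility, n_samples=1000, q=0.1, rng=None):
    """Approximate q-quantilizer: instead of argmaxing utility, sample an
    action from (an empirical estimate of) the top-q fraction of the base
    human action distribution P, ranked by utility.

    sample_action: draws one action from the base distribution P
    utility:       the non-robust objective being softly optimized
    q:             quantile cutoff; q=1 is pure imitation, q->0 approaches argmax
    """
    rng = rng or np.random.default_rng()
    actions = [sample_action() for _ in range(n_samples)]
    scores = np.array([utility(a) for a in actions])
    cutoff = np.quantile(scores, 1.0 - q)                  # utility threshold
    top = [a for a, s in zip(actions, scores) if s >= cutoff]
    return top[rng.integers(len(top))]                     # uniform over the top-q samples
```

The failure mode described above is visible in the sketch: an action like “build an expected utility maximizer” only needs non-negligible probability under P and a high utility score to survive the quantile cut.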
The distributional shift that a general intelligence has to be robust to is likely much greater than the one that image generators need to be robust to. So on top of the high stakes we have in this scenario per point (1), we also have a situation where we need to get the model to behave well even under relatively big distributional shifts. I think no model that exists right now is as robust to distributional shift as we would want an AGI to be, or indeed even as robust as we would want a self-driving car to be.
Unlike any other problem we’ve tried to tackle in ML, learning the human action distribution is so complicated that it’s plausible an AI trained to do this ends up with weird strategies that involve deceiving humans. In fact, any model that learns the human action distribution should also be capable of learning the distribution of actions that would look human to human evaluators instead of actually being human. If you already have the human action distribution P, you can actually take your model itself as a free parameter Q and maximize the expected score of the model Q under the distribution P: in other words, you explicitly optimize to deceive humans by learning the distribution humans would think is best rather than the distribution that is actually best.
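To spell out that last construction in notation (mine, not from any source): let score_P(a) be the rating that human evaluators, or the learned human model P standing in for them, assign to an action a. Then the deceptive optimum is

```latex
Q^\ast = \arg\max_{Q} \; \mathbb{E}_{a \sim Q}\left[\mathrm{score}_P(a)\right]
```

and Q* matches the distribution evaluators would rate best, not the distribution that is actually best.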
No such problems arise in a setting such as learning human faces, because there the deceptive strategy is much more complicated than actually solving the problem you’re being asked to solve, so there’s no reason for gradient descent to find it instead of just doing what you’re being asked to do.
Overall, I don’t see how the fact that GANs or diffusion models can generate realistic human faces is an argument that we should be less worried about alignment failures. The only way in which this concern would show up is if your argument was “ML can’t learn complicated functions”, but I don’t think this was ever the reason people have given for being worried about alignment failures: indeed, if ML couldn’t learn complicated functions, there would be no reason for worrying about ML at all.
The technique is not used precisely because people know it would fail. That such a technique would fail is my whole point! It’s meant to be an example that illustrates the danger of hooking up powerful optimizers to models that are not robust to distributional shift.
And it’s a point directed at a strawman, in some sense. You could say that I am nitpicking, but diffusion models, and more specifically diffusion planners, simply do not fail in this particular way. You can expend arbitrary optimization power on inference of a trained image diffusion model and that doesn’t do anything like maximize for faciness. You can spend arbitrary optimization power training the generator of a diffusion model and that doesn’t do anything like maximize for faciness; it just generates increasingly realistic images. You can spend arbitrary optimization power training the discriminator for a diffusion model and that doesn’t do anything like maximize for faciness; it just increases the ability to hit specific targets.
Diffusion planners optimize something like P(plan_trajectory | discriminator_target), where the plan_trajectory generative model is a denoising model (trained from rollouts of a world model) and a direct analogy of the image diffusion denoiser, and the discriminator_target model is a direct analogy of an image recognition model (except the ‘images’ it sees are trajectories).
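A minimal sketch of that sampling loop, in the style of classifier-guided trajectory diffusion; the denoiser/discriminator interfaces and the guidance_scale knob are assumptions for illustration, not any specific planner’s API:

```python
import torch

def plan(denoiser, discriminator, target, horizon, state_action_dim,
         n_steps=50, guidance_scale=1.0):
    """Sample a plan_trajectory approximately from P(trajectory | target):
    start from noise, alternate denoising steps with gradient nudges toward
    trajectories the discriminator scores highly for the given target.
    """
    traj = torch.randn(horizon, state_action_dim)       # start from pure noise
    for t in reversed(range(n_steps)):
        traj = denoiser(traj, t)                        # one denoising step
        traj = traj.detach().requires_grad_(True)
        score = discriminator(traj, target)             # scalar: does this trajectory hit the target?
        grad = torch.autograd.grad(score, traj)[0]
        traj = (traj + guidance_scale * grad).detach()  # guided step within the realistic manifold
    return traj
```

The property being leaned on here is that the denoiser keeps samples on the manifold of realistic trajectories, so the discriminator gradient can only steer within that manifold rather than blow samples up into maximal faciness.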
The correct analogy for a dangerous diffusion planner is not failure by adversarially optimizing for faciness; the analogy is using an underpowered discriminator that can’t distinguish good from bad trajectories coming out of a superpowered generator, so the right analogy is generating an image of a corpse when you asked for an image of a happy human. The diffusion planner equivalent of generating a grotesque unrealistic face is simply generating an unrealistic (and non-dangerous) failure of a plan. My main argument here is that the ‘faciness’ analogy is simply the wrong analogy (for diffusion planning, and anything similar to it).
How much do you trust Stable Diffusion to give you a good image if the description you give it is sufficiently out of distribution? It would make a “good attempt” by some definition, but it’s quite unlikely that it would give you what you actually wanted. That’s precisely why there’s an entire field of prompt engineering to get these models to do what you want!
Sure, but prompt engineering is mostly irrelevant here, as the discriminator output for AGI planning is probably just a scalar utility, not a text description (although the latter may also be useful for more narrow AI). And once again, the dangerous out-of-distribution failure analogy here is not an image of maximal ‘faciness’ for a human-face goal; it’s an ultra-realistic image of a skeleton.
If all you’re doing is learning the distribution P(action humans would take | world state), then you’re fundamentally limited in the actions you can take by actions humans would actually think of.
Diffusion planners optimize P(plan_trajectory | discriminator_target), where the plan_trajectory model is trained to denoise trajectories, which can come from human data (behavioral cloning) or from rollouts of a learned world model (more relevant for later AGI). Using the latter, the agent is obviously not limited to actions humans would think of.
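Schematically, both sources can feed the same denoiser training set (the world_model and policy interfaces here are hypothetical):

```python
def collect_trajectories(human_demos, world_model, policy, n_rollouts, horizon):
    """Denoiser training data from both sources described above: human
    demonstrations (behavioral cloning) plus rollouts of a learned world
    model, which are not limited to action sequences humans would produce.
    """
    data = list(human_demos)                     # source 1: human data
    for _ in range(n_rollouts):                  # source 2: world-model rollouts
        state = world_model.reset()
        traj = []
        for _ in range(horizon):
            action = policy(state)               # any policy, incl. exploratory
            state = world_model.step(state, action)
            traj.append((state, action))
        data.append(traj)
    return data
```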
One way of trying to ameliorate this problem is to have a standard reinforcement learning agent
That is one way, but not the only way, and perhaps not the best. The more natural approach is for the learned world model to include a (naturally approximate) predictive model of the full planning agent, so that it just predicts its own actions. This does cause the plan trajectories to become increasingly biased towards the discriminator target goals, but one should be able to adjust for that, and it seems necessary anyway (you really don’t want to spend a lot of compute training on rollouts that lead to bad futures).
The distributional shift that a general intelligence has to be robust to is likely much greater than the one that image generators need to be robust to.
Maybe? But I’m not aware of strong evidence or arguments for this; it seems to be mostly conjecture. The hard part of image diffusion models is not the discriminator, it’s the generator: we had superhuman discriminators long before we had denoising generators capable of producing realistic images. So this could be good news for alignment, in that it’s a bit of evidence that generating realistic plans is much harder than evaluating the utility of plan trajectories.
But to be clear, I do agree there is a distributional shift that occurs as you scale up the generative planning model, but the obvious solution is to just keep improving/retraining the discriminator in tandem. Assuming a static discriminator is a bit of a strawman. And if the argument is that it will be too late to modify the discriminator once we have AGI, the solution to that is just simboxing (and also we can spend far more time generating plans and evaluating them than actually executing them).
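Purely schematically, the “in tandem” proposal is an outer loop like this, where plans are judged in the simbox and never executed during evaluation (all callables are placeholders):

```python
def scale_up(generator, discriminator, train_generator,
             retrain_discriminator, evaluate_in_simbox,
             n_stages, n_plans=10_000):
    """Keep the discriminator in step with the generator: after each
    capability increase, sample fresh plans, evaluate them offline in the
    simbox, and retrain the discriminator before scaling further.
    """
    for _ in range(n_stages):
        generator = train_generator(generator, discriminator)
        plans = [generator.sample() for _ in range(n_plans)]
        labels = evaluate_in_simbox(plans)       # judged, never executed
        discriminator = retrain_discriminator(discriminator, plans, labels)
    return generator, discriminator
```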
Unlike any other problem we’ve tried to tackle in ML, learning the human action distribution is so complicated that it’s plausible an AI trained to do this ends up with weird strategies that involve deceiving humans. In fact, any model that learns the human action distribution should also be capable of learning the distribution of actions that would look human to human evaluators instead of actually being human.
Maybe? But in fact we don’t really observe that yet when training a foundation model via behavioral cloning (and reinforcement learning) in Minecraft.
And anyway, to the extent that’s a problem we can use simboxing.
Overall, I don’t see how the fact that GANs or diffusion models can generate realistic human faces is an argument that we should be less worried about alignment failures.
That depends on the extent to which you had already predicted that we’d have things like Stable Diffusion in 2022, without AGI.
The only way in which this concern would show up is if your argument was “ML can’t learn complicated functions”,
Many, many people thought this in 2007 when the Sequences were written; it was the default view (and generally true at the time).
but I don’t think this was ever the reason people have given for being worried about alignment failures: indeed, if ML couldn’t learn complicated functions, there would be no reason for worrying about ML at all.
EY around 2011 seemed to have standard pre-DL views on ANN-style ML techniques, and certainly doubted they would be capable of learning complex human values robustly enough to instill into an AI as a safe utility function (this is well documented). He had little doubt that later AGI, after recursive self-improvement, would be able to learn complicated functions like human values, but by then of course it is obviously too late.
The DL update in some sense is “superhuman pattern recognition is easier than we thought, and comes much earlier than AGI”, which mostly seems to be good news for alignment, because safety is mostly about recognizing bad planning trajectories.
I think we’re not communicating clearly with each other. If you have the time, I would be enthusiastic about discussing these potential disagreements in a higher-frequency format (such as IRC or Discord or whatever), but if you don’t have the time, I don’t think the format of LW comments is suitable for hashing out the kind of disagreement we’re having here. I’ll make an effort to respond regardless of your preferences about this, but it’s rather exhausting for me and I don’t expect it to be successful if done in this format.
That’s in part why we had this conversation in the format that we did instead of going back-and-forth on Twitter or LW, as I expected those modes of communication would be poorly suited to discussing this particular subject. I think the comments on Katja’s original post have proven me right about that.