I was happy to see the progression in what David Silver is saying re what goals AGIs should have:
David Silver, April 10, 2025 (from 35:33 of the DeepMind podcast episode Is Human Data Enough? With David Silver):
David Silver: And so what we need is really a way to build a system which can adapt and which can say, well, which one of these is really the important thing to optimize in this situation. And so another way to say that is, wouldn’t it be great if we could have systems where, you know, a human maybe specifies what they want, but that gets translated into a set of different numbers that the system can then optimize for itself completely autonomously.
Hannah Fry: So, okay, an example then: let’s say I said, okay, I want to be healthier this year. And that’s kind of a bit nebulous, a bit fuzzy. But what you’re saying here is that that can be translated into a series of metrics like resting heart rate or BMI or whatever it might be. And a combination of those metrics could then be used as a reward for reinforcement learning. Have I understood that correctly?
Silver: Absolutely correctly.
Fry: Are we talking about one metric, though? Are we talking about a combination here?
Silver: The general idea would be that you’ve got one thing which the human wants, like to optimize my health. And then the system can learn for itself which rewards help you to be healthier. And so that can be like a combination of numbers that adapts over time. So it could be that it starts off saying, okay, well, you know, right now it’s your resting heart rate that really matters. And then later you might get some feedback saying, hang on, you know, I really don’t just care about that, I care about my anxiety level or something. And then it includes that in the mixture. And based on feedback it could actually adapt. So one way to say this is that a very small amount of human data can allow the system to generate goals for itself that enable a vast amount of learning from experience.
Fry: Because this is where the real questions of alignment come in, right? I mean, if you said, for instance, let’s do a reinforcement learning algorithm that just minimizes my resting heart rate. I mean, quite quickly, zero is a good minimization strategy, which would achieve its objective, just maybe not quite in the way that you wanted it to. I mean, obviously you really want to avoid that kind of scenario. So how do you have confidence that the metrics that you’re choosing aren’t creating additional problems?
Silver: One way you can do this is to leverage the same answer, which has been so effective so far elsewhere in AI, which is that at that level, you can make use of some human input. If it’s a human goal that we’re optimizing, then we probably at that level need to measure that, you know, and say, well, the human gives feedback to say, actually, I’m starting to feel uncomfortable. And in fact, while I don’t want to claim that we have the answers, and I think there’s an enormous amount of research to get this right and make sure that this kind of thing is safe, it could actually help in certain ways in terms of this kind of safety and adaptation. There’s this famous example of paving over the whole world with paperclips when a system’s been asked to make as many paperclips as possible. If you have a system whose overall goal is really to, you know, support human well-being, and it gets that feedback from humans and understands their distress signals and their happiness signals and so forth, then the moment it starts to, you know, create too many paperclips and starts to cause people distress, it would adapt that combination and choose a different combination, and start to optimize for something which isn’t going to pave over the world with paperclips. We’re not there yet, but I think there are some versions of this which could actually end up not only addressing some of the alignment issues that have been faced by previous approaches to, you know, goal-focused systems, but maybe even, you know, be more adaptive and therefore safer than what we have today.
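To make the mechanism Silver is gesturing at a bit more concrete, here is a minimal sketch in Python (my own illustration, not anything from the episode) of a reward built as an adaptive weighted combination of health metrics, where human feedback re-weights the mixture or adds a new metric. All names here (AdaptiveHealthReward, the metric keys, the update rule) are hypothetical.

```python
# Hypothetical sketch: a human goal ("be healthier") translated into an
# adaptive, weighted combination of measurable metrics that serves as the
# scalar reward for a reinforcement learner. Names and the feedback rule
# are illustrative only, not from the podcast.

class AdaptiveHealthReward:
    def __init__(self, weights):
        # weights: dict mapping metric name -> weight in the mixture.
        # For simplicity every metric here is "lower is better".
        self.weights = dict(weights)

    def reward(self, metrics):
        """Scalar reward = negated weighted sum of the metrics we want to reduce."""
        return sum(-w * metrics[name]
                   for name, w in self.weights.items() if name in metrics)

    def incorporate_feedback(self, feedback):
        """Adapt the mixture when the human signals what they actually care about.

        feedback: dict mapping metric name -> relative importance in [0, 1].
        Flagged metrics are pulled toward that importance; a metric not yet
        in the mixture (e.g. "anxiety") gets added.
        """
        for name, importance in feedback.items():
            old = self.weights.get(name, 0.0)
            self.weights[name] = 0.5 * old + 0.5 * importance
        # Renormalize so no single metric can dominate through runaway weight.
        total = sum(self.weights.values()) or 1.0
        self.weights = {k: v / total for k, v in self.weights.items()}


if __name__ == "__main__":
    # Start off caring only about resting heart rate.
    r = AdaptiveHealthReward({"resting_heart_rate": 1.0})
    print(r.reward({"resting_heart_rate": 60, "anxiety": 7}))   # -60.0

    # Human feedback: "I don't just care about that, I care about my anxiety."
    r.incorporate_feedback({"anxiety": 0.5})
    print(r.weights)                                             # mixture now includes anxiety
    print(r.reward({"resting_heart_rate": 60, "anxiety": 7}))    # weighted combination
```

Note that the naive weighted sum on its own still has exactly the failure mode Fry raises (it keeps rewarding a lower resting heart rate all the way down to zero); in Silver’s picture it is the ongoing feedback channel, re-weighting or adding metrics when the human signals distress, that is supposed to catch that.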