run a program that prints out what our proposed FAI would have done given some situations, show that to our sample people, and check if they actually endorse the proposed course of action.
So, suppose we do this, and we conclude that our FAI is in fact capable of reliably proposing courses of action that, in general terms, people endorse.
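For concreteness, when I say “suppose we do this” I am picturing something like the toy sketch below. Everything in it (propose_action, Evaluator, endorsement_rate) is a hypothetical stand-in, not anything that actually exists:

```python
# Illustrative sketch only: propose_action, Evaluator, and endorsement_rate
# are hypothetical stand-ins, not any existing FAI interface.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evaluator:
    """One sampled person who can say whether they endorse a proposed action."""
    name: str
    endorses: Callable[[str, str], bool]  # (situation, proposed_action) -> bool


def endorsement_rate(propose_action: Callable[[str], str],
                     situations: List[str],
                     panel: List[Evaluator]) -> float:
    """Show what the proposed FAI *would have done* in each situation to the
    panel, and count how often the proposal is actually endorsed."""
    votes = total = 0
    for situation in situations:
        action = propose_action(situation)  # dry run: described, never executed
        for person in panel:
            votes += person.endorses(situation, action)
            total += 1
    return votes / total if total else 0.0
```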
It seems clear to me that’s not enough to show that it will not fuck things up when it comes time to actually implement changes in the real world. Do you disagree? Because back at the beginning of this conversation, it sounded like you were claiming you had in mind a process that was guaranteed not to fuck up, which is what I was skeptical about.
There’s another related proposal [...] we’ll use some evidence about humans to rate the probability that a given utility function is Friendly
Well, I certainly expect that to work better than not using evidence. Beyond that, I’m really not sure what to say about it. Here again… suppose this procedure works wonderfully, and as a consequence of climbing that hill we end up with a consistent set of VNM-rational utility functions that humans reliably endorse when they read about them.
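Concretely, the proposal as I understand it amounts to something like the toy loop below, where friendliness_score and neighbours are hypothetical stand-ins for “rate this candidate utility function against evidence about humans” and “generate nearby candidates”:

```python
# Illustrative sketch, not a real procedure: friendliness_score and
# neighbours are hypothetical stand-ins for an evidence-based Friendliness
# rating and a generator of nearby candidate utility functions.
from typing import Callable, List


def hill_climb(start: str,
               friendliness_score: Callable[[str], float],
               neighbours: Callable[[str], List[str]],
               max_steps: int = 1000) -> str:
    """Greedy hill-climbing over candidate utility functions, scored only by
    evidence about what humans appear to endorse."""
    current = start
    for _ in range(max_steps):
        options = neighbours(current)
        if not options:
            break
        best = max(options, key=friendliness_score)
        if friendliness_score(best) <= friendliness_score(current):
            break  # local optimum: the best this evidence can give us
        current = best
    return current
```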
It seems clear to me that’s not enough to show that it will not fuck things up when it comes time to actually implement changes in the real world. Do you disagree?
Now you might reply “Well it’s the best we can do!” and I might agree. As I said earlier, we simply have to accept that we might get it wrong, and do it anyway, because the probability of disaster if we don’t do it is even higher. But let’s not pretend there’s no chance of failure.
yes, the population has broken into political factions
I’m not sure I would describe those subgroups as political factions, necessarily… they’re just people expressing opinions at this stage. But sure, I could imagine analogous political factions.
the whole point of CEV was to employ a fairly simple population-ethical heuristic: “where our wishes cohere rather than interfere”. [..] You humans couldn’t actually agree on much more.
Well, now, this is a different issue. I actually agree with you here, but I was assuming for the sake of argument that the CEV paradigm actually works, and gets a real, worthwhile converged result from G. That is, I’m assuming for the sake of comity that G actually would, if they were “more the people they wished to be” and so on and so forth in all the poetic language of the CEV paper, agree on V, and that our value-extractor somehow figures that out because it’s really well-designed.
My point was that it doesn’t follow from that that G as they actually are will agree on V.
(By the way, the whole point of reflective equilibrium is to wash out really stupid ideas like “everyone suffers as I’ve suffered”, which has never actually done anything for anyone.)
Sure, I agree—both that that’s the point of RE, and that ESAIS is a really stupid (though popular) idea.
But reflective equilibrium is a method with an endpoint we approach asymptotically. The degree of reflective equilibrium humans can reliably achieve after being put in a quiet, air-conditioned room for twenty minutes, fed nutritious food and played soothing music for that time, and then asked questions is less than what we could achieve after ten years, or two hundred.
At some point, we have to define a target of how much reflective equilibrium we expect from our input, and from our evaluators. The further we shift our target away from where we are right now, the more really stupid ideas we will wash out, and the less likely we are to endorse the result. The further we shift it towards where we are, the more stupid ideas we keep, and the more likely we are to endorse the result.
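To make that tradeoff concrete, here is a toy illustration. The two curves are invented; the only thing I am claiming is their directions:

```python
# Toy numbers only: the two functions below are invented monotone stand-ins
# for the tradeoff described above, not measurements of anything real.
import math


def stupid_ideas_washed_out(distance: float) -> float:
    """Goes up as the reflective-equilibrium target moves further from us."""
    return 1.0 - math.exp(-distance)


def chance_we_endorse_result(distance: float) -> float:
    """Goes down as the target moves further from where we are right now."""
    return math.exp(-0.5 * distance)


for d in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(f"target distance {d:3.1f}: "
          f"washed out {stupid_ideas_washed_out(d):.2f}, "
          f"endorse {chance_we_endorse_result(d):.2f}")
```

The shapes are made up; the point is only that the two quantities pull in opposite directions, so wherever we set the target is a real design decision.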
I would certainly hope that a real FAI would understand about how people prefer gradual transitions and not completely overthrow everything all at once to any degree greater than strictly necessary.
I feel like we’re just talking past each other at this point, actually. I’m not talking about how quickly the FAI optimizes the world, I’m talking about whether we are likely to endorse the result of extracting our actual values.

(sigh) Yeah, OK. Tapping out now.
My point was that it doesn’t follow from that that G as they actually are will agree on V.
I’m talking about whether we are likely to endorse the result of extracting our actual values.
OOOOOOOOOOOOOOOOOOOOH. Ah. Ok. That is actually an issue, yes! Sorry I didn’t get what you meant before!
My answer is: that is an open problem, in the sense that we kind of need to know much more about neuro-ethics to answer it. It’s certainly easy to imagine scenarios in which, for instance, the FAI proposes to make all humans total moral exemplars, and as a result all the real humans who secretly like being sinful (even though they don’t endorse that liking) reject the deal entirely.
Yes, we have several different motivational systems, and the field of machine ethics tends to sweep this under the rug by referring to everything as “human values”, simply because the machine-ethics folks tend to contrast humans with paper-clippers to make a point about why their expertise is necessary.
This is exactly the kind of consideration that needs to happen before we can get anywhere. You are correct that if FAI designers want their proposals to be accepted by the public (or even by the general body of the educated elite), they need to cater not only to meta-level moral wishes but to the actual desires and affections real people feel today. I would certainly argue this is an important component of Friendliness design.
At some point, we have to define a target of how much reflective equilibrium we expect from our input, and from our evaluators. The further we shift our target away from where we are right now, the more really stupid ideas we will wash out, and the less likely we are to endorse the result. The further we shift it towards where we are, the more stupid ideas we keep, and the more likely we are to endorse the result.
This assumes that people are unlikely to endorse smart ideas. I personally disagree: many ideas are difficult to locate in idea-space, but easy to evaluate. Life extension, for example, or marriage for romance.
Because back at the beginning of this conversation, it sounded like you were claiming you had in mind a process that was guaranteed not to fuck up, which is what I was skeptical about.
No, I have not solved AI Friendliness all on my lonesome. That would be a ridiculous claim, a crackpot sort of claim. I just have a bunch of research notes that, even with their best possible outcome, leave lots of open questions and remaining issues.
Now you might reply “Well it’s the best we can do!” and I might agree. As I said earlier, we simply have to accept that we might get it wrong, and do it anyway, because the probability of disaster if we don’t do it is even higher. But let’s not pretend there’s no chance of failure.
Certainly there’s a chance of failure. I just think there’s a lot we can and should do to reduce that chance. The potential rewards are simply too great not to try.